Chapter 11: Replication and Consistency
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Dive.
Our mission here is simple.
We take the most complex blueprints of modern technology, distill the essential engineering and give you the competitive edge of being thoroughly informed.
Today we are focusing on what I think is really the bedrock of distributed data systems,
replication and consistency models.
For anyone trying to build or operate a scalable database,
these concepts, how we make sure multiple copies of data are reliable, are the crucial final puzzle piece before you can truly grasp advanced topics like consensus algorithms or atomic commitment.
Okay, let's unpack this a bit.
We're moving beyond just thinking about a single node crashing.
We're looking at how a system has to behave when your data is spread out across, I don't know, dozens or even thousands of machines.
Exactly.
So our goal today is to really understand the contract between the system and the user, what engineers call visibility semantics.
And this contract, it determines exactly how these multiple constantly changing copies of data behave when they're being read and written to across a whole distributed network.
And the fundamental reason for all this, for this entire deep dive, is to achieve fault tolerance.
A system has to, I mean must, continue operating correctly and reliably without any data corruption.
Even when its components inevitably fail, it's the core guarantee of resilience.
And really the only mathematical way to achieve that kind of fault tolerance is through redundancy.
You have to eliminate single points of failure and you do that by keeping multiple copies of the data available.
And we see this redundancy handled in, let's say, two primary ways in distributed systems.
First, there's the explicit reconfiguration model.
Okay, like a primary replica setup.
Precisely.
You have a clear primary and a set of replicas.
If that primary fails, the system has to pause.
It has to explicitly promote one of the replicas to be the new master and then reestablish the data flow.
And the secondly.
The second is where systems handle consistency implicitly.
They don't rely on a single master.
Instead, they coordinate and collect responses from multiple participants during every single query.
It's that coordination that keeps the data consistent.
So what for you, the listener, is just huge here, especially when you start thinking about global infrastructure.
It's not just about surviving a single server crash anymore.
Not at all.
Replication drastically increases availability, which means the system can withstand failures as large as an entire data center going offline.
And on top of that, it dramatically reduces latency.
Georeplication lets you place data physically closer to your end client, which makes everything feel faster.
So when we dissect these systems, we need to be watching three critical events that really define the system's behavior.
That's right.
The complexity of the consistency model comes down to these three things.
First, there's the write operation initiated by the client.
Second, the replica update, which is the internal process of propagating that right across the network.
And third, the read operation initiated by another client trying to observe the data.
So the timing and the ordering of these three events and the rules we enforce on that sequence, that's what we are about to explore in depth.
That's the whole game.
You know, the push for ultra -high availability isn't just some technical aspiration.
It literally defines the modern digital economy.
It really does.
I mean, if you are running an e -commerce platform, a banking system, or any essential communication infrastructure,
downtime directly translates to losing customers, losing money, and just complete disruption of service.
High availability is just not negotiable.
That's right.
And because high availability requires that redundancy we just talked about, we are immediately thrown into what is probably the most difficult problem in distributed computing,
synchronization.
How do you keep all those copies perfectly in sync?
Exactly.
And this whole challenge is just perfectly framed by the infamous CKAPI conjecture, which was first articulated by computer scientist Eric Brewer.
Ah, CAP, consistency, availability, and partition tolerance.
It's arguably the most cited and maybe the most misunderstood theme in computer science.
Definitely misunderstood.
So let's make sure we nail down precisely what those three terms mean within the context of the conjecture itself.
Let's start with availability A.
In the CAP definition, it's a very theoretical measure.
It means that every request made to a non -failing node must eventually result in a successful response.
But the fine print there is key, isn't it?
It is.
What's often overlooked is that the original CAP definition places, well, no bounds on the latency of that response.
So in real world engineering, a response that takes 30 seconds is technically available, but it's practically useless to a user.
That's a key distinction.
Okay, now let's tackle consistency C.
This is where the confusion really peaks because this is not the C in ACID.
Absolutely not.
And that's critical.
The C in ACID refers to transactional consistency.
The idea that a transaction has to follow all the rules and invariants like referential integrity and leave the database in a valid state.
But the C in CAP is way stronger.
Way stronger.
It's defined as atomic or linearizable consistency.
This requires that the distributed system appears to the client as if it is running on a single instantaneous machine.
Any operation has to succeed or fail in its entirety, and all observers have to see the state change at the same logical moment.
And finally, partition tolerance P.
This isn't really a feature you choose.
It's more like an unavoidable reality.
It's just a fact of life in distributed systems.
Partition tolerance is the system's ability to survive a state where the network splits, meaning nodes or groups of nodes just cannot communicate with each other.
This is unavoidable in any real world asynchronous distributed system.
Okay.
So the conjecture then states,
in the presence of this network partition, you cannot simultaneously guarantee both strong consistency C and full availability A.
You have to choose one.
So if that network split happens, the core engineering choice really just boils down to the CP versus AP choice.
Right.
Let's consider a CP system first.
These systems prioritize consistency above absolutely everything else.
Like a consensus algorithm.
A classic example, yeah.
A database using a consensus algorithm like Raft or Paxos.
These algorithms fundamentally require a majority or a quorum of nodes to agree on a state change to move forward.
So if a network partition splits a five node cluster into say a two node side and a three node side.
The two node side is out of luck.
It can't form a majority.
So it fails requests.
It would rather fail than risk accepting a right that the majority might later overwrite or invalidate.
And the AT systems, they take the complete opposite view.
They prioritize staying up, even if it means serving data that might be a little stale.
Exactly.
AP systems prioritize availability.
Imagine an e -commerce checkout page.
If the price database is partitioned, you definitely don't want the user to see a 503 error.
An AP system might be designed to always accept writes or serve reads as long as any single replica is up.
But that means you could get divergence.
Oh, absolutely.
During the partition, one side might serve a stale product inventory count or accept a right that the other side of the partition never receives.
This inevitably leads to data divergence, but crucially, the user experience remains available.
The promise is that the divergence will be reconciled eventually.
What's fascinating and something we have to clarify is that CAP is often oversimplified.
People draw it as a triangle and say pick two, but there's a ton of nuance here that really affects how we build systems.
The first critical nuance is that CAP discusses network partitions, not just simple node crashes.
A node crash is easy.
The node just stops responding.
But a node that is partitioned from the cluster can still be perfectly operational.
It can accept writes, it can serve reads to local clients.
But its data is now out of sync with the rest of the world.
Exactly.
You have a consistency problem even though all your nodes are technically up.
And the idea that you can tune P is also a bit misleading.
You can't.
You cannot tune partition tolerance because it's an external property.
It's determined by the chaos of the network.
If a cable gets cut or a router fails, P happens.
The only choice you have is between A and C when P happens.
And we already hammered on the difference between KP's consistency and ACID's consistency and KP's availability versus real -world availability.
That's why researchers extended the model to PACELEC.
This acknowledges the limitations of KP.
So in the presence of partitions, PTP, you still choose between C or A.
But what about the LSE case when the network is perfectly healthy?
You still have a trade -off.
You still have a choice.
And it's between latency and consistency.
The overhead of maintaining strong consistency, all that coordination, the message passing, waiting for acknowledgments, it adds inherent latency, even when the network is perfect.
Strong consistency is expensive all the time, not just during a failure.
So an engineer designing a globally distributed system is constantly making these choices that impact their user experience.
And to make that choice less of a binary fail -everything or serve -junk data, we have these relaxed guarantees.
Right, which shifts the focus from a hard pass -fail to quantifiable metrics like harvest and yield.
Okay, what do those mean?
These metrics allow for what's called graceful degradation.
Harvest defines how complete the query result is.
Imagine a massive sharded search index.
A search request needs to hit 100 shards to get a full result.
But what if five of those shards are unavailable?
A strict system would fail the whole query.
But a system focused on harvest would say, you know what, I'll return the 95 rows I found rather than failing completely.
It's sacrificing completeness, the harvest, to maintain responsiveness.
And yield is the ratio of successful requests to attempted requests.
Exactly.
Yield is a measure of robustness that's different from simple uptime.
A node can be up, but if it's overloaded and starts timing out requests, its yield is low.
The crucial trade -off is often sacrificing harvest, allowing incomplete data to maximize yield, ensuring the system stays responsive for the maximum number of clients, even if the quality of service is a little bit reduced.
It gives engineers a much more fine -grained way to manage the system's behavior under stress.
Much better than a simple binary CQAPP choice.
Okay, before we dive headfirst into the really complex consistency models, we need to establish a kind of foundational language.
How do we even conceptualize data when it's spread across this vast distributed landscape?
We rely on a concept called the illusion of shared memory.
A distributed system, a good one at least, hides the complex reality of internode communication and asynchronous message passing.
It presents to the programmer a much simpler model.
The model of just interacting with a single shared storage space.
Exactly.
And this shared space is conceptually modeled as an array of simple storage units, which we call registers.
A register is just the most basic unit you can read from or write to.
And every operation on these registers is defined by two crucial points in time.
The moment of invocation and the moment of completion.
Right.
And if an operation P1 completes a time, two per dollars, and then a later operation P2 starts its indication at T, and T -tol happens before T -tol, well, the operations are sequential.
There's no ambiguity there.
But all the difficulty, all the complexity, comes when the operations overlap, when they are concurrent.
Right.
If their time bounds overlap, they're concurrent.
The obvious case is if P1 starts, and then P2 starts and finishes entirely inside P1's execution window.
But there's a more subtle case that's really important to visualize.
Go on.
P2 can start after P1 begins its invocation.
But if P2 manages to finish before P1 completes its own execution, they're still considered concurrent.
Their time boundaries overlap, and it's that overlap that forces us to define all these consistency rules.
And how a register handles those overlapping concurrent operations dictates its guarantees, starting with the weakest one,
the safe register.
Safe registers are mostly theoretical.
They're pretty impractical for serious data storage.
They give you the absolute minimum guarantee.
Which is what?
If a read happens sequentially after a write, it has to return the new value.
But, and this is the big but,
if a read happens concurrently with a write, the read can return an arbitrary value from the register's range.
For binary values, this means the result might just flicker unpredictably between the old and new value.
It's not reliable.
Okay, so moving up in strength and a little more useful, we have regular registers.
Regular registers provide a stronger guarantee.
A read operation is guaranteed to return one of two things.
Either the value from the most recent write that successfully completed before the read began,
or the value from any write operation that is currently overlapping with that read.
So it's not totally random anymore.
Not random, but the visibility isn't globally consistent at the same moment.
You have some notion of temporal order based on completion, but it's still a bit fuzzy.
What's a good system analogy for that?
Think of a common master -worker replication setup, but without strong synchronous coordination.
The master accepts a write that's P1 completing.
It then asynchronously replicates that change to a set of workers.
A read, P2, that hits a worker might see the result of P1 immediately, or it might hit a worker that hasn't gotten the update yet and is still serving the old state.
Because P1 completed before P2 started, P2 must eventually see P1's result, but it might not be instantaneous.
The timing is non -linearizable.
Which brings us to the gold standard for a single register, the ideal we strive for when we really want simplicity, the atomic register.
Atomic registers guarantee linearizability.
This is the strongest single register guarantee you can get.
It ensures that every operation, read or write, appears to have a single, definitive, instantaneous shift in value.
All operations before that point see the old value, and all operations after see the new value.
It just gets rid of all that ambiguity of concurrent reads.
Completely.
It fundamentally simplifies how you reason about the system state, because it behaves exactly like a single, non -distributed memory slot.
All right, so now we take that conceptual model of a single register, and we try to apply it to a multi -process distributed system.
And this is where the difficulty just skyrockets.
Because there's no global clock.
Exactly.
There's no absolute global synchronized clock.
Because of network delays and, you know, relativity,
different processes inevitably have different views of the state and different perceptions of time.
We sometimes call this a special relativity theory for data.
And consistency models are the rules we use to bring order to that chaos.
They're the formal contracts.
We impose them to restrict the possible outcomes, the possible histories of the system.
And every constraint in that contract comes with a cost, a performance tax, if you will.
A very real tax.
We're talking about latency from coordination messages,
CPU cycles for locking, network IO just for message passing.
The tighter the consistency, the higher that tax.
If we could ignore physics and cost, what would be the ideal?
The theoretical idea would be strict consistency.
The fictional instant propagation.
Precisely.
Under strict consistency, any write by any process is instantly available for any subsequent read globally across the entire universe.
If write X1 happens at location A at time $2 on any read X at location B at any time $2,
just slightly greater than $2 one on has to return one.
But the speed of light says no.
Speed of light makes this physically impossible to implement in any real distributed system.
So what's the strongest practical model that engineers can actually build?
That would be linearizability.
We already introduced it with atomic registers, but when you scale it up to a whole distributed system, it guarantees that the effects of an operation become visible to all readers exactly once at a single atomic point, which occurs sometime between that operations invocation and its completion.
It creates a total order of events.
A global total order.
It respects the order of operations from a single client, and it enforces an ordering even among concurrent parallel operations.
Crucially, it provides two key guarantees.
It prohibits stale reads, and it requires all reads to be monotonic.
Meaning once you see a value, you can never see an older one.
Never.
Once you observe a value, no subsequent read anywhere in the system can ever observe an older value.
To make this concrete, let's walk through that classic visibility illustration.
You have three processes.
P1 writes one, P2 writes two, and P3 is just reading.
All while P1 and P2 are running, P1 starts its write a little bit before P2.
So when P3 performs its first read, both P1 and P2's writes are still in progress.
They are concurrent and incomplete.
So the system hasn't decided the order yet.
Exactly.
Because neither operation has reached its atomic linearization point, P3's first read can still return the initial value.
Let's call it empty set.
It could also return one or two, depending on how that read operation gets ordered against the internal state transition of P1 or P2.
The outcome is not yet final.
Okay, now P1's write, X1 completes.
This is before P3's second read.
But P2's write is still in flight.
Right.
Since P1 is complete, the read cannot return empty set anymore.
The past has been decided.
It must return at least one, because that's the state of the last completed write.
It could, however, still return two if the read gets linearized against the in -flight P2 write.
So the completion of P1 sets a new floor for what's an acceptable state.
It establishes a lower bound, yes.
And finally, by the time P3 does its third read,
P2's write, X2, has also completed.
And since P2 started after P1, its write is ordered after P1's.
Therefore, the third read must deterministically return two.
Because all operations have completed and P2 followed P1 in real time, the total order is now established as $1 to 22.
This strict adherence to real -time boundaries is the defining feature of linearizability.
And this all hinges on that one concept, the linearization point.
It does.
It's the conceptual moment where the operation appears to happen instantaneously, and its effects become visible to the whole world.
And it has to happen somewhere between the start and end of the operation's actual execution time.
It has to.
If you look at the time axis, the operation starts at invocation and ends at completion time.
The linearization point must fall somewhere in that window.
Before the point, everyone sees the old value.
After the point, everyone sees the new value.
The whole job of the distributed system is to agree on exactly where that point falls for every single operation.
How do you even do that in practice?
How do you make something appear instantaneous across dozens of servers?
Well, it involves heavy synchronization.
You use consensus algorithms like RAF to manage an ordered log of operations.
At the micro level, it often relies on primitives like locks or complex instructions like compare and swap, CAS.
Ah, CAS, which tries to write a new value only if the current value matches what you expect.
Right, but CAS famously opens the door to the ABA problem, which is a huge engineering hazard.
Can you explain that?
It's deceptive.
Imagine a CAS instruction expects to see the value A.
While it's waiting to execute, another concurrent operation comes in, changes the value from A to B, and then changes it back to A.
So when the CAS runs, it sees A and thinks nothing has changed.
It succeeds, but the system state has been fundamentally modified in the meantime.
The simple presence of A doesn't guarantee the integrity of the state or whatever data structure it's guarding.
It just highlights how incredibly difficult it is to ensure integrity across asynchronous nodes.
Given all this complexity and cost, why is linearizability still considered the gold standard?
Its main advantage is simplicity for the programmer.
A system built entirely of linearizable objects is itself guaranteed to be linearizable.
That property is called composability, and it greatly simplifies the task of verifying that your system is correct.
But the costs are huge.
Massive.
CPU traffic for sync, cache invalidations, network IO for coordination.
This is why so many high -performance systems reluctantly step down in their consistency guarantees.
Speaking of implementation, we should highlight RAFL, the reusable infrastructure for linearizability.
It's designed for linearizable remote procedure calls, RPCs.
What's the engineering trick there?
RAFL addresses two big problems, sequencing and idempodence.
To handle sequencing, clients are assigned unique IDs, and they use monotonically increasing sequence numbers for every request.
But how do you make sure that client ID is reliable, especially if a client crashes?
That's where leases come in.
A system -wide service issues a time -bound lease to the client.
If the client fails or stalls, its lease expires.
The system then knows that any operations attempted with that ID can't be committed.
Clients have to periodically renew these leases to stay active.
And how does it handle idempodence?
Preventing a retried RPC from executing twice.
That's solved by the completion object.
When RAFL successfully executes an operation, it stores the result, the completion object, durably, right alongside the data itself.
If the client retries the RPC with the same client ID and sequence number, RAFL sees that existing completion object, skips re -executing the operation, and just returns the stored result.
It ensures the operation is effectively executed only once.
If linearizability is the ultimate goal, but it's way too expensive for many systems, the first step down the ladder is sequential consistency, SC.
How does that relax the rules?
SC keeps the requirement that all operations are ordered as if they were executed sequentially.
And crucially, each individual process's own operations must appear in the order they were submitted.
But the big relaxation is?
The big relaxation is that SC does not respect the real -time order.
The total order that the system agrees upon can be arbitrarily stale from the perspective of a global physical wall clock.
So we're giving up wall clock reality in exchange for just having some total logical order that everyone agrees on.
Precisely.
The classic diagram, figure 11 to 5, shows this.
P1 writes 1, then P2 writes 2, so P1's operation finished first in real -time.
Under sequential consistency, a reader might observe the sequence as $2 to a 1.
Which violates the real -world timeline.
It does.
But SC is maintained as long as all observers, P3 and P4, both see that same agreed -upon total order, which is $2 to a 1.
The system maintains internal consistency among observers, even at the cost of real -time accuracy.
That makes the distinction really clear.
And importantly, SC is not composable like linearizability.
Correct.
And if even sequential consistency feels too rigid, if requiring a single global total order is too much, we can relax the model even further and focus only on dependency.
This brings us to causal consistency.
The model where only cause and effect matters.
Exactly.
Under causal consistency, only operations that are causally related, where one operation explicitly depends on the outcome of a previous one, have to be seen in the same order by all processes.
Any rights that are concurrent and have no direct dependency can be observed in different orders by different processors.
Okay, let's contrast the two examples here.
In the non -causal scenario, P1 writes 1 and P2 writes 2 concurrently.
Right.
No dependency.
So P3 can see the order as $1 to 22, while P4 can simultaneously see it as $2 to a 1.
Both views are totally permissible.
It's great for availability.
But now consider the causal scenario.
P1 writes 1, then P2 reads the value 1, and based on that result, P2 writes 2.
Now there's a dependency.
P2's operation is causally dependent on P1's.
This dependency has to be tracked, usually with a logical clock or some version metadata.
And because of that dependency, the system now has to make sure that both P3 and P4 observe the order $1 to 22.
It has to.
Even if P2's update message travels faster across the network, the system has to buffer P2's update until P1's proceeding, causal dependency arrives and is applied locally.
And that dependency tracking is the implementation cost of causal consistency.
It is.
Systems like CopyS track dependencies using key versions, while Eiger tracks the operation order itself.
But what you get is a strong guarantee that an application sequence of reads and writes will always appear consistent with its own logic, even if the rest of the world sees concurrent operations happening differently.
Okay, so managing that complex partial causal history requires more than just simple time stamps.
One of the most critical data structures for this is the vector clock.
Vector clocks are, they're really ingenious.
They let us simulate a kind of common time among asynchronous nodes and, more importantly, they let us establish a partial order between events.
How'd it work?
Each process maintains a vector of logical clocks.
There's one slot in the vector for every process in the system.
So for a three -node system, A, B, and C, everyone starts at $0.
For a product system, exactly.
If node A performs a local write, it increments only its own slot, so its vector becomes one of your dots.
Now, when node A sends a message or replicates its state to node B, it sends its full vector $1.
Node B, upon receiving this, updates its own local vector.
Let's say B was at $1 by taking the maximum clock value for each slot it's seen so far.
So B merges them, and its new vector is $1.
That vector now encodes the causal history of events that B has seen.
And this allows the system to build a causal chain, but its most powerful use is in detecting divergent histories.
That's the key insight for eventually consistent systems, yes.
When a replica gets an update, it compares the incoming vector clock with its local one.
If the incoming vector is strictly greater in all slots, it means it causally succeeds the local version, and you can just apply it.
But what if it's not?
What if they're incomparable?
If A's vector is greater in some slots but B's vector is greater in others, it indicates a conflict.
A conflict meaning two processes independently wrote to the same data item without having seen each other's updates first.
The history is diverged.
Exactly.
That diagram, figure 11 to 9, shows this well.
Node 1 saw history $1, $5, $7, $80.
Well, Node 2 saw $1, $5, $3, $3.
The vector clocks flag this divergence as needing resolution.
And it's important to note, the vector clock only detects the conflict.
How you resolve it, with application logic, last right wins, whatever, is a separate mechanism.
Okay, now let's pivot away from this global state and focus entirely on the end user's experience.
This is where session models, or client -centric consistency, come in.
Right.
This approach recognizes that the client has a reasonable expectation of continuity, regardless of which back -end replica they happen to connect to.
It tackles that common problem where you update your profile, hit refresh, and your update seems to have vanished because you got routed to a stale replica.
That's the exact problem it solves.
It focuses entirely on the sequence of operations for a single client, assuming those operations are sequential from that client's point of view.
And session models define four core guarantees to make the experience feel coherent.
What's the first and most fundamental one?
Read -own writes.
This means any read operation following a write by the same client must observe that updated value.
If I post a comment, when I immediately refresh the page, I have to see my own comment.
The system has to track my write context and enforce that visibility for me.
Number two is monotonic reads.
This stops time from going backward.
Yes.
If a client reads value V, all subsequent reads from that same client must observe a value that is at least as recent as V.
You can't read a new value and then, on a subsequent request, fall back to reading an older, stale value.
And number three, monotonic writes.
This means writes from the same client have to propagate and be applied across the system in the order they were executed by that client.
If I change my status to V1 and then immediately to V2, the system has to make sure replicas apply V1 before V2 to prevent the resurrection of the old V1 value.
And the fourth one, writes follow reads, which is basically session -level causality.
It is.
If a client reads V1 and then performs a subsequent write V2, the system has to ensure that the V2 write is ordered logically after the V1 write that the client observed.
This preserves the cause and effect relationship from the client's perspective.
When you combine monotonic reads, monotonic writes, and read -own writes, you get what's known as pipelined RAM, PRAM consistency.
Or FIFO consistency, yeah.
PRAM is a really solid, practical model.
It guarantees that writes from the same process propagates in order, but it allows writes from different processes to still be observed in different orders by different readers.
It's a foundational stepping stone for building usable, high -availability systems without the enormous cost of full linearizability.
We've established that the stronger the consistency model, the higher the cost, sometimes forcing a system to choose to fail during a partition.
This brings us into the realm of more relaxed models, and the most forgiving of them all is eventual consistency.
Eventual consistency is the ultimate high -availability choice.
Updates propagate asynchronously.
They might be late.
They might be out of order.
They might arrive through totally different network paths.
The formal definition says that if no new updates happen, the system will eventually converge and all accesses will return the latest written value.
That word eventually is what makes engineers so nervous.
It gives you no hard time bound.
If eventually means tomorrow, that's a problem.
It's a huge trade -off.
Eventual consistency sacrifices the guarantee of immediate consistency for maximum performance and availability.
It explicitly allows replica states to diverge temporarily.
So all the complexity gets pushed onto the conflict resolution layer?
Entirely.
That layer has to reconcile these divergent states using mechanisms like last right wins or vector clocks once communication is finally restored.
When we commit to being eventually consistent, we gain this really crucial ability to manually tune the balance between availability and consistency using the famous n, w, and r variables.
Right.
These three variables let engineers define exactly how durable and consistent their operations need to be.
n is the replication factor, the total number of nodes storing the data.
w is the right consistency, the number of nodes that have to acknowledge a right for it to be considered complete.
And r is the read consistency, the number of nodes that must respond to a read request for it to be successful.
And this is where we get that critical formula for achieving strong guarantees, even in an eventually consistent system.
If r plus w is strictly greater than n, you get consistency.
Let's walk through that with an example.
Suppose our replication factor n is 3.
We could use w2 and r2.
So r plus w is 4, which is greater than n, which is 3.
Exactly.
This guarantees that the read set and the write set will always overlap by at least one node.
The system writes to all three nodes, but only waits for two acknowledgments to return success.
When reading, it asks all three, but waits for two responses.
Since the successful write needed two nodes and the read needs two nodes, those two sets have to intersect.
That guarantees the reader will encounter the most recent successful write value.
This also shows we can tolerate one node failure and still maintain that guarantee.
And the beauty is you can tune these variables for your workload.
Absolutely.
If you have a write -heavy system, like logging sensor data, you might choose w1 and rn.
Which would be r3 in our example.
Right.
This makes writes super fast.
The system only needs one local node to acknowledge it.
But your reads become really expensive and fragile because you have to contact all three replicas to guarantee you get the latest value.
And conversely, for a read -heavy system where consistency is vital.
You might choose w1n and r1.
This makes writes very slow because they have to wait for all n replicas to confirm.
But once that write succeeds, you're guaranteed that every single replica has the latest version.
So reads are incredibly fast.
You can just pull the value from any single node.
That trade -off is the essence of tuning.
The concept of a quorum is basically just a specific majority -based version of this tuning, isn't it?
It is.
A quorum is defined as any set of nodes greater than half the total.
And two euros.
Using quorums ensures that any two quorums, a read quorum and a write quorum, will always intersect, which preserves that consistency invariant.
It's a powerful, mathematically safe way to tolerate up to five failures in a system with $2 dollars plus a Y of the nodes.
But it has its own risks.
If you lose a majority of your nodes, say two out of three, the system loses quorum and has to stop all reads and writes even if one node is perfectly healthy.
And, critically, quorum reads alone don't guarantee monoconicity.
If a write succeeds with w2 out of n3, but that third replica fails to get the update, subsequent reads could alternate between the old and new value depending on which two nodes respond.
You often need to implement techniques like blocking read repair to fix that, which adds back the latency you're trying to avoid.
To reduce the storage cost of high replication but still get the availability benefits of quorums, systems sometimes use witness replicas.
This is a really clever optimization, especially for geo -replication, where storage is expensive.
Instead of storing a full copy of the data on every replica, you split your n nodes into copy replicas, which store the full data, and witness replicas, which only store lightweight metadata saying a write occurred.
So you could have a five node system, but only three are full copy replicas, and two are just cheap witness replicas.
Precisely.
And if two of your copy replicas fail, the witness replicas can be temporarily upgraded to store the necessary metadata, helping you maintain the quorum.
You get the same availability as five full copies, but with lower storage cost, as long as you follow two strict rules.
What are they?
First, you still have to use majority quorums, $200 plus and 11 participants.
And second, and this is the crucial one, at least one replica in any quorum must always be a copy replica.
You have to ensure you're always accessing the actual data, or at least a majority record that points to where the data is.
We've covered the expensive, super strong end with linearizability, and the flexible, but sometimes uncertain end with eventual consistency.
But there's this really elegant middle ground that a lot of modern collaborative apps use.
Strong eventual consistency, SEC.
SEC is really the golden path for applications like shared documents or collaborative tools.
It allows updates to be late, asynchronous, or out of order, but, and this is the key, it guarantees that when all the updates finally do arrive, any conflicts can be resolved deterministically.
So every node converges to the exact same valid final state, automatically?
Automatically, without any manual intervention.
And the key technology that makes this mathematical certainty possible is conflict -free replicated data types, or CRDTs.
They are magnificent.
They are specialized data structures that are designed mathematically to preclude conflicts from ever happening in the first place.
Which means operations can be applied in any order without changing the final result.
Exactly.
This makes them ideal for systems that are designed to diverge temporarily during network partitions because they know they will reconcile themselves perfectly later.
Replicas can execute operations locally instantly without needing any synchronous coordination.
Let's talk through a classic example.
The grow -only counter, gcounter.
This is a commutative RDT, meaning the order of operations doesn't matter.
The gcounter is beautifully simple.
Its state is a vector of numbers, and each server is only allowed to modify its own slot in that vector.
So for three nodes, the state starts at zero dollars.
If node one increments its counter, its state becomes one another dollar.
So if node three increments its counter, its state becomes zero dollars, zero dollars.
Okay, now when node one and node three replicate their states to each other, they have to merge.
And the merge function is simply taking the maximum value for each corresponding slot.
So node one merges its one dollars dollars with the incoming zero, zero dollars goal, and its new state becomes max one zero zero, max zero one, one to two one, which is one dollar.
And the final counter value is the sum of that vector, which is two.
Right.
And because that merge operation only relies on picking the highest value for each slot,
it is inherently commutative.
It doesn't matter if node one merged first or node three merged first, the final state will be identical.
And if you need a counter that can go both up and down.
You use the PN counter.
It just extends the concept by maintaining two separate vectors.
A P vector for increments, positive, and an N vector for decrements, negative.
You use the same max merge function on both vectors separately.
And the final value is the sum of P minus the sum of N.
These CRDT principles also apply to simple storage, like registers.
Yeah, the common one is the last right wins, the LWW register.
It stores a value alongside a unique, globally ordered time stamp.
The merge operation just picks the value with the largest time stamp.
Since time is absolute, the merge is commutative and deterministic.
But LWW throws away data.
It does.
It discards concurrent rights.
If you can't afford to do that, you have to use a multi -value register, which keeps a list of all the conflicting values and requires your application to supply its own merge logic.
Finally, let's look at sets.
Adding to a grow -only set, Gset, is commutative.
But how do you handle removals without creating conflicts?
To support both adds and removals, you need a clever dual structure.
You maintain two distinct sets,
an addition set and a removal set.
And there's a critical rule here.
A critical invariant.
You can only add a value to the removal set if it has already been observed in the addition set.
Why is that so important?
Because if you allow the removal of an element that hasn't arrived in your local addition set yet,
maybe because of a network partition, you might remove something that was never truly there to begin with.
By requiring that the addition must causally precede the removal, you guarantee that when all the updates eventually propagate, the final state, which is just the addition set minus the removal set, will be identical everywhere.
Okay.
Let's unpack all of that and synthesize the spectrum of data guarantees we've charted today.
We moved from the theoretical ideal all the way down to the highly practical.
Right.
Starting at the highest bar, we had linearizability, instantaneous visibility, enforcing a real -time order, super expensive, needs consensus, but it's fully composable.
Just below that, we found sequential consistency.
You still get a total order that everyone agrees on, but you gain efficiency by sacrificing that strict adherence to the wall clock.
Then we got more flexible with causal consistency, using logical clocks to ensure only causally related operations must be ordered, which allows concurrent updates to diverge.
And finally, the core of client -centric systems, PRAM -FIFO consistency, guaranteeing that rights from the same source are ordered, while rights from different sources can propagate asynchronously.
We also covered the importance of managing the client view through session models, ensuring guarantees like read -own rights and monotonic reads.
And we looked at server -side tuning with the NWR variables, where a core majority, plus WN1, gives you strong consistency guarantees, often supported by witness replicas to save on storage.
And maybe the most forward -looking tech we discussed was CRDTs, which enable strong eventual consistency.
By designing data structures to be mathematically immune to conflicts, they provide incredible availability and performance, even during long network partitions.
The final system insight here for you, the engineer, is that understanding these underlying consistency models is just non -negotiable.
It really is.
If you build an application that relies on causal consistency, but you deploy it on a database that only guarantees eventual consistency, you are introducing unrecoverable, non -deterministic bugs.
The consistency contract has to be understood and respected at every single layer of the architecture.
So we noted that sequential consistency allows operations to be reordered against the wall clock time, while linearizability enforces that strict real -time boundary.
This brings us back to the ultimate engineering challenge.
If you have a massive, geo -replicated, deployment -spanning continence, where network latency is constantly shifting, what specific practical tools or techniques must an engineer use to convince the system that a second is the same length everywhere, ensuring true global linearizability?
How staggeringly expensive is that illusion to maintain?
That illusion?
It requires a radical departure from traditional asynchronous computing, and its cost is just astronomical.
To enforce global real -time linearizability, engineers have to use extremely specialized hardware.
Like what?
Specifically, atomic clocks, like the ones used in Google's Spanner database, which runs the TrueTime API.
So we're literally talking about dedicated physical infrastructure in the data center.
Absolutely.
TrueTime uses GPS receivers and specialized atomic clocks in every single data center to guarantee a small bounded error interval on time across their entire global fleet.
The system doesn't know the exact time.
Instead, it guarantees that the current time falls within a known tiny interval, epsilon.
And how does that help?
By making transactions wait.
By ensuring they only commit after the earliest possible starting time of any previous conflicting operation has passed, they can achieve global linearizability.
But it requires that expensive hardware, constant maintenance, and it introduces mandatory waiting periods into every transaction to accommodate that time uncertainty.
It is the very definition of a high synchronization cost.
The ultimate price for eliminating uncertainty.
A truly profound engineering choice.
Indeed.
It really defines the bleeding edge of distributed system design.
Thank you for joining us for the Deep Dive.
We hope this exploration gives you a much clearer understanding of the consistency landscape.
We hope you gain some valuable insights.
Catch you next time.
β This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML β₯Related Chapters
- Digital Wallet System DesignSystem Design Interview - An Insider's Guide (Volume 2)
- Replication in Distributed Data SystemsDesigning Data-Intensive Applications
- Ad Click Event Aggregation DesignSystem Design Interview - An Insider's Guide (Volume 2)
- Anti-Entropy and DisseminationDatabase Internals: A Deep Dive into How Distributed Data Systems Work
- Bacterial Genome Replication & Gene ExpressionPrescott's Microbiology
- Consistency Models & Consensus AlgorithmsDesigning Data-Intensive Applications