Chapter 9: Consistency Models & Consensus Algorithms

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today, we are taking on what might be the single hardest problem in software engineering, building reliable systems out of fundamentally unreliable components.

We're going deep on consistency and consensus,

the mathematical bedrock required to make distributed databases behave like they're running on one perfect machine.

We've got a stack of research here pulled directly from our source material on distributed systems.

And our mission today is to sort of extract those general purpose guarantees.

The high level abstractions that let us ignore all the chaos of dropped packets, flaky clocks and node crashes.

That's the entire game, isn't it?

I mean, earlier in the world of databases, we learned about the transaction abstraction acidity properties, which lets developers build reliable logic without worrying about hardware failure or concurrent access.

But when you move to a distributed system, you need an even stronger set of guarantees to hide the nastiness of network faults.

And the core of that guarantee, the thing that makes fault tolerance possible is consensus.

It's the ability for these widely separated nodes to reliably agree on a single outcome.

Okay, let's unpack that because agreement sounds simple, but it's crucial.

Think about something like single leader replication.

If the leader node dies, the remaining nodes have to agree on which one of them becomes the new leader.

If they don't agree and you get two or more nodes thinking they're the leader, you hit a split brain scenario.

And that's just catastrophic data corruption.

Exactly, avoiding split brain is really what our entire journey today is building up toward.

But to really appreciate the solution, we have to start at the bottom of the consistency ladder where things are weakest,

that's eventual consistency or maybe a better term is convergence.

This is the weakest guarantee systems like Dynamo or many multi -leader setups offer.

The promise is if you start writing data and just wait,

eventually all the nodes will converge on the same value.

Right, but that phrase unspecified period of time, that's the operational nightmare right there.

This extreme uncertainty is the cause of all these subtle persistent bugs that drives engineers completely mad.

If I write a value and then milliseconds later I try to read it back, I might hit a replica that just hasn't gotten the update yet.

I see stale data and that's eventual consistency in action.

It basically puts the burden on the programmer to handle all those timing issues.

Precisely, and to escape that manual complexity, we look for the gold standard, the strongest model there is, linearizability.

Sometimes it's called atomic or strong consistency.

The entire point of linearizability is to create this perfect illusion.

The illusion that there is only one copy of the data in the entire universe.

It gives you an absolute recency guarantee.

If a write completes, any read that starts after that, regardless of which replica it hits, must return that new value.

The classic example of a linearizability violation really drives this home.

It's the one with Alice and Bob watching the FIFA World Cup final.

Right, so imagine Alice refreshes the score on her phone.

She sees the final score, say, one zero.

She shites it across the room to Bob.

Bob hears this, immediately refreshes his own phone, but his query hits a lagging replica in some other data center.

His phone still shows the score at zero zero.

Game's still ongoing.

That's a clear violation.

Because Alice's statement created a causal link.

Bob initiated his request after Alice's observation, so he expected his result to be at least as recent as hers.

The system failed to give him that single copy illusion.

So technically what that means is if client A finishes a write, and then client B successfully reads that new value, then from that exact moment forward, every other client CDE, no matter where they are, must see that new value, or an even newer one.

The system has to act as if the value just instantly flipped at one atomic moment in time.

And this is where we need a quick clarification to avoid some common confusion.

Linearizability is often mixed up with serializability.

Serializability is an isolation property for transactions that span multiple operations, ensuring those transactions look like they ran in some sequential order.

Linearizability, on the other hand, is purely a recency guarantee for a single object.

So they're only related in the sense that the absolute strongest guarantee, which is called strict serializability, combines both of them.

You need the multi -object isolation of serializability and the single object recency of linearizability to get there.

And it's important to note that popular mechanisms like serializable snapshot isolation, or SSI, are not linearizable.

Right, because SSI reads from a snapshot.

It's consistent within itself, but it could be stale compared to the absolute latest right somewhere else in the network.

So why do we put ourselves through all this pain and cost for linearizability?

Where is it absolutely non -negotiable?

Well, there are three big areas.

First is locking and leader election, which we've already mentioned.

If you can't get linearizability, you can't guarantee one and only one leader.

You get split brain.

The second area is just critical for data integrity,

constraints, and uniqueness.

Think about making sure only one person can register a specific username.

Or making sure a bank balance never drops below zero.

That's a linearizable compare and set operation.

You have to check the current value and write the new one atomically, and that check has to be against the single most recent truth.

Exactly.

And the third area is the subtlest one.

It involves these cross -channel timing dependencies.

Take a common setup.

A web server runs a new image file to a storage system and then immediately sends a message on a separate queue to an image resizer service.

The resizer gets the message, tries to fetch the image from storage.

If that storage isn't linearizable, the message queue, that separate channel, might just be faster than the data replication.

So the resizer gets the instruction, tries to fetch the file, but it hits a replica that hasn't seen the image yet, which violates the very clear causal dependency.

The message was sent after the file write completed.

Now you have a resizer that's just waiting forever, or worse, it fails and creates a permanent inconsistency.

The file exists, but the thumbnail doesn't.

And implementing this is really tough.

I mean, single leader replication can work if all reads and writes go through the leader, but you still have to worry about correctly identifying that leader during a failover.

Multi -leader and leaderless systems, like Dynamo -style ones, are generally not linearizable.

Asynchronous replication just inherently breaks the recency guarantee.

Even with strict quorums where W plus R is greater than N, it doesn't save you because of variable network delays.

You can still get a stale read and you're leaving with a newer write.

Which brings us to the inevitable trade -off.

If linearizability is the holy grail, why isn't everyone using it for everything?

This is where the infamous CAP theorem comes in.

It frames the trade -off.

When network partition happens, and it's an inevitable part of distributed systems, you have to choose.

Consistency meaning linearizability or availability.

Right, but the crucial critique of CAP that the sources highlight is that the choice isn't just pick two of three.

Network partitions will happen.

The real choice is do we prioritize linearizability and make the system parts that are cut off from the majority unavailable?

Just error out.

Or do we sacrifice that consistency to make sure the system stays available, even if it means risking inconsistent data?

And here's maybe the most important insight about the cost of linearizability.

It's slow, and it's slow all the time, not just during partitions.

Why is that?

Because to achieve that single -copy illusion, the response time of any operation, even a simple read, is dictated by the uncertainty of the network delay to your furthest replica.

You cannot respond until you're absolutely certain that no other node in the network is currently servicing a write that might contradict your result.

The slowest possible link in your network sets the floor for your latency.

So weaker consistency models are just inherently faster because they don't have to wait for that global certainty.

Right, so if linearizability is often too expensive because it's fighting the network and demanding this global single timeline,

where do we find a strong, but still performant, middle ground?

That brings us back to the most fundamental concept underpinning all of this, ordering.

And we care about ordering because it directly relates to causality.

The principle that a cause must precede its effect.

An operation that happens because of knowledge of a previous operation is causally dependent on it.

You shouldn't be able to update a database row that hasn't even been created yet.

And if a system successfully obeys that ordering imposed by causality, we call it causally consistent.

This is a really powerful middle ground because linearizability implies causality.

It's strictly stronger.

But causal consistency doesn't necessarily have that CIP trade -off penalty.

It gives you strong guarantees about dependencies while still being available during network partitions.

The key difference is really in the structure of the timeline.

Linearizability demands a total order.

A single, unambiguous timeline where every single operation can be compared to every other.

Causality, though, it only defines a partial order.

Concurrent operations, the ones that happen without knowledge of one another, are just incomparable.

The timeline can branch and merge.

It's a lot like concurrent branches in a version control system like Git.

And to track that partial order, we need a way to manage sequence number ordering.

We often use logical clocks for this, not physical time of day clocks, which we know are unreliable.

Lamport timestamps are the classic method here.

So a Lamport timestamp is just a simple pair counter and a node ID.

To make sure the total order is consistent with causality, every node and client includes the maximum counter value they've ever seen on every request they send.

The simple mechanism ensures that if operation A caused operation B, B will always have a higher timestamp.

So we get a total order, which is great.

But here's the catch.

And this leads us to our next big requirement.

Lamport's total order is retrospective, not prescriptive.

Imagine two users concurrently trying to register the same unique username.

They both get timestamps.

A node receiving one of those requests can't immediately decide if it succeeded or failed.

Why not?

Because it has no idea what other concurrent requests with a lower timestamp might be delayed somewhere out in the network.

The final true total order only emerges after you've collected all the operations.

You can't just stall transaction waiting for the entire uncertain network.

So this raises a really important question.

If our logical clock only gives us a history of events, we need something that imposes the order upfront in real time so we can make safe decisions.

How do we finalize that order in a fault -tolerant way?

The answer is total order broadcast.

Total order broadcast, or you might hear it called atomic broadcast, is a protocol that forces two critical safety properties.

First, reliable delivery, so no messages are lost.

And more importantly, totally ordered delivery.

Every single message or operation is delivered to every single node in the exact same sequence.

This is how you define a single fixed log, which is absolutely essential for implementing state machine replication, the fault -tolerant way of running a single leader database.

And this concept is incredibly powerful because total order broadcast is mathematically equivalent to solving linearizable storage.

If you use this fixed agreed upon log order, you can implement that compare and set operation you need for things like uniqueness guarantees.

The first operation to appear in the log wins, that's it.

It doesn't matter when it was submitted.

Now, achieving this kind of agreement often involves something we've seen before historically,

distributed transactions and atomic commit.

The atomic commit problem is simple to state.

Ensure a transaction that spans multiple services or nodes either commits everywhere or aborts everywhere.

It maintains ACD atomicity.

And the most common historical attempt to solve this is the two -phase commit algorithm, or 2PC.

It uses a single coordinator node.

Phase one is the prepare phase.

The coordinator asks all the participants if they can commit.

This is the point of no return.

If a participant says yes, they have to guarantee they can commit later, freezing whatever resources they need.

Phase two is the commit or abort phase, where the coordinator logs its decision and then tells the participants to make the changes permanent.

The fatal flaw of 2PC and the reason modern systems avoid it is coordinator failure.

If the coordinator crashes after the participants have voted yes, but before it sends the final commit instruction, the participants are just stuck.

They're stuck forever in an in doubt or uncertain state.

They can't unilaterally abort because the coordinator might've decided to commit right before it died.

And they can't unilaterally commit because it might've decided to abort.

The operational nightmare here is that the transaction is just blocked.

Those resources, the rows, the locks, they're frozen indefinitely.

You have to wait for the coordinator to recover and read its log to figure out what the decision was.

This makes 2PC a blocking protocol.

It makes the whole system incredibly brittle and unavailable in the face of failure.

And that operational failure is exactly why we move on to the final necessary hurdle,

fault tolerant consensus.

This is the formal solution to the agreement problem.

Consensus requires nodes to agree on one proposed value and it has to satisfy four properties, uniform agreement, integrity, validity, and crucially, termination.

That's the guarantee that the process eventually finishes even if nodes fail.

We should probably acknowledge the famous theoretical speed bump here, the FLP impossibility result, which states that no consensus algorithm can guarantee termination in a truly asynchronous system if even one node might crash.

But practical systems like Raft, Paxos, and Zab get around this, so.

They get around the asynchronous model by using things like timeouts and failure detectors.

They don't guarantee that consensus will always terminate, but they do guarantee that if the system is stable for long enough, it will terminate.

And these algorithms, they directly implement total order broadcast.

They agree on a whole sequence of values, not just a single one.

So we've arrived at the paradox we started with.

Single leader replication needs consensus to handle failover, but the consensus algorithms themselves seem to rely on electing a leader.

How do you break that paradox?

Needing a leader to elect a leader in a fault -tolerant way.

The core technique is epoch numbering, combined with overlapping quorums.

Epoch numbers provide a total order for leadership disputes.

The higher number always wins.

And crucially, a leader has to get votes from a majority quorum, not just to get elected, but also before deciding on any proposal.

And why must those quorums overlap?

What safety property are we protecting with that?

The overlap ensures that if a leader successfully commits a value in one epoch, when the next leader tries to get a quorum in the next epoch, their quorum must contain at least one node that was part of the previous leader's quorum.

And that overlap guarantees that the new leader will see all the committed values from the past, ensuring they can never make a conflicting decision.

It protects safety across leadership changes.

This complexity is exactly why we have coordination services, things like ZooKeeper and et cetera.

They aren't general databases.

They are specialized services that run these complex consensus protocols on a small, fixed cluster.

They provide the fault -tolerant building blocks you need, things like linearizable atomic operations, LOX, total ordering via sequence numbers, often called fencing tokens, and reliable failure detection.

And this whole journey leads to a really profound equivalence that researchers discovered.

Linearizable comparison set registers, atomic transaction commit, and total order broadcast are all mathematically equivalent to solving the consensus problem.

If you can solve one of them fault -tolerantly, you can solve them all.

It's the fundamental equation that ensures safety and consistency in any complex distributed system.

So to quickly recap this whole deep dive, we moved from the loose guarantee of eventual consistency all the way to the strong recency requirement of linearizability.

We realized linearizability demands a total order of operations, and that the fault -tolerant necessary way to achieve that total order and enforce critical constraints is through the very difficult problem of consensus.

Yeah, and the final takeaway is that the consensus problem is really inescapable.

Even a database that gives you linearizability with a single leader is only, you know, kicking the can down the road.

It's running consensus less frequently only during a leader failover.

But at the end of the day, consensus algorithms are what bring concrete mathematical certainty to the unpredictable, uncertain world of distributed systems.

Thank you for joining us for this deep dive into the guts of distributed systems theory.

And here is a final provocative thought for you to chew on.

Given that linearizability is fundamentally slow because its performance is basically held hostage by network delay uncertainty,

and given that the slightly weaker but highly available guarantee of causal consistency is often enough for many applications,

imagine the possibilities if our data systems truly embraced partial ordering and concurrency natively rather than constantly fighting the fundamental physics of the network just to impose a total order that is in a geographically distributed system,

often completely artificial.

We'll see you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Establishing reliable agreement mechanisms across distributed systems requires understanding how nodes can coordinate despite network failures, latency, and machine crashes. Consistency models form the theoretical backbone of this challenge, ranging from eventual consistency, which guarantees that replicas will synchronize once writes pause but offers no timing guarantees, to linearizability, a stringent requirement that presents data as though managed by a single infallible system where all operations occur in a well-defined sequence. Linearizability proves indispensable for critical coordination tasks including preventing split brain scenarios during leader elections, managing exclusive access through distributed locks, and enforcing constraints such as unique usernames where duplicate values cannot coexist. The cost of linearizability emerges clearly through the CAP theorem: systems cannot simultaneously maintain consistency, availability, and partition tolerance, forcing practitioners to choose between strong consistency and continued operation during network splits. Causal consistency occupies middle ground by preserving cause-and-effect ordering between related operations while avoiding linearizability's performance penalties and remaining available despite partitions. Ordering mechanisms including Lamport timestamps create total sequences compatible with causal relationships but cannot independently validate uniqueness constraints in real time. Total order broadcast solves this by guaranteeing all nodes receive identical message sequences in fixed order, and this mechanism proves mathematically equivalent to both linearizable atomic operations and the core consensus problem. Distributed transactions introduce additional complexity through the two-phase commit protocol, where a coordinator orchestrates agreement through a prepare phase where participants promise commitment, followed by a definitive commit or abort decision. Two-phase commit suffers from a critical vulnerability: coordinator failures leave participants blocked indefinitely in uncertain states, unable to release resources or proceed, violating the consensus requirement that algorithms must eventually terminate. Production-grade consensus algorithms such as Paxos, Raft, and Zab address this limitation through epoch numbering and strict majority quorum mechanisms, ensuring systems can reach binding agreement even when components fail. These robust algorithms underpin coordination services like ZooKeeper and etcd, which provide the dependable distributed infrastructure modern applications require.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 9: Consistency Models & Consensus Algorithms

Related Chapters