Chapter 14: Consensus
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome back to the Deep Dive Learner.
Today, we are tackling a topic that often sits right at the intersection of some really high theory and hard -nosed engineering distributed consensus.
It really is.
I mean, this is hands down what people call the pinnacle of distributed systems research.
And that phrase is absolutely earned.
It is.
Consensus is like the bedrock.
This isn't just some academic theory you draw on a whiteboard.
It's the fundamental guarantee.
It's what allows these unbelievably complex geographically distributed systems, the very backbone of the modern internet, to work reliably and, most importantly, consistently.
Even when parts are failing all the time.
Exactly.
Even when components are constantly crashing.
I mean, think about systems you probably use or have heard of like Apache Zookeeper, et cetera, or Consul.
At their core, they're fundamentally consensus machines.
So our mission today is really to unlock that knowledge, the secret sauce that ensures when you write data to a system spread across maybe a dozen, even hundreds of machines, every single participant agrees on exactly what happened.
And crucially, in what order?
Yeah, because if they didn't agree, your configuration store would start returning conflicting values and your database would just rapidly fall into this chaotic mess of conflicting facts.
So we have to start with the big question.
Why is this so hard?
Right.
And we have to establish why getting everyone to agree is so astonishingly difficult.
When you're in a distributed world, you face these inherent challenges.
And the most famous one is probably the FLP impossibility theorem.
Yes, the fairest impossibility result.
OK, let's unpack this one properly because it really defines the boundaries of what we can and I guess what we cannot guarantee.
The FLP theorem, it's named after Fisher, Lynch and Patterson.
It essentially lays out this crushing constraint.
It says that in a completely asynchronous system.
And asynchronous just means we can't make any assumptions about how long messages take to get from A to B, right?
Exactly.
No guarantees on message delivery times and nodes might crash at any moment.
In that world, it is mathematically impossible to guarantee consensus in a bounded amount of time, even if only one single process crashes.
That sounds like a complete dead end.
I mean, if we can't guarantee it, how do we build these multibillion dollar reliable systems like Google Spanner?
Are we just, I don't know, ignoring the theorem?
We're not ignoring it.
We're working around it by changing the assumptions.
OK.
See, since a process can't reliably tell the difference between a peer that has crashed and a peer that's just being really, really slow, we have to use these external mechanisms, things like failure detectors.
But the source text points out that this reliance forces a critical trade off.
We have to guarantee safety.
Meaning the system must never do the wrong thing.
Never enter an inconsistent state,
but we can only strive for liveness.
So liveness just means the system will eventually make a decision.
OK, so what happens if the failure detector gets it wrong?
Like if it thinks a node is faulty, but it's just slow?
How does favoring safety play out there?
It means the system might just stop.
It might stall.
If a failure detector wrongly suspects the leader is down, the system might trigger a whole new election.
And that pauses all progress until a new leader is sorted out.
Which guarantees safety because no conflicted decisions are being made in the chaos.
Exactly.
But it kills liveness, or at least delays it unnecessarily.
So you end up trading quick deterministic progress for absolute unwavering correctness.
And really the complexity of all modern consensus protocols flows right from that one concession.
That context is just so important.
We compromise on speed to make sure we're always correct.
So before we jump into the algorithms themselves, let's just lay out the three theoretical properties that any real consensus algorithm has to satisfy.
These are sort of the goalposts we're aiming for.
Exactly.
First up is agreement.
This is the core of it all.
The decided value must be the same for all the correct non -failed processes.
Right.
If half the cluster thinks my bank balance is $100 and the other half thinks it's $50, you have failed agreement spectacularly.
You have failed.
It's about unanimity among the functioning parts of the system.
Okay.
Second is validity.
Now this one's a bit more subtle.
Why is it so important to prevent the system from just agreeing on some, you know, arbitrary default value?
Well, think about it.
If the system could just agree on zero every single time there was a conflict, it would actually satisfy agreement and termination.
But it would be totally useless.
It would be useless.
Validity ensures that the decided value must have actually been proposed by one of the participants.
It connects the consensus outcome back to a legitimate input from the real world.
And finally, termination.
Termination just means that all the correct processes have to eventually reach a decision.
You can't have an algorithm that just locks up the system and waits forever for a crash process to come back.
I mean, it would satisfy agreement and validity by doing nothing, but it's not usable.
In the real world, this decision has to happen, you know, reasonably quickly.
Okay.
Let's start at the very bottom layer then.
Communication.
Before we can agree on anything, we all have to be able to talk to each other reliably.
And the simplest tool for that is a broadcast.
Right.
And the most basic level of that is the best effort broadcast.
It's basically fire and forget.
The sender is 100 % responsible for getting the message to all its targets.
So if the sender crashes halfway through.
Then the message just fails silently for anyone who hasn't gotten it yet.
No one else steps in to help.
Which, if you're trying to write data to replicas, sounds like a recipe for instant inconsistency.
Oh, it is.
And that's precisely why we need something better.
The reliable broadcast.
This primitive guarantees that all the correct processes receive the same set of messages, even if the original sender crashes midstream.
The goal is message atomicity.
Either everyone gets it or no one does, among the non -failed members.
So how do you achieve that without a central coordinator?
The source material mentions a naive way of doing it, using flooding.
Yeah, the basic idea is you rely on the other nodes to pick up the slack.
So if process P1 starts a broadcast and then crashes, the other processes see that it failed and they continue broadcasting on its behalf.
If process P2 gets the message, it just forwards it to every other process it knows about.
And we can see that in figure 14 .1.
Broadcast P1 starts, sends to P2, P3, and P4, and then they all just keep rebroadcasting it to each other.
They do.
But the downside of this flooding thing is the message complexity.
It's the scalability killer.
It absolutely is.
This naive approach results in what we call N222 message complexity, where N is the number of recipients.
For a system with dozens, or god forbid hundreds of nodes, this becomes unbearably chatty.
It just wastes a ton of network bandwidth.
It's reliable, sure, but it's super inefficient.
Okay, so reliable broadcast ensures everyone agrees on the set of messages.
But not the order.
And that brings us to the next, much more critical layer of complexity.
Atomic broadcast, or what's often called total order multicast.
This is the huge conceptual leap.
This is what connects communication to state consistency.
Reliable broadcast makes sure everyone gets the same pile of mail.
Atomic broadcast makes sure everyone opens that mail in the exact same sequence.
And that's everything.
It's everything.
It guarantees two things.
Atomicity all non -failed processes deliver the message, or none do.
And order all non -failed processes deliver the messages in the exact same sequence.
And the connection is key.
If every single replica applies the same sequence of commands to its local state machine, every replica is going to arrive at the exact same final state.
Exactly.
And this is why the source points out this critical equivalence.
Atomic broadcast is equivalent to solving consensus in an asynchronous system with crash failures.
If you can solve total order message delivery, you've solved the core problem of agreeing on a sequence of state changes.
Now before we get into the big consensus protocols, the source talks about another framework called virtual synchrony.
It's designed for systems where the members are constantly changing.
Right.
Virtual synchrony is a refinement that's specifically for group communication in dynamic groups.
Unlike atomic broadcast, which often assumes a static set of participants, this is for when nodes are joining and leaving all the time.
It organizes processes into these cohesive units called group views.
And a group view is what, exactly?
Like a list of who's currently in the club.
That's a perfect way to put it.
A view change is when someone new joins, or someone drops off.
Or a node is detected to have failed.
And before any new conversation, any new data exchange can happen, everyone has to first acknowledge and agree on the new participant list.
The new group view.
So how does that enforce atomic delivery?
The rules is very strict.
A message sent in one view can only be delivered within that same view.
So virtual synchrony makes this distinction between a message being received and it actually being delivered to the application.
If the group view changes before all the non -failed processes have received a message, that pending message is barred from delivery in the new view.
So it's either discarded or has to be recovered somehow.
This ensures that no messages can cross that inconsistency boundary that's created by a membership change.
It guarantees atomicity relative to the state of the group.
That's a clever way to handle churn.
But the source does note that it hasn't seen huge commercial adoption.
Why do engineers usually opt for something else?
It often just boils down to performance and how complex it is to implement.
The industry has generally favored what are called sequencer -based algorithms, like Zary B, which we're about to talk about, or multipaxos.
These use a single powerful coordinator to decide the total order.
They tend to be easier to build and just offer better performance, lower latency, because you get to bypass all that complicated coordination you need for every single view change.
The trade -off being you are betting big on your failure detector to notice when that leader goes down and replace it fast.
Exactly.
You're all in on that leader.
That optimization leads us perfectly into ZooKeeper.
So let's look at how Zary B, the ZooKeeper Atomic Broadcast Protocol, takes that sequencer idea and builds a really high -performance system around it.
Right.
Zary B's core engineering insight is basically trading those continuous consensus checks for sheer throughput.
And it does that by banking heavily on a stable, long -lived leader.
It's a simple two -role system.
You have a leader and you have a follower.
The leader is the temporary but stable driver of the whole process.
It establishes the event order, broadcasts proposals, and clients can connect to any node.
But if it's not the leader, they just get forwarded, right?
Yep.
And to guarantee that the leader is unique and to keep things safe across restarts, ZAB introduces the concept of epochs.
Monotonically increasing numbers.
Exactly.
The protocol timeline is split into these uniquely numbered
monotonically increasing epochs.
And this provides that necessary global ordering.
There can only be one leader during any given epoch.
When a new leader wants to emerge, maybe after a network partition heals or the old leader crashes, it has to go through the ZAB protocol in three distinct phases to win the right to propose.
Okay.
Let's walk through those three phases.
We can reference Figure 14 -2 ZAB protocol summary here.
Let's focus on the safety guarantees of each one.
Okay.
So phase one is discovery.
A prospective leader first has to figure out the state of the system.
It learns the latest epoch and the latest transaction ID known by every other process.
And then, crucially, it proposes a new epoch number, which has to be higher than any current epoch known by any follower.
And why is proposing a new higher epoch number a safety guarantee?
What does that do?
It acts like a strict barrier.
By using that monotonically increasing number, ZAB ensures that after this step, no process in the system will accept proposals for any earlier epochs.
It immediately prevents old, stale leaders or zombie proposals from coming back and corrupting the system's state.
Okay.
So that sets the boundaries.
Then we move into phase two, synchronization.
And this seems absolutely critical for consistency and recovery.
It is.
Synchronization has two main goals.
Formally establish the new leader and make sure history is shared.
The prospective leader sends out a message proposing itself for the new epoch and it collects acknowledgments or ACKs.
Once it gets a quorum of ACKs, the leader is formally established for that epoch.
And the recovery part of that, how does it heal the state?
Well, the new leader has to make sure that all the followers have the exact same consistent history before it starts taking new client requests.
And it does this by delivering any committed proposals from earlier epochs first.
It brings any lagging followers completely up to speed.
Only when every follower in the quorum has applied all the known committed proposals can the system move into the active broadcast phase.
That state healing is the really heavy lifting of ZAB recovery.
And finally, phase three, broadcast.
This is the active messaging phase.
Right.
Once sync is done, the leader is stable.
It gets client messages, assigns them a sequence number, and broadcasts the proposal.
It waits for a quorum of followers at ACK the receipt and only then does the leader send out a final commit message.
The source notes this is kind of like a two -phase commit, but without the ability for a client to abort.
The key advantage here seems to be just raw speed.
Absolutely.
The efficiency is enormous.
Once that leader is established, the broadcast phase only takes two rounds of messages, propose and commit for each transaction.
ZAB is basically betting on that long -lived leader to sequence events using his local view.
So it gets to skip these continuous, really expensive consensus rounds for every single operation.
So ZAB is a very fast way to do atomic broadcast once you have a stable leader, but like multi -Paxos, it really depends on that leader staying up.
Now let's move to the algorithm that basically defined this entire field.
Okay.
We're moving on to the foundational general consensus algorithm that inspired ZAB and so many others, Paxos.
This was introduced by Leslie Lamport way back in 1990 in a paper called The Part -Time Parliament.
Yeah.
And it's famous not just for being complex, but because Lamport originally wrote it up as this allegorical tale about the government on the ancient Greek island of Paxos.
And that led to like a decade of reviewers rejecting it because they thought it was too weird or too academic.
It wasn't until his simplified paper, Paxos Made Simple, came out years later that the industry finally started to get its head around its brilliance.
That history definitely speaks to its complexity, but the core mechanism is actually pretty elegant.
It solves the single -decreed consensus problem, just agreeing on one value using three core roles.
Okay.
So first we have proposers.
Proposers get values from clients, create proposals, and then they try to collect votes.
Then you have acceptors, who are the ones who vote to accept or reject those proposals.
And crucially,
acceptance requires a quorum, which is just a majority of acceptors.
And finally, the learners.
Learners are the passive ones.
They're like the replicas that just store the agreed -upon outcomes.
It's really important to note that in almost every real -world implementation, a single physical process is going to play all three of these roles at the same time.
The central mechanism that enforces order is the proposal number.
Yes.
Every proposal has a value and a unique monotonically increasing proposal number.
This number is like the global timestamp for decision -making.
If two nodes propose at the same time, the one with the higher number gets precedence.
So let's detail the two phases of classic Paxos.
We're looking at figure 14 -3, Paxos algorithm, normal execution.
Okay, so phase one is the proposed phase.
We're sometimes called the voting phase.
The proposer, who wants to lead this round and commit a value, sends a prepareN message, where N is its unique proposal number, to a majority of acceptors.
It's essentially asking, hey, can I lead this round with number N?
And the acceptors have to respond really carefully to preserve safety.
The rules they follow are the real genius of Paxos, I think.
Let's walk through those four core invariants.
Okay, the first invariant is like a lockout rule.
If an acceptor hasn't yet responded to a prepare request with a higher sequence number, it makes a promise.
It promises it will not accept any proposal with a lower sequence number.
This makes sure that old, maybe failed proposals, can't just sneak back into the system later and overwrite a newer decision.
Okay, and the second invariant is about history.
If the acceptor has already accepted a value in a past round, it has to report that back.
Right.
It responds with a promise, accepted, where it reports the highest previously accepted value it's ever seen.
The third rule deals with competition.
If the acceptor has already promised a higher numbered proposal, it tells the current proposer about it, which basically forces the current proposer to back off and try again with an even higher number.
And the fourth is just flexibility.
An acceptor can respond to multiple prepare requests as long as the later one always has a higher number.
So the proposer collects a majority of these promise responses.
This establishes its temporary leadership for this one round.
And now we get to phase two, replication.
The proposer sends an accept message to that quorum.
Now the value V that it proposes is chosen very carefully based on the responses from phase one.
If any of the acceptors reported a previously accepted value, the proposer must choose the value that was associated with the highest numbered proposal it heard about.
This is the key safety rule.
It is.
It ensures that a value that was already decided upon is never forgotten.
If no old proposals were reported, then the proposer is free to use its own client's value.
So the rule is, if a value has already been chosen by any majority in the past, a new proposer has to adopt that value to maintain consensus.
If not, it can introduce a new one.
Exactly.
The acceptor accepts the proposal N as long as it hasn't already made a promise to ignore it.
Once it accepts, it immediately tells the learners to increase replication and let the client know the result as fast as possible.
And this all hinges on the power of quorums.
Why does just requiring a majority guarantee absolute safety, no matter what fails?
Because of the intersection principle.
Once a value is accepted by a majority of participants, any future attempt to propose a new value has to, by definition, consult a majority of participants.
And any two majorities must intersect at at least one participant.
So that single intersecting node is basically the system's memory.
It's the critical witness that guarantees any new proposal knows what the old proposal agreed on.
Precisely.
That intersecting node, because it's following its invariance, will report the previously accepted value, if there was one.
This forces the new proposer to adopt that old value, guaranteeing safety across failures and rounds.
It's math, not luck.
But liveness still needs the $2 plus a level rule.
To tolerate failures, you need $2 plus wallet total processes just to make sure a functioning majority of $5 plus level one can always proceed.
PAXOS guarantees safety even when things go completely sideways.
So let's walk through the three key failure scenarios, the source details, to see that intersecting quorum in action.
Okay, let's take failure scenario one, which is in figure 14 -4.
A proposer, P1, successfully finishes the proposed phase, but then it fails after sending its value, V1, to only one acceptor, A1, in the accept phase.
Okay, so P1's down, only A1 knows about V1, and V1 is not committed because it only got to one acceptor.
Right.
So a new proposer, P2, starts round two with a higher proposal number,
P2 sends out a prepare to and gets a quorum of responses.
Let's say that quorum is A1 and A2.
Critically, A1, our single witness, reports back that it already accepted V1.
And P2 has to follow the rules.
It has to.
So P2 must choose V1 for its value, and then it successfully commits V1 to the whole quorum.
The system reaches consensus on the old value, V1, even though the original proposer failed.
No data loss.
Okay, now failure scenario two, from figure 14 -5, is a little more subtle.
It shows how a new proposer can commit a new value without breaking safety.
Right.
So in this one, P1, again, fails after sending V1 only to A1.
P2 starts round two.
But this time, P2's quorum is made up of A2 and A3.
And crucially, that quorum does not overlap with A1.
Oh, okay.
Since A2 and A3 haven't accepted any prior value, P2 is totally free to commit its own value, V2, with its accept message.
But how is that safe?
A1 is still sitting over there holding V1.
Won't that cause a conflict later on?
It's safe because of the proposal number.
The unique proposal number two, assigned to V2, is higher than P1's, which was one.
So any future round, say round three, that successfully gathers a quorum, has to include at least one of A2 or A3.
And they now know about V2.
Which is the highest numbered value.
This ensures that any subsequent proposer will be forced to adopt V2, not the older V1.
The proposal numbers and the intersecting quorum for future rounds work together to keep it safe.
And the trickiest one.
Failure scenario three, from figure 14 to 6.
This is where the acceptor A1, which accepted V1, fails right after.
Yep.
So if P1 fails, A1 accepts V1, and then A1 itself fails, P2 can start round two and commit V2.
Because this quorum, A2 and A3, again, doesn't overlap with the now -failed A1.
This is safe because V1 was never actually committed by a majority.
It was only accepted by one node, which then disappeared.
Exactly.
So consensus focuses on the highest numbered proposal that achieved a quorum.
If P2 successfully commits V2 with proposal number two, then any future proposer checking history will find that V2 is the highest committed value.
The older V1, proposal number one, is effectively forgotten.
Okay, but even Paxos has a vulnerability to just, you know, bad timing.
What happens if two or more proposers are constantly competing, sending prepare messages back and forth, and stepping on each other's toes?
That's the competition problem.
It's a classic liveness failure.
Proposer A starts a round with proposal number five.
Proposer B sees this and immediately starts its own round with number six.
AC6 retries with seven, B retries with eight.
They just keep interrupting each other, and neither one can ever collect the stable majority to get to phase two.
So the system is just stuck in this endless loop of failed elections.
Right, and the fix here is really just practical engineering over pure theory.
You incorporate random back -off.
This just ensures that eventually one proposer will sleep for long enough that the other one can successfully win a prepare phase and go on to commit.
It's a mechanism to ensure probabilistic liveness, because we already know deterministic liveness is impossible.
Is that a serious concession, though?
Giving up on deterministic liveness guarantees for an algorithm that's all about absolute safety.
It is a pragmatic concession, for sure.
But since we already know from FLP that strip liveness is impossible, we use randomization to guarantee liveness probabilistically.
It's an essential bridge between the theoretical ideal and a system that actually works in the real world.
Now, running those two phases, propose and accept for every single decision, is incredibly slow and chatty for a high -throughput system.
Which brings us to multipaxos.
Multipaxos is the necessary efficiency boost.
It introduces the idea of a distinguished proposer or a leader.
Once the leader is established, they can skip that expensive propose phase for all the subsequent values and just move straight to replication, only sending the accept messages.
So it transforms paxos from deciding on a single value to deciding on a sequence of values, like an append -only log.
Exactly, which dramatically increases throughput.
But it creates a new, dangerous problem.
How do you make sure that a non -leader process doesn't serve up stale data?
What if a new leader was secretly elected and the old one doesn't even know it's been demoted?
That old leader might provide a value based on its old, incomplete view of the log.
Right, so to guarantee linearizability, which is the strongest consistency model, without having to run a full consensus quorum check on every single read,
multipaxos implementations often use leases.
So a lease is like a time window where the leader has proven its authority.
That's a great way to put it.
For a certain amount of time, the leader has authority, and followers agree not to vote for anyone else.
This lets the leader serve fast reads without a full quorum check.
It just has to periodically contact the other nodes to renew its lease.
But the safety of that optimization depends completely on an assumption about clock synchronization.
Yes, it relies on an assumption of bounded clock synchrony.
If the clocks drift too much, the safety of the lease can be compromised.
If a leader's clock is slow and it thinks its lease is still valid when it's actually expired on all the other nodes, a new leader could be elected.
Then you have two active leaders and total inconsistency.
Okay, finally, let's touch on flexible Paxos.
This variant gets into the idea that we don't necessarily need a strict majority for every step of the algorithm.
This is a deep engineering insight.
It's all about tuning performance.
Flexible Paxos defines the core safety condition as an inequality.
Q $ plus Q $2.
Okay, so N is the total number of nodes, Q $1 is the size of the proposed quorum for leader election, and Q $2 is the size of the accept quorum for replication.
Why does that inequality guarantee safety?
Because if the sum of the proposed and accept quorums is greater than the total number of nodes, it mathematically guarantees that the set of nodes that elected the leader, Q $1, must overlap with the set of nodes that accepted the proposal, 2 $2.
And that intersecting node is the witness we need to enforce safety.
Which lets designers choose their trade -off.
Exactly.
You can trade availability for latency.
The replication phase, two two -node atolls, is the most common operation.
So you could reduce its size to get faster rights, lower latency.
But you have to then increase the size of Q $1 during the infrequent leader election to maintain that necessary overlap.
So in a five -node cluster, you could have a replication quorum of just two, but then your leader election quorum would have to be four.
Precisely.
$4 plus two two -toler five.
You've maintained safety.
Paxos gave us the rules, but all these variants show how engineers are constantly pushing the boundaries for speed and throughput.
Let's start with Fast Paxos, which is designed for just raw latency reduction.
Fast Paxos is super aggressive.
Its goal is to execute a command in two communication steps instead of the usual three or four.
And it does this by letting clients, or really any proposer, contact acceptors directly, trying to bypass that centralized leader bottleneck entirely.
What's the cost for skipping the coordinator?
There has to be one.
There is.
And it's a massive increase in redundancy.
To maintain safety, the requirement to tolerate five failures goes from needing $5 plus dollar acceptors to needing $2 plus plus a lot of acceptors for a fast quorum and a total cluster size of $3 plus and $1 out.
You just need a lot more machines to get the same level of fault tolerance.
And the mechanism here involves classic versus fast rounds, which we can see in Figure 14 -7.
Right.
In a fast round, the coordinator can issue this special any message.
And that message basically authorizes acceptors to independently decide on values they receive from any proposer.
This parallel approach can speed things up immensely because you aren't waiting for the coordinator to sequence the value.
But the huge danger is the collision problem.
If multiple clients are submitting different values at the same time to different acceptors, you could have chaos.
You do.
If multiple proposers try to use that fast path at the same time, acceptors might decide on conflicting values locally.
When the coordinator sees this conflict, because the responses it gets back are inconsistent, it must trigger a recovery round, which is a full, slow, expensive classic Paxos round.
So if you have a lot of conflict, fast Paxos is actually slower than multi -Paxos.
Much slower.
This optimization is really only viable for environments with very low contention.
It's a classic distributed systems trade -off.
Optimized for the common case, but pay a huge penalty in the failure case.
OK.
Let's pivot to egalitarian Paxos or ePaxos.
This tries to solve the leader bottleneck without falling into the chattiness of fast Paxos.
ePaxos is fundamentally different.
It tries to avoid the leader bottleneck by just abandoning the idea of total order for all commands.
Instead, it maintains a partial order by establishing dependencies between commands.
So commands A and B only need to be ordered if their execution order actually matters, like if they touch the same piece of data.
Exactly.
If two clients were updating totally different keys, they can run independently and at the same time, which vastly improves throughput.
The way it works is that every process can become a leader for its own proposal.
It sends out a pre -accept phase message, which includes the command, a list of its estimated dependencies.
So all commands that might interfere but aren't committed yet, and the sequence number.
This goes out to a fast quorum, which is usually CL3 -4 CL replicas.
And then it figures out if it can take the fast path or if it has to take the slow one.
And we can see both of those in figure 14 -8.
The fast path, which is the P1 run in the figure, happens if the fast quorum responds and they all agree on the dependency list.
If they match, the leader commits immediately, no second round.
The slow path, the P5 run, happens if dependencies conflict.
The leader then has to update its dependency list, combining everything it heard, and get a larger secondary quorum to confirm before it can commit.
The real genius seems to be in the execution phase.
It is.
Replicas don't execute based on a single global index.
They build a dependency graph and they execute commands in reverse dependency order.
So before you execute command X, all of its dependencies have to be executed first.
For workloads with low interference, this allows for just incredible parallel execution and high availability, because consensus is happening simultaneously across multiple non -interfering leaders.
The complexity of all this, though, it really highlights how hard it is to reason about Paxos, which brings us to this other theoretical approach mentioned in the source, the generalized solution to consensus.
Yeah, this is a newer theoretical take that reframes Paxos entirely.
It shifts the focus away from all these complex actor interactions, proposer, acceptor, learner, and towards a much simpler idea.
Write once registers on the servers.
So instead of thinking about who is promising what to whom, we just think about the immutable state that's stored in these registers.
Exactly.
Each register can only be in one of three simple states.
It's unwritten, it contains a specific value, or it contains nil.
And sets of these registers form quorums that can be in states like any, maybe V, or decided V.
And decided V means the quorum has reached consensus on the value dollar.
Right.
And the algorithm uses these registers to get to consensus in two phases, as you see in Figure 14 -9.
The client drives it, Phase 1 checks if it's safe to write.
The client reads a majority of registers.
If it finds a value that was already written, it has to adopt that value.
If they're all unwritten, it can choose its own.
Which mirrors the Paxos proposed phase, where a proposer has to collect past accepted values.
Exactly.
And Phase 2 commits the value.
The client just sends a message to all the servers with the value it chose in Phase 1.
If a majority respond, the decision is made.
It just reframes the complex rules of Paxos into these simple immutable state checks.
It really simplifies the reasoning by abstracting away the communication and focusing purely on state stability.
For a decade, Paxos was the consensus algorithm.
But its complexity scared a lot of people off.
That changed dramatically around 2013 with the introduction of Raft.
And Raft had one single overriding goal.
Understandability.
Yeah.
Raft's success comes from its designers realizing that complex systems often fail because of implementation bugs.
They just wanted a simpler, more intuitive algorithm that engineers could confidently build and reason about.
It takes a lot of cues from multipaxos and ZAB by making the leader concept totally central and making the log the single source of truth.
And Raft simplifies the terminology so much compared to Paxos.
There are just three very clearly defined roles.
You have followers who are just passive recipients of data.
Kind of like Paxos acceptors and learners.
Then you have the leader, the single elected cluster leader for a specific term, who handles all client requests and all log replication.
And the third role is temporary.
Candidate.
Right.
Any follower can become a candidate to try and collect a majority of votes and become the next leader.
Every process starts out as a follower.
And instead of epochs or proposal numbers, Raft uses terms.
Terms are Raft's way of getting a global sense of time and order.
They're just monotonically increasing numbers.
And during any given term, there's only one leader.
If a node ever sees a message with a higher term number than its own, it immediately updates its term and reverts to being a follower.
It enforces a very clear hierarchy.
The election process is critical.
And we see it in Figure 1410.
How does an election even start?
An election is triggered by silence.
A follower starts an election if it hasn't received a heartbeat from the leader within its election timeout.
It just switches to the candidate state,
increments its term, votes for itself, and sends a request vote message to all the other nodes.
And that message contains two vital pieces of information.
The candidate's term and the ID of its last log entry.
Why is that log entry ID so important during the vote?
This is the core safety requirement of Raft.
To win an election, a candidate must have the most up -to -date log.
And the followers enforce this rule.
They will deny a vote if the candidate's log is less up -to -date, meaning it has a lower term ID or just a shorter log sequence.
This guarantee prevents a leader from being elected whose history is incomplete, which could cause it to accidentally overwrite already committed values.
What's the mechanism that prevents two candidates from just tying in the vote over and over again?
Raft uses randomized timers for that election timeout.
Each follower has a randomly chosen, slightly different timeout duration.
So if a vote splits evenly, one candidate is bound to timeout a little bit earlier, start the next term, and get a majority while the others are still waiting.
It resolves the contention probabilistically.
Once elected, the leader is then completely responsible for log replication.
And the flow is strictly one way, right?
Leader to follower.
Yep.
The leader appends client requests to its own log and then sends them in parallel append entries messages to all the followers.
And these messages are vital for consistency because they include the index and term of the log entry that came immediately before the new one.
And that check of the preceding entry is the core safety mechanism that lets Raft manage log inconsistencies so efficiently.
It is.
If a follower finds that the preceding index and term sent by the leader don't match its own records, it just rejects the new entry.
This consistency check ensures that if two log entries on different replicas have the same term and index, they store the exact same command.
And all the entries before them are identical.
If they don't match, the leader forces the follower to roll back its log until they're consistent, overwriting the follower's bad history.
And when is an entry actually considered committed?
We can see this in Figure 1411 and the same machine in Figure 1412.
The key is that only the leader can commit.
Once the leader gets ACKs confirming that a log entry has been replicated to a majority of followers, that entry is marked as committed in the leader's log and applied to its state machine.
The leader then tells the followers about the commit decision in subsequent messages.
And committing an entry also implicitly commits all the entries that came before it in the log.
So Raft's core safety guarantee is all about log immutability and these strict leadership rules.
Right.
Raft guarantees that committed log entries are immutable.
They cannot be reverted or changed.
They're guaranteed to be present in the logs of all subsequent leaders because of that vote safety check.
The leader can't remove or reorder its log.
It only appends.
And this focus on log management and clear rolls is exactly why Raft is so much easier to reason about than Paxos, which led to its massive adoption in systems like CockroachDB, etc.
and Console.
Okay, so up until now, we've assumed what are called non -Byzantine failures or crash faults.
Nodes might crash, they might be slow, they might get partitioned, but they don't actively lie.
They don't forge messages or behave maliciously.
But the real world is messy.
So what happens when you just can't trust the participants?
What if some nodes are actively adversarial, trying to break the system or steal data?
This is the domain of Byzantine failures.
And this just changes the whole computational landscape.
To solve consensus when you have malicious behavior, you need extensive cross -validation and cryptographic proof for every single step.
The overhead just skyrockets.
Most Byzantine fault -tolerant algorithms need $10 two messages per step because every node has to check every other node's behavior against the verifiable majority response.
And the replication requirement is much stricter than the $2 plus dollars we needed for crash faults.
What's the mathematical necessity there?
To sustain FIG dollar -compromised nodes, you need a minimum of $1 plus 3 AF plus 11 total nodes.
And the reason is simple, but it's really powerful.
We have to guarantee that even if $3 nodes are faulty -so, actively malicious, and another dollars are just slow or partitioned and not responding, there are still $4 plus dollar non -faulty active replicas left to agree on the value.
Which ensures that the consensus of the honest nodes always outnumbers the active faulty ones.
Always guaranteeing safety.
The classic real -world example here is Practical Byzantine Fault -Tolerance, or PBFT.
Right, PBFT assumes independent node failures in weak synchrony.
To stop message forging and tampering, all the communication is typically secured.
Nodes use cryptographic signatures and public keys to verify identities.
And like other leader -based protocols, PBFT uses views to identify the cluster configuration, with one node as the primary, or leader, and the rest as backups.
And a client request triggers PBFT's really high overhead three phases, which are shown in Figure 1413.
Okay, so Phase 1 is Pre -Prepare.
The primary broadcasts the client request.
It includes a unique ID, the view ID, and crucially, a payload digest.
This is computed with a strong cryptographic hash and signed by the primary.
Backups will only accept it if the view matches and the digest verifies the request hasn't been tampered with.
So the digest is a huge optimization.
It proves the payload is intact without having to send the entire, possibly large, request in the next very message -intensive phases.
Precisely.
So Phase 2 is Prepare.
If a backup accepts the pre -prepare, it broadcasts its own prepare message, with the view ID, message ID, and that same digest to all other replicas.
The system has to reach a consensus just on having received this message.
And why does a node have to wait for $2 matching prepare messages before moving on?
Why that specific number?
Because of that $3 plus dollars requirement.
By waiting for $2 matching messages from different backups, a node is guaranteed that at least $5 plus a lot of those matching messages came from honest replicas.
This definitively outnumbers the baller roll potentially malicious ones.
You need that high threshold for cross -validation when you're up against arbitrary malicious behavior.
And then Phase 3, Commit.
Nodes broadcast commit messages and they wait for $2 plus a dollar matching commits, which can include their own.
Only after hitting that $2 plus a dollar consensus threshold does the node finally execute the command.
And the client, on its end, waits for a few dollars plus dollar identical responses to confirm success, which ensures the response didn't come just from the set of faulty replicas.
PBFT is clearly the heavy -duty solution, and it must need robust ways to verify state, given the risk of a malicious primary just lying during a recovery.
That's where stable checkpoints come in.
Just relying on a log is vulnerable if a primary lies about the log's contents.
So replicas periodically compute a digest of the state machine's current state.
After every end request, the primary broadcasts the latest sequence number and the state digest.
It waits for $2 plus number one replicas to agree on that state digest.
Those responses form a cryptographic proof that lets replicas safely throw away old log entries and verify state integrity during recovery, no matter what the primary does.
So PBFT is essential when you have zero trust, but that extensive cross -validation and the requirement for $3 plus dollar nodes, it just imposes this massive overhead compared to ZAB or Raft.
It's a fundamental choice between speed and absolute resilience in an adversarial world.
This deep dive has really shown that the journey from that FLP impossibility theorem to a functioning, reliable, distributed system is just paved with these careful, intentional trade -offs.
We started by establishing the need for total order through atomic broadcast, and we looked at ZAB, which gets great efficiency and throughput by establishing a long -lived leader and just minimizing communication rounds once it's stable.
Then we got the blueprint, classic Paxos.
It gives us the gold standard for safety through that intersecting quorum principle, forcing every new decision to honor the decisions made before it.
And we saw variants like multipaxos boost efficiency and flexible Paxos offering engineers these tunable trade -offs.
Raft then gave us the gift of understandability, simplifying the roles and making the immutable log the central organizing concept, which is why it's been adopted everywhere in modern infrastructure.
And finally, PBFT defined the critical requirements for safety in adversarial Byzantine environments, demanding $3 pulls plus other one replicas and extensive cryptographically validated cross -validation to rule out malicious nodes in every single step.
The fundamental achievement of all of these protocols is transforming a collection of independent failing computers into a single reliable distributed state machine.
Understanding these choices, complexity versus speed versus adversarial resilience, is just essential for evaluating the performance and dependability of any modern data system you're going to encounter.
Indeed.
The transformation is really profound.
So here is a final provocative thought for you to take away, Lerner.
Consensus algorithms like Raft and Paxos, by agreeing on a sequence of commands, inherently force a total order on events across the entire cluster.
So if you were designing the next great database, how would you leverage that absolute guarantee of total ordering, knowing that every single change across the entire system happened in a sequence that everyone agrees upon to build features or consistency models that current distributed systems really struggle to offer?
Yeah.
Think about using that global undisputed sequencing power to solve problems beyond just simple replication.
Maybe you could enable unprecedented auditability or some kind of complex global state management.
Thank you for joining us on this Deep Dive.
We hope this knowledge serves you well.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML ♥