Chapter 13: Distributed Transactions
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome back to the Deep Dive.
Today we are cracking open, well, the source code of global consistency, you could say.
Yeah, we're looking at the complex blueprints that ensure that when a massive system, I mean one that's spread across data centers all over the world, executes an operation.
That operation is either completed everywhere perfectly or it fails entirely.
As if it never even happened.
It's really the ultimate engineering paradox of distributed systems.
How do you maintain perfect unwavering order when your data is inherently decentralized and, you know, prone to all sorts of failures?
Right, and this isn't just some problem.
This is the foundation of modern finance, e -commerce, pretty much any high -scale data operation you can think of.
So our mission today is a deep exploration of distributed transactions.
We're zooming in on the protocols,
the specific intricate algorithms that manage something called atomic commitment.
Which is all about getting multiple physical servers or partitions to agree.
We're moving beyond the world of a single database on one machine and into the high stakes world of global coordination.
And we'll be focusing on the trade -offs in durability, performance, and availability that define these systems.
Systems like Two -phase commit, Calvin, and Google Spanner.
And understanding these mechanisms is really non -negotiable for anyone who builds debugs or scales modern data platforms.
The protocol you choose for your transaction management, it dictates the absolute upper bounds of your system's performance.
And the severity of its Exactly.
You need to know not just what they are, but why they block or why they might risk inconsistency and what they actually cost in real world latency.
Okay, so let's lay the groundwork.
We've talked before about single object consistency.
But why do we need this more muscular concept like a distributed transaction for modern databases?
Because a single operation, say of financial transfer, or even just updating a user's profile, it's rarely just one single action.
It's a sequence of reads and writes that have to be treated as a unified, indivisible whole.
So we define a transaction as that atomic unit of execution.
A set of operations where if you look at the final state, either all the results became visible or none of them did.
Precisely.
And the classic money transfer example just illustrates this perfectly.
You're debiting account A and crediting account B.
Yep.
If you only complete the debit, the world has lost money.
And if you only complete the credit, money was just created out of thin air.
Both are totally inconsistent states.
The operation has to be atomic.
And even that simple debit, you know, just reducing the balance, it's internally complex.
You have to read the old balance, apply the change, write the new result back to disk.
And that whole sequence, which could take hundreds of microseconds, has to be protected from other things happening at the same time and from failure.
The flexibility of transaction processing is that you can often reorder or retry these operations, but that flexibility,
that's the root of the distributed coordination challenge.
So if we look at the entire lifespan of a database, we're trying to determine what are called permissible histories.
That sounds, I don't know, almost philosophical.
It is a little bit, but with math applied, we're modeling and representing all the possible ways that transactions could execute interleaved with one another, you know, on different cores or different machines.
And the main goal here is to achieve serializability.
Right.
A history is serializable if its dependency graph is equivalent to some history where the transactions just executed one after the other sequentially.
So serializability is kind of the theoretical gold standard.
It guarantees the end result is logically identical to a single threaded system, even if thousands of transactions are happening at once across your whole cluster.
Correct.
But getting this to work across multiple partitions is immensely harder than on a single partition database.
How so?
Well, local transactions use simpler concurrency control, either pessimistic, which is lock -based, or optimistic, where you try and then validate.
Okay.
But distributed transactions, the ones that span multiple servers, they require these dedicated multi -step commit protocols and the ability to coordinate rollback capabilities across machines that are physically separate.
And this requirement for recovery is, I mean, it's paramount.
What does it mean for a distributed transaction to be recoverable?
It means we need absolute certainty that if the transaction aborts or fails or times out at any point, the entire database state across all partitions is instantly reverted.
It has to look as if the transaction was never even tried.
You just can't have a partially executed right dangling out there, leaving things in an inconsistent state.
And this gets us to that famous insight by Lamson, which really summarizes the entire difficulty of this whole space.
The durability requirement applies universally.
Changes must be durably propagated to all nodes involved or to none of them.
That's it.
That all -or -nothing mandate.
That deceptively simple requirement is what forces the design of every complex, multi -round, highly chatty protocol we're about to discuss.
It's what drives the high latency that's always associated with strong consistency.
So before we jump into the protocols designed to manage all this, let's make sure we understand the structure they operate over.
Partitions.
We've been using the term partitioning or sharding.
Why is this so foundational to our discussion?
Well, partitioning is the only way to scale, really.
A single server just cannot hold all the data or handle all the traffic for a huge application like Twitter or a major banking system.
Right, you have to split it up.
So we logically divide the data into smaller, manageable segments called shards.
And each shard lives on a different physical server or a replica group.
And distributed transactions only become necessary when an operation needs to touch data that lives on two or more of these separate partitions.
Exactly.
So the two main ways to shard data are by range or by hashing.
Let's start with sharding by range.
Range partitioning is pretty simple to implement.
You define boundaries, say keys A through M go to server one and N through Z go to server two.
And clients use a routing key to figure out which replica set they need to talk to.
Yep.
But the critical trade -off here is load distribution.
You have to be careful.
What do you mean?
You have to size your partitions based on the distribution of your load and your values.
If your range boundaries are poorly chosen, you can end up with what's called a hot spot.
For instance, if you partition based on time, the current time range is always getting all the
Or if you partition by something like zip codes, the most popular city zip codes will become a hot spot.
And you get this huge load imbalance even though the data is logically divided.
So hashing seems like it would solve that immediate problem of range -based hot spot.
It does.
You compute a hash of the key and then you map that hash value to a node ID.
Since hashing tends to spread keys that are lexicographically close across different nodes, you dramatically reduce the chance of a contiguous data range causing congestion on a single node.
And the simplest way to do that is the modulo approach, right?
Hash V modulo N where N is the number of nodes.
Right.
But that modulo approach has this catastrophic Achilles heel when it comes to scalability.
Scaling problem.
It's the scaling problem.
If you add or remove just one node so N changes, the result of the modulo calculation changes for almost every single key.
Which means you have to redistribute the vast majority of your data across the entire cluster.
A massive, expensive, and time -consuming operation that basically grinds the system to a halt.
And this is what leads us to consistent hashing, the foundational technique used by systems like Cassandra and React to solve this scale -out problem.
What's the mechanism there?
Consistent hashing takes the hash values of all the keys and it maps them onto this imaginary ring that just wraps around.
Okay.
Each node also gets one or more positions on this ring.
And a node is responsible for the range of key values that fall between its own position and its predecessors on the ring.
That topology is the key.
So what happens when a node leaves?
When a node A leaves, only its immediate neighbors on the ring are affected.
Node B, which was A's successor, just assumes ownership of the range previously held by A.
And the rest of the cluster is untouched.
Exactly.
And conversely, when a new node joins, it only steals a portion of the data from its immediate successor.
On average, only K over N keys need to be relocated.
So it minimizes data movement and allows for much, much smoother horizontal scaling.
A huge improvement.
So we've established the data is scattered across these well -defined partitions.
Now we have to figure out how to commit an operation across all of them at the same time.
To guarantee that all or nothing behavior across these separate partitions, we need atomic commitment algorithms.
What is the single non -negotiable role of these protocols?
Their job is to elevate multiple remote operations to the level of atomicity.
They are purely concerned with reaching consensus on one binary decision, commit, or abort the transaction.
That's it.
And the core constraint here is absolute unanimity among all the participants or cohorts.
Unanimity is required, but it's really a negative mandate.
If even one participant votes against the commit, the entire transaction must abort.
That strictness is the only way to uphold the atomic guarantee in the face of partial failures.
It is.
And this strict requirement is why atomic commitment algorithms can't handle a specific, and often discussed, type of failure.
The Byzantine failure.
A Byzantine failure is one where a process actively lies about its state, or it misbehaves, or just decides things arbitrarily.
If a node can, for example, vote yes, but then secretly abort the transaction later.
The whole system's guarantee of unanimity just falls apart instantly.
Right.
Atomic commitment protocols have to rely on the assumption that nodes are honest, even if they fail because of crashes or network issues.
So if I'm a cohort, my job is extremely limited.
I can't influence the transaction content at all.
No, you're just a responder.
I just run my proposed sub -transaction locally, check if I can satisfy it, and then issue a binary vote, yes or no.
The database implementer gets to decide the low -level details, like when data is prepared in a pre -committed state, or how the fastest commit physically occurs, and the specifics of local rollback.
But the protocol itself only manages that global consensus.
And guiding this whole operation is the transaction manager.
The transaction manager is the central nervous system.
It's the subsystem that's responsible for scheduling, coordinating, executing, and tracking transactions across all the partitions.
And in a distributed environment, its key role is to make sure local visibility guarantees on each node line up with the global atomic operation.
Exactly.
It ensures that final state transition, the commit, happens across all partitions and all their replicas simultaneously,
or not at all.
So let's look at the classic way this agreement is enforced.
Two -phase commit.
So two -phase commit, or 2PC, is kind of the grandfather of atomic commitment protocols.
It's deceptively simple in concept, but its failure mode has driven decades of distributed systems research.
It is.
It relies on two main roles.
The coordinator and the cohorts.
If you imagine a diagram, like in figure 13 -1 in the source material, you'd see the coordinator initiating the process, holding the overall transaction state, and collecting votes.
And the cohorts are just the partitions that operate on the disjoint datasets the transaction needs to touch.
Yep.
And both roles have to maintain persistent local logs, which is absolutely crucial for durability and for recovery.
So phase one is the preparation and voting phase.
The coordinator kicks it off by sending a proposed message, containing the transaction details, to all the cohorts.
And each cohort executes its piece of the transaction locally.
They apply the rights, but they don't make them visible to the outside world yet.
Right.
They enter what's called a partially committed state, or a pre -committed state.
This is a critical point.
They've done the work, they've locked the resources, and they've persisted the changes to durable storage, but they haven't made it official.
And then comes the vote.
The cohort persists its decision, that yes, I can commit this, and then it responds with a positive vote.
And that positive vote is the most solemn promise in the entire protocol.
It's a point of no return.
It is.
The cohort is explicitly promising that it will not go back on this decision.
It will wait indefinitely, holding its locks and its pre -committed state for the coordinator's final command.
Indefinitely.
That word right there immediately flags a potential issue, which I'm sure we'll get to.
So phase two is the actual commit or abort.
The coordinator waits to collect votes from all the cohorts.
And only if every single one responds positively does the coordinator write the final commit record to its own durable log.
Then it broadcasts the final commit message to everyone.
And if even one cohort votes no, or if the coordinator just times out waiting for a vote.
Then the transaction is globally aborted, and the coordinator broadcasts an abort message instead.
So durability here hinges entirely on logging every single step.
Absolutely.
From the coordinator's proposal to the cohort's vote to the final commit decision, it all has to hit durable storage.
It's mandatory, because if a node crashes, it has to be able to look at its log, reconstruct its state, and figure out where it was.
Was it waiting for a vote?
Had it promised to commit, or had it already committed?
And that logging ensures the state can be recovered consistently.
Okay, let's analyze the failure modes.
This is where 2PC gets its reputation.
First, a cohort fails, like we see in figure 13 -2.
If a cohort fails during that proposed phase, the coordinator doesn't get a vote, it times out, and it has to abort the entire transaction.
And this immediately highlights a trade -off.
Availability is sacrificed for atomicity.
Totally.
A single -node failure prevents the whole transaction from going through, which reduces your overall system availability.
That's already pretty costly.
But the critical flaw, the one often shown in diagrams like figure 13 -4, is when the coordinator itself fails.
This is the infamous blocking issue.
Imagine the coordinator successfully collects positive votes from all the cohorts, but then it crashes right before it can broadcast the final commit decision.
So what happens to the cohorts?
They're left in a state of uncertainty.
They cannot proceed on their own.
They promise to commit.
So they can't just decide to abort.
And they don't know if the coordinator maybe managed to tell some of the other participants to abort before it died.
Exactly.
So they are locked, holding resources, holding locks, waiting,
potentially forever, for the coordinator to recover and tell them the final outcome.
So 2PC is defined as a blocking atomic commitment algorithm.
If that coordinator is permanently lost or just takes a really long time to recover.
Those transactions are permanently stuck, holding critical locks, and rendering those resources unavailable for anything else.
So the data is locked indefinitely, which causes resource starvation for future transactions.
And even though 2PC is simple, this blocking nature forces real -world implementations, you know, in things like MySQL or PostgreSQL, to invest heavily in really sophisticated recovery mechanisms and designated backup coordinators, just to try and minimize that window of uncertainty.
The blocking problem of 2PC was such an obvious flaw that, of course, the attempt to fix it led to three -phase commit, or 3PC.
And the goal was admirable, achieving non -blocking atomic commitment.
It was.
The core idea of 3PC, which you can see illustrated in figure 13 to 5, is to introduce a new intermediate step and then rely on aggressive timeouts.
And the idea is that cohorts, when they face a coordinator failure, can eventually time out and decide the transaction's outcome on their own.
Right, based on the shared state view provided by this new phase.
But,
and this is a huge but, this design rests on a fundamental and often deadly assumption about the network.
Which is?
It assumes a synchronous model and, crucially, that communication failures, specifically network partitions, are not possible.
Which in the real world is not a safe assumption.
Not at all.
If network partitions do occur, 3PC immediately becomes unsafe.
Okay, let's break down the three phases and see how this risk gets introduced.
Phase one is identical to 2PC.
Propose.
The coordinator sends the details, cohorts execute locally, enter that pre -committed state, persist their vote, and reply.
Okay, simple enough.
If the vote is unanimous, we move to the new phase two.
Prepare.
Right.
The coordinator sends a prepare message.
This tells all the cohorts that the transaction is now almost guaranteed to commit.
This step ensures that every participant knows that the majority is proceeding.
And the cohorts acknowledge that they've entered this prepared state.
Exactly.
And once all cohorts have acknowledged this state, this system is designed so that the transaction is considered committed, even if the coordinator fails during phase three.
And phase three is the final commit abort.
The coordinator sends the final commit command and the non -blocking part kicks in here.
If a cohort times out and hasn't heard from the coordinator, it relies on the state it reached in phase two.
So if it got to the prepare state,
it will eventually go ahead and commit.
But if it never received the prepare message, it assumes the coordinator failed earlier and it will unilaterally abort.
And now for the fatal trade -off.
The scenario in figure 13 -6.
The network partition and the resulting split brain.
This is three PCs undoing.
Imagine a network partition happens during phase two, the prepare phase.
The coordinator sends the prepare message, but only half the cohorts receive it.
So those nodes successfully move to the prepared state.
But the other half never receives the message.
So that first group is now prepared and guaranteed to commit after a timeout.
But what about the second group, the one that's partitioned off?
The second group, after it times out, it looks at its own state.
And since it never saw the prepare message, it assumes the coordinator failed and the transaction must be aborted.
So one half of the system commits the transaction and the other half aborts it.
That is a split brain scenario.
It's a fundamental inconsistency, a complete violation of atomicity.
So in systems that prioritize durability and consistency above all else, especially with realistic network failures,
three PC is just too dangerous.
It is.
It exchanged the blocking flaw for a catastrophic inconsistency flaw.
It's a classic CAP theorem problem.
Precisely.
It sacrifices consistency during a partition.
And that means it has seen very little practical, wide -scale use compared to two PC, despite solving the blocking issue in theory.
And it's also got higher overhead, adding a whole other round of messages.
OK, so if we conclude that coordinating the commit decision with two PC or three PC is inherently fragile or slow, the natural question is, can we avoid that coordination entirely?
This is the radical paradigm shift introduced by the Calvin Protocol.
Calvin is a deterministic transaction processing system.
What it does is it shifts the coordination cost entirely, from the execution phase to the input sequencing phase.
So you pay the cost upfront.
You pay the cost upfront.
The core idea is simple.
If every replica in your distributed system receives the exact same set of transactions in the exact same sequence, and they all execute those transactions deterministically, then they are guaranteed to produce equivalent outputs without any runtime locking or coordination messages.
Exactly.
It's incredibly efficient because you completely bypass the chattiness of two PC.
So how is this global deterministic order achieved?
It's achieved by a dedicated front -end component, the sequencer.
All transaction requests first hit the sequencer, and the sequencer determines the global transaction input sequence.
To manage scale and batch operations, it groups transactions into these short time windows called epochs, say, maybe 10 milliseconds long.
So the sequencer is the one responsible for the ordering.
How does it make sure every replica agrees on that order?
The sequencer itself has to use a robust consensus algorithm, something like Paxos internally.
This is to ensure that all the replica sequencers agree on exactly which transactions are included in the current epoch or batch.
And that batch then becomes the primary unit of replication and processing.
And from there, the ordered batch moves to the scheduler.
And what's its job?
The scheduler is responsible for deterministic execution.
Since the serial order is already guaranteed by the sequencer, the scheduler can analyze the transactions within the batch and execute the various subparts in parallel across different partitions, knowing that the overall serial order will be preserved.
It sounds like it dramatically reduces contention because the coordination on order is already paid for.
It does.
But it imposes a pretty significant architectural constraint on the application developer.
And what's that?
For Calvin to work, every transaction has to declare its entire access pattern upfront.
You have to define its read set, all the data it needs to read, and its write set, all of its side effects, before execution begins.
So what's the consequence of that constraint?
It means Calvin can't support dynamic transactions.
You know, those procedural or complex SQL operations where the application logic decides what data to read or write halfway through based on the results of an earlier read.
You lose that execution flexibility for deterministic performance.
That's the trade -off.
Let's follow the execution flow across the four worker steps, like you'd see in a diagram like figure 13 to 7.
Okay, so step one is analysis.
The worker analyzes the read and write sets that are provided by the sequencer.
It figures out which reads are local, and it identifies the act of participants, that is, all the nodes that hold data that's going to be modified.
Okay, step two is local data collection.
The worker collects all the data records it needs for the read set that already live on its local node.
And step three is remote data forwarding.
Right, if the transaction needs data from a remote partition for its read set, that data is forwarded over to the act of participant workers.
This is basically the communication needed to gather all the inputs before you start.
And finally, step four, execution and persistence.
The batch is executed, and the results are persisted locally.
What's so elegant here is the lack of a final commit phase.
Right, because every replica executed the exact same operations in the exact same predetermined order, there's no need for a distributed commit protocol like 2PC.
The moment the results are persisted locally, the transaction is considered globally committed.
So Calvin trades that dynamic transaction flexibility for predictable high throughput performance by just eliminating the runtime consensus overhead entirely.
All right, now let's turn to Google Spanner, which takes complexity to a whole new level.
Unlike Calvin, which avoids 2PC, Spanner embraces it.
It does, but it layers it over a consensus infrastructure, which fundamentally changes the protocol's availability profile.
Spanner is all about achieving the holy grail external consistency.
Which is a consistency guarantee that's equivalent to linearizability.
In a globally distributed high availability system.
To do this, Spanner had to solve 2PC's blocking flaw and create a system that could reliably order transactions globally across physically separate continents.
And the innovation that makes this global ordering possible is TrueTime.
TrueTime is Spanner's architectural cornerstone.
It's not just a clock, it's a high precision wall clock API.
It's backed by multiple GPS receivers and atomic clocks at every single data center.
And crucially, it doesn't just give you a timestamp to you.
No, it gives you T plus an uncertainty bound, we'll call it epsilon.
This epsilon dictates the potential error of the clock relative to true global time.
And why is that uncertainty bound so important for transactions?
It allows Spanner to enforce a strict global serialization order.
When a transaction commits, the system uses TrueTime to assign it a timestamp.
To ensure that transaction T1 definitely committed before T2 started, the system has to introduce artificial delays.
It actually has to pause local operations.
Yes, until the uncertainty interval of the assigned commit timestamp has passed.
This guarantees that T1's commit time is truly in the past, everywhere in the world, before T2 can even begin its commitment phase.
Wow.
Let's look at the actual Spanner architecture, which is complex but necessary to understand the system.
Figure 13 -8 in the source material lays this out.
So Spanner servers, or Span servers, manage data in units called tablets.
But the critical layer is that these tablets are attached to Paxos state machines and they form Paxos groups.
So the Paxos group is the fundamental unit of replication and fault tolerance.
Exactly.
And each group has a long -lived leader.
So the leader of the Paxos group is the transaction gatekeeper for that shard.
Every write has to pass through the leader.
It does.
The leader maintains a lock table for two -phase locking, which handles local concurrency, and it acts as the local transaction manager for any multi -shard transactions.
So when a transaction stands multiple shards, how does Spanner handle coordination without blocking?
It uses two -phase commit, but it runs 2pc over the Paxos groups, not over individual servers.
Ah, that's key.
Since Paxos ensures that even if individual servers within a group fail, the group maintains consensus and just elects a new leader, the 2pc process can seamlessly continue.
That's the architectural masterstroke, using consensus to provide fault tolerance to the coordinator role.
That layering of 2pc on top of Paxos is how Spanner mitigates 2pc's inherent availability problem.
So let's track the right path that leverages true time to guarantee that external consistency.
Okay, so a transaction coordinator starts the 2pc process.
The Paxos leader for each participating partition acquires its right locks using 2PO.
It then chooses a right timestamp that's greater than any previously committed transaction, and the preparation phase 1 of 2pc records this via Paxos consensus within the group.
And once all the leaders have successfully prepared, the coordinator generates the final commit timestamp based on true time.
Yes, and here's the crucial part.
Before any locks are released, the Paxos leader has to wait until after the chosen commit timestamp has elapsed, according to the TrueTime API.
This enforced delay.
This enforced delay, which is roughly twice the uncertainty bound of the clock, ensures that clients around the world will only see transaction results whose timestamps are definitively in the past.
This strict rule is what guarantees external consistency.
So the price of that global linearizable consistency is the additional complexity of the Paxos layer, the use of hardware -based atomic clocks, and the latency incurred by that TrueTime wait during every cross -shard commit.
It's a very high engineering cost for absolute certainty.
It's the maximalist approach.
It ensures the highest level of consistency and availability, which makes it suitable for applications where data integrity and global ordering are absolutely paramount.
We focus so heavily on serializability, the gold standard, but as we've noted, it comes with immense costs.
Many real -world systems choose a slightly weaker, but far more performant isolation level.
Snapshot isolation or SI?
Snapshot isolation is really widely adopted because it offers a great balance of consistency and speed.
Its guarantee is simple.
All the reads performed within a transaction are consistent with a snapshot of the database that was taken at the transaction start timestamp.
And what's the key anomaly that SI prevents?
It prevents read skew.
That's the problem where a transaction might read value X, then another concurrent transaction commits changes to both X and Y, and the first transaction then reads the new Y but the old X.
Resulting in an inconsistent view of related data.
Exactly.
Since SI pins the reading transaction to a fixed snapshot, its start timestamp,
it avoids this mixed reading problem entirely.
But SI is famously not serializable.
So what kind of anomaly does it still permit?
It permits write skew.
This happens when two transactions read the same data, make decisions based on those initial reads, and then modify their joint sets of data.
But together they violate some global integrity constraint.
Right.
Since snapshot isolation only checks for direct write conflicts, you know, it's trying to write to the exact same cell, it allows these mutually inconsistent but structurally disjoint writes to commit.
This brings us to Percolator, a system that beautifully demonstrates how to implement transactional semantics, specifically snapshot isolation, on top of the foundational non -transactional distributed key value store, in this case, Google's Bigtable.
Percolator is fascinating because it uses an existing distributed database as its storage layer.
It manages all the consistency above the storage layer using metadata.
So what did the physical storage actually look like under Percolator?
Data records, locks, and write metadata are all stored in separate columns in Bigtable.
And to manage concurrent access, Percolator leverages Bigtable's conditional mutation API.
Which allows it to do atomic operations, like checking if a key exists and writing a lock if it's absent, all in a single reliable remote call.
Exactly.
And what about the time source?
Instead of the complex two -time mechanism, Percolator uses a much lighter weight timestamp oracle.
A client consults this oracle twice per transaction, first for the cluster -wide consistent start timestamp, and then later during commit for the commit timestamp.
And the commitment process itself is a client -driven 2PC, which is pretty unique because the client, not some dedicated internal service, manages the transaction state.
Let's look at phase one, pre -write, as shown in figure 13 .9.
So the client iterates through all the cells it intends to write.
It tries to acquire a lock for each cell.
And one of these locks is designated as the primary lock, which is absolutely vital for recovery.
So during this pre -write phase, it's checking two things.
Yes.
First, that no other locks are present.
And second, that no later transaction has already committed data to that cell.
If either of those checks fails, the transaction just aborts.
If the pre -write succeeds, the data is locked and ready.
Then comes phase two, the commit.
The client first gets the global commit timestamp from the oracle.
Then it uses this timestamp to publish the writes, starting with the primary lock.
It replaces that primary lock with a permanent write record.
And updates the write metadata with the latest commit timestamp.
Once that primary is committed, the client just goes and releases all the secondary locks in the same way.
So what happens if the client fails, say, its machine crashes right in the middle of a commit, leaving all these locks dangling?
This is where the primary locks metadata shines.
If a later unrelated transaction comes along and encounters an incomplete state, an unreleased secondary lock, it traces that lock back to the designated primary lock.
And it tries to resolve it.
It tries to resolve the primary lock.
If the primary lock has been committed, the new transaction helps clean up the remaining secondary locks.
If the primary lock is found unreleased and it's timed out, the new transaction assumes the original one failed and executes a full rollback, ensuring atomicity is restored.
So Percolator very cleverly uses its underlying Bigtable conditional writes to implement a robust, recovery -aware version of 2PC, but one that's optimized for the high performance of snapshot isolation.
We've seen that all these strong consistency protocols, 2PC, Spanner, Calvin, they all incur significant coordination overhead, whether it's upfront or during the commit phase.
And that overhead dramatically hinders scalability, which has led researchers to explore this idea of coordination avoidance.
Right, because when systems need massive scale and high availability,
the latency you get from waiting for remote confirmation is the single biggest bottleneck.
Coordination avoidance seeks architectural properties that allow local execution to proceed independently without needing remote consensus for every single transaction.
And the key enabling concept for this avoidance is invariant confluence or I -confluence.
I -confluence is a technical property.
It asserts that if you have two valid but divergent database states, so they both satisfy the necessary integrity constraints locally, they can still be safely merged back into a single valid state.
So if the invariance can be preserved even after divergence, you don't need to block and coordinate upfront.
Which is a huge performance win.
Local execution is decoupled from remote operations.
So what four properties does a system need to guarantee to achieve this coordination avoidance?
It needs global validity so invariants are satisfied cluster -wide availability, meaning transactions proceed if invariants allow, convergence so all nodes eventually reach the same final state, and critically, coordination freedom.
And that last one means that executing a local operation must not be dependent on a remote operation or some consensus mechanism.
Let's look at a concrete example implementation.
RAMP, which stands for read atomic multi -partition.
How does RAMP achieve efficiency while still providing strong guarantees?
RAMP focuses on providing an isolation level called read atomic isolation.
This is stronger than eventual consistency, but it's weaker than serializability.
What does it guarantee?
It guarantees that any reader will see either all of another transaction's updates or none of them.
This prevents what are called fractured reads, where a client might observe a partial state transition.
And RAMP uses 2PC, but again, in a very limited, highly specialized way.
Right, RAMP uses 2PC strictly to manage write visibility, not for execution agreement or for locking for mutual exclusion.
This is a critical distinction.
So walk us through RAMP's use of the two phases.
Okay, so during the first phase, prepare.
The writes are placed into the target partitions, but they are marked as invisible.
They are durably stored, but no reader can see them yet.
And then the commit phase.
The second phase, commit, atomically publishes these state changes across all partitions simultaneously.
The 2PC mechanism guarantees that the state transition from invisible to visible happens atomically across the entire distributed system.
So by relying on MVCC and the read atomic isolation level, readers and writers can proceed concurrently without having to block each other during execution.
And the only time they pay that 2PC cost is to ensure the final result is published atomically.
It's really elegant.
RAMP shows that 2PC isn't just this archaic blocking mechanism.
It can be repurposed to manage only the atomic visibility of a resulting state, which allows for much higher performance systems than traditional 2PC, which uses the protocol to manage the entire execution and locking lifecycle.
We have completed a pretty comprehensive tour of the inner workings of distributed transaction management.
And it's clear that the choice of protocol is fundamentally an act of defining your system's philosophical priorities.
We started with that classic dilemma, the blocking consensus approach of 2PC.
Simple, but dangerous because of coordinator failure.
That led to the aborted attempt of 3PC, which traded blocking for the catastrophic risk of network induced inconsistency, the split brain.
Then we pivoted to the alternative paradigm of deterministic pre -ordering with Calvin, which eliminated runtime coordination by shifting the cost to the upfront sequencing phase.
But it demanded that applications rigidly define their read and write sets.
And finally, we examined the complex hybrids.
Spanner, layering 2PC over Paxos and using the external temporal anchor of TrueTime to achieve linearizability globally.
And Percolator, cleverly implementing snapshot isolation transactions over an existing distributed store using a client -driven 2PC for atomic locking and recovery.
The key takeaway is always the trade -off.
If you need absolute external consistency and global availability, you have to pay Spanner's price and hardware complexity and TrueTime latency.
But if you prioritize raw throughput and can tolerate slightly weaker isolation, like snapshot isolation, systems like Percolator or Ramp offer better performance.
And if you have non -dynamic transactions,
Calvin offers that deterministic speed.
The overarching trend seems to be away from these centralized fragile coordinator roles and toward foundational fault -tolerant consensus algorithms like Paxos.
These can sustain the decision -making process even when individual machines fail, which reduces those blocking windows associated with the older 2PC models.
So to leave you with a final thought,
when you decide to build a scalable application, you are not just selecting software.
You are selecting an architectural risk profile.
Every message sent, every lock acquired, every microsecond of delay for TrueTime.
That is the sheer cost of guaranteed global atomicity.
So you should reflect on that cost.
How many coordinating steps are you willing to take to ensure that a data transfer executes everywhere or nowhere?
That balance between availability and absolute certainty is really the core challenge of modern data engineering.
Thank you for joining us on this deep dive into the protocols the power of distributed transactions.
Until next time, keep digging into the details.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML ♥Related Chapters
- Digital Wallet System DesignSystem Design Interview - An Insider's Guide (Volume 2)
- Hotel Reservation System DesignSystem Design Interview - An Insider's Guide (Volume 2)
- Consistency Models & Consensus AlgorithmsDesigning Data-Intensive Applications
- Design a News Feed SystemSystem Design Interview - An Insider's Guide (Volume 1)
- Design a Unique ID Generator (Distributed Systems)System Design Interview - An Insider's Guide (Volume 1)
- Distributed Email Service DesignSystem Design Interview - An Insider's Guide (Volume 2)