Chapter 5: Replication in Distributed Data Systems

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are tackling, well, the dark art of modern data systems.

Replication.

How do you keep the same data everywhere?

It's a fundamental problem.

It is.

And this dive is essentially a shortcut to understanding chapter five of, you know, one of the foundational texts in system design.

And if we need a philosophical starting point, I think we have to borrow from the great Douglas Adams.

Oh, I think I know which one you mean.

He said the major difference between a thing that might go wrong and a thing that cannot possibly go wrong.

Is that when a thing that cannot possibly go wrong goes wrong.

It usually turns out to be impossible to get at or repair.

And that quote is just perfect because replication, I mean, simply making copies of data on multiple machines, it sounds so simple,

but it is the definition of a necessary complexity.

Right.

The moment that data starts changing, you hit those impossible problems Adams describes.

Okay, let's unpack this then.

We replicate for, I think, three clear, compelling reasons.

We need the data close to the user to, you know, reduce latency.

Speed, yeah.

We need to tolerate hardware or network failure.

So we need increased availability.

Can have the system go down.

And most commonly, we need to handle massive traffic.

So we have to scale read throughput.

Exactly.

And our deep dive today is going to focus entirely on the complexity that comes from handling changes to that replicated data.

Okay.

We are going to examine the three fundamental architectural patterns that dictate how those changes are coordinated.

Single leader, multi -leader, and leaderless replication.

And these models, they really define the fundamental trade -offs in distributed systems design, don't they?

They absolutely do.

So let's begin with the workhorse of the industry.

Single leader replication.

Some people might call it active passive or master slave.

Right.

This model is all about enforcing a strict ordering of events.

You have one designated leader, the primary or the master, and this node,

it's the only one permitted to accept client write requests.

Only one.

It logs that change locally, and then it forwards that change log to all the other nodes.

And those other nodes are the followers or secondaries.

Their only job is to apply those changes in the same exact sequence they receive them, so they eventually catch up to the leader.

Right.

Clients can read from any node they want, but writes.

Rights have to go to that single leader.

So the core decision here, and it feels massive, is how the leader confirms that write.

This is where we get the split between synchronous and asynchronous replication.

It's a huge decision.

With synchronous replication, the leader waits.

It doesn't tell the client, yep, your write succeeded until at least one follower confirms they have it.

So the benefit is pretty immediate there.

High durability.

The data is safe because it exists in at least two places before you even get a confirmation.

But the drawback is severe if that one synchronous follower fails or the network link breaks.

The whole system just stops.

The entire system blocks.

No writes can proceed.

Because of this, you know, this single point of failure risk, you rarely see systems run all their replicas synchronously.

Which is why we often see semi -synchronous, maybe one follower sync for that basic durability guarantee, and the rest are just asynchronous.

Exactly.

And with asynchronous, the leader confirms the write to the client immediately, sends the change, and just moves on.

It's much faster, much higher availability.

But there has to be a catch.

And here is the critical trade -off that system designers have to consciously accept.

If the leader fails before that asynchronous replication finishes,

those committed writes are gone.

They just vanish.

No.

That's a huge risk to accept, isn't it?

You're basically choosing speed and availability over the absolute guarantee of durability for like a tiny window of time.

And so many systems choose this path.

Because for many internet applications, at a fraction of a second you save on latency and the increased availability during short network hiccups, it just dramatically improves the user experience.

It's a pragmatic compromise.

Okay, let's shift to operational complexity.

If a follower fails, that seems simple enough.

It just recovers using its local log.

But what happens when the leader fails?

That requires failover, which is inherently dangerous.

It's a three -step dance.

Okay.

Step one, detecting the failure, usually with a timeout, which can be really tricky during temporary network issues.

Step two,

choosing a new leader.

This requires consensus.

The new leader should be the follower that has processed the most changes.

The most up -to -date one.

And step three, reconfiguring all the old followers and all the clients to recognize the new boss.

And the dangers here are twofold, right?

If replication was asynchronous, you have discarded rights.

The system just loses data.

But the worst danger is the split brain scenario.

Split brain is the nightmare.

It's where two nodes mistakenly believe they are the leader.

They both start accepting rights and you get irreversible data corruption.

And we have real world examples showing how bad this can get.

The GitHub incident mentioned in the sources.

It's pretty stark.

It is.

A lagged follower was promoted to leader during a failover and because its data was old, it reused an old sequence number for its auto incrementing primary keys.

And when the original fresh leader eventually came back online, that key collision caused data integrity problems that just, they rippled across all their other dependent systems.

It proves that failover is never perfectly safe.

That really puts the risk into perspective.

Okay.

Now let's leave the who, the leader and follower and talk about how the changes are actually communicated.

The replication logs.

We have four main methods here.

The first and I'd say the least reliable is statement -based replication.

You just log and send the raw SQL statements, insert into, update.

And why does that break down so often?

Non -determinism.

Functions like nw or rand will produce different results when they're executed on the leader versus the follower.

Plus with concurrent transactions, the order becomes incredibly fragile.

Okay.

So that one's out.

Next up is write -ahead log shipping or wall shipping used by heavyweights like Postgresql and Oracle.

Right.

This is super efficient because it's sending the very low level byte -by -byte changes that the storage engine already uses for durability.

But the trade -off is fascinating.

Replication becomes tightly coupled to the internal storage engine format and critically to the software version.

So if I want to upgrade my database software, I can't just upgrade the leader and then the followers one by one without breaking things.

Exactly.

You're locked in.

You often cannot run different versions of the software on the leader and followers at the same time, which makes zero downtime upgrades exceedingly difficult.

And that operational constraint is a major reason why the third method is gaining popularity,

logical log replication.

This is the intelligent middle ground.

It decouples the replication log from the physical storage engine internals.

How does it do that?

Instead of bytes, changes are logged at the granularity of a row.

You know, the old and new values for an update or the primary key for delete.

This log format is much easier to maintain for backward compatibility.

And crucially, much easier for external applications to read and understand.

That's the key.

And that external consumption is known as change data capture or CDC, which is essential for feeding things like data warehouses and search indices.

You're turning your database into a stream.

Right.

And finally, we have trigger based replication.

It's less common for core functionality, but useful for, say, replicating only a subset of data.

But it comes with a lot of performance overhead.

Okay.

So now let's move to the consequence of using all this asynchronous replication to scale weeds, replication lag.

This brings us into the world of eventual consistency.

And that term, eventually, it's terrifying because it's so vague.

Under normal load, eventually might be milliseconds.

But if there's network congestion or followers overloaded, eventually could stretch to minutes.

And that causes real visible chaos in the application.

So we need to protect the user from seeing that chaos.

Let's break down the three fundamental consistency anomalies caused by lag.

Number one, read your light's consistency.

So you, the user, you write a new profile update to the leader.

You immediately refresh your page, but your read request hits a stale follower.

And my update is gone.

It's missing.

This immediately breaks trust in the application.

So what's the strategy to fix this without forcing every single read to hit the leader, which would defeat the purpose of scaling?

You need smart routing.

For instance, you could associate the user with the logical timestamp, the LSN of their most recent write.

When they request a read, you only serve that request from a follower that is guaranteed to have processed at least up to that specific LSN.

And that gets even more complex if the user moves from their phone to their laptop, right?

You need that consistency guarantee to follow them across devices.

It does.

Now the second anomaly is monotonic reads.

This is where the user sees time move backward.

This sounds disorienting.

It really is.

Imagine you see a new comment in a thread because your first request hit a fast, up -to -date replica.

Two seconds later, you refresh, your request hits a heavily lagging replica, and that comment is suddenly gone.

Only to reappear later when I hit a fresh replica again.

To prevent that, you have to guarantee that once a user sees a particular state, they never see an older state.

The common fix here is sticky reads, right?

You just hash the user ID and make sure that user always reads from the same designated follower.

That's the simplest way.

They might see stale data for a bit, but at least the data won't jump back and forth in time.

And finally, the most challenging anomaly, which deals with causality, is consistent prefix reads.

This is our Mrs.

Cake and Mr.

Prune's example, where Mrs.

Cake's answer appears before Mr.

Prune's question.

A sequence of causally dependent writes.

The question has to come before the answer.

They get replicated through different network paths, arrive out of order, and violate our sense of logical time.

And this is especially difficult when data is partitioned, which I guess we'll get to in the next chapter.

The key takeaway here is that once you introduce asynchronous replication,

you cannot rely on the database alone to maintain consistency.

The application layer has to implement these protective measures.

Okay, let's explore our second model.

Multi -leader replication, or active -active.

This is where multiple nodes can accept writes and act as both leaders and followers at the same time.

This introduces a lot of complexity right away, but it's invaluable for a few specific scenarios.

The first is multi -data center operations.

Right, if you have data centers in London and New York, you can have a leader in both.

Local writes are fast and replication happens asynchronously across the ocean.

It dramatically improves performance and tolerance for entire data center outages.

It also works well for clients with offline operation like a collaborative note app, where your device is its own leader and syncs when it's connected.

But the immediate cost is the write conflict.

Since two different leaders can accept a change to the exact same record at the same time.

The conflict is only detected asynchronously later when the change streams cross.

So how do you guarantee the system converges to a single correct final state?

The goal is convergent resolution.

The simplest but most dangerous method is last write wins, LWW,

where the change with the largest timestamp or unique ID wins.

But LWW is so prone to data loss because it completely ignores the intent of the operation.

Right.

We loved the Amazon shopping cart story where the LWW logic prioritized additions over deletions, causing items that users had removed to reappear over and over again.

That's it.

It shows that simple automated rules are just fraught with risk.

You need either custom logic executed by the application, which can be run on write or on read, or you rely on more sophisticated mathematical approaches.

And this is where things like CRDTs come in, right?

Exactly.

Conflict free replicated data types.

They are special data structures like counters or sets that are mathematically designed so can be merged automatically in any order and still guarantee convergence without losing data.

It's a very promising area for modern collaborative tools.

Okay.

Let's jump to our final model, the most decentralized one, leaderless replication.

This is the dynamo style used by systems like Cassandra and React.

This flips the script.

There is no leader to enforce ordering.

Any replica can accept a write.

The entire framework relies on quorums.

And then these are defined by three parameters, N, W, and R.

Right.

Another all is the told number of replicas.

Dollars is the number of nodes that must confirm a write for it to succeed.

And three dollars is the number of nodes you query for a read.

And here is the fundamental mathematical brilliance.

The system is designed to be consistent if dollar plus rons is greater than dollar.

Dollar plus rn.

So why does that work?

Because that mathematical inequality guarantees that the set of nodes you wrote to and the set of nodes you read from must overlap by at least one node.

And that overlap guarantees you will always encounter the latest committed value.

That's the core safety net.

Now, if a node is temporarily unavailable, you just continue, provided you can still meet your dollar or twaller quorum counts.

And how do stale nodes catch up?

Two mechanisms.

First, read repair.

If a client reads conflicting versions, it writes the newest version back to the lagging replica.

Second, an anti -entropy process.

Just a background mechanism constantly scanning for differences and copying missing data.

This model sounds incredibly resilient, but what happens during a severe network partition?

That forces a choice between availability and consistency.

If we stick to strict quorums, writes will fail during a partition.

So to maintain a high write availability, systems introduce sloppy quorums.

Okay, what does that mean?

It means if a target node is unreachable, the write is temporarily accepted by another reachable node, a hinted handoff.

But by allowing a node outside the designated end set to accept the write, you're breaking the dollar plus R in a long guarantee.

Precisely.

Sloppy quorums prioritize availability above all else during a storm.

They ensure the data is durable, it's recorded somewhere, but they explicitly sacrifice that strict consistency assurance.

So we arrive back at the same core complexity as multi -leader systems, detecting and resolving concurrent writes since there's no central leader to impose an order.

Right, and we use the idea of the happens before relationship to define concurrency.

Two operations are concurrent if neither one is causally aware of the other.

And to manage this without a leader, we use version vectors.

So how do these version vectors work conceptually?

Think of it as a historical ticket the client carries.

When you read a key, the database gives you the data plus a version vector summarizing its causal history across all the replicas.

Then when you write back, you include that history ticket.

You do.

And this allows the database to distinguish.

Is this a simple overwrite where the new write is causally after the old one?

Or is this a concurrent sibling where two clients independently change the data?

And if the database detects concurrent siblings, It returns all of them to you.

And the critical detail is that the application is responsible for merging those siblings into a unified value before writing it back.

And if you delete a value, You can't just remove it because a lagging replica might reintroduce it.

You have to write a special marker called a tombstone to ensure the deletion sticks.

Wow.

We just covered a remarkable amount of complexity.

To quickly recap for you, our listeners, we analyzed three architectures.

Single leader for strict ordering.

Multi leader for multi data center efficiency, but with asynchronous conflicts.

And leaderless, leveraging quorums for availability, but demanding complex concurrency resolution with things like version vectors.

And we also detailed the three application facing consistency models that safeguard the user experience.

Read your whites consistency, monotonic reads, and consistent prefix reads.

Each one represents a fundamental guarantee that distributed systems have to strive for.

It's crystal clear that replication is hard, but what's fascinating here is that every single we discussed assumes the entire data set can fit on every machine.

That's right.

And if your data set grows too large, you can no longer replicate everything to every node.

You have to split your data across machines, a process known as partitioning or sharding.

And that segmentation.

It introduces a whole new universe of complexity when you're trying to maintain consistency across different non -overlapping parts of your system.

And that, I suspect, is exactly where our next deep dive will take us.

Thank you for guiding us through this foundational and really dense material.

My pleasure.

And thank you, our listeners, for joining us on the deep dive.

We hope you feel thoroughly informed.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Replication across distributed systems serves multiple critical objectives: enhancing system availability, expanding read capacity, and minimizing latency by positioning data closer to end users. The central challenge lies in efficiently coordinating data modifications when copies exist on multiple machines. Three primary architectural patterns address this challenge, each with distinct tradeoffs. Single-leader replication designates one authoritative node to process all write operations, then broadcasts these changes to replica nodes via transaction logs or event streams. This model forces a choice between synchronous replication, which guarantees consistency but creates vulnerability to node outages, and asynchronous replication, which improves throughput at the cost of potential data loss during failover events. When the leader fails, the system must detect this failure, elect a successor node through consensus mechanisms, and reconfigure all replicas—a process fraught with risks including the loss of uncommitted writes and split brain scenarios where multiple nodes claim leadership. Three implementation techniques include statement-based replication, which falters when functions produce nondeterministic results; Write-Ahead Log shipping, which tightly binds the storage mechanism to replication logic; and logical log replication, which decouples these concerns. Asynchronous followers inevitably lag behind the leader, creating eventual consistency conditions that produce application-level anomalies. Ensuring read-after-write consistency prevents users from seeing stale versions of their own submissions, monotonic reads prevent backward time travel in data visibility, and consistent prefix reads maintain the causal ordering of related operations. Multi-leader replication extends write acceptance to multiple nodes, delivering superior local performance and resilience in multi-datacenter environments or offline scenarios. Its primary drawback is the necessity for conflict resolution when concurrent writes target the same data. Last Write Wins strategies risk silent data loss, while Conflict-Free Replicated Datatypes offer structured approaches to automatic resolution. Leaderless replication eliminates the leader node entirely, permitting any replica to accept writes and relying instead on quorum-based mechanisms defined by total replica count, minimum write confirmations, and minimum read participants. Convergence occurs when write and read quorum parameters satisfy the intersection property. Stale reads are handled through read repair or background anti-entropy processes. During network partitions, sloppy quorums and hinted handoff preserve availability while accepting consistency degradation. Version vectors enable accurate identification and merging of concurrent write branches across replicas.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 5: Replication in Distributed Data Systems

Related Chapters