Chapter 8: The Trouble with Distributed Systems

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are strapping in, for what is frankly,

a necessary dose of technical paranoia.

That's a good way to put it.

We're plunging into Chapter 8 of Designing Data Intensive Applications, which takes a long hard look at the pessimistic reality of distributed systems.

It's a really crucial chapter because it just forces you to abandon all the simple assumptions we make, you know, when you're just writing code on a single machine.

Right.

So for you, the listener, the mission today is, I think, pretty simple.

We need to fully internalize the difference between a single node system, which is predictable, and a multi -node networked system, which is fundamentally chaotic.

Exactly.

Fundamentally chaotic.

And that distinction is so key.

When you code for one server, if something goes wrong, the system often defaults to what's called a deterministic crash.

You get a blue screen, a kernel panic, a total clean stop.

Right.

The system basically prefers to be entirely broken rather than just subtly wrong.

It's a deliberate choice.

It is.

And it's a luxury that we completely lose the moment we cross the network boundary.

In the world of distributed systems, we have to deal with partial failures.

Okay.

So what does that mean exactly?

It means some components are failing in a completely unpredictable way, while others are working perfectly fine.

And that non -deterministic chaos is the core architectural challenge you have to solve.

And the sources really drive home that these failures aren't just software bugs.

They are physical.

Oh, absolutely.

We're talking about everything from network cards failing, power distribution units going out, to environmental disasters.

We love that example in the text.

The hypoglycemic driver.

Yes.

A hypoglycemic driver running a pickup truck right into the data center's HVAC unit.

It's so specific, you have to assume it actually happened?

It must have.

And if your baseline assumption has to account for, you know, vehicular damage caused by low blood sugar,

your software design just has to be built on extreme caution.

So that's the tone we're setting.

We have to assume the worst.

And we're going to look at three major sources of this chaos today.

Unreliable networks,

unreliable clocks, and then the resulting philosophical meltdown about defining knowledge and truth.

Sounds fun.

Let's start with the harsh reality of those partial failures.

Let's do it.

So let's just unpack that core concept a bit more.

Partial failure.

It dictates everything else.

It means the system is kind of limping.

Yeah, that's a good word for it.

Part of it is failing, but the whole thing is still expected to be available.

And this creates a really fascinating split in how people build big systems.

There are two big philosophies.

Right.

So on one side you have high performance computing.

Think massive scientific supercomputers.

Okay.

When one of their nodes fails, they often just, they stop the whole cluster.

They repair the multi -node and then they restart the simulation from a safe checkpoint.

So they escalate a partial failure into a total but controlled failure.

Procesamente.

They use specialized, super expensive, reliable hardware, and they prioritize correctness over being online all the time.

That sounds incredibly expensive and time consuming.

It is.

But now compare that to the world we all live in.

Cloud computing and internet services.

We have to be online 247.

Always.

And we use commodity hardware.

You know, cheap stuff with much higher failure rates.

We solve the partial failure problem, not by stopping everything, but by building fault tolerance mechanisms directly into the software.

So we're using unreliable parts, but the engineering challenge is to make the whole system reliable despite them?

Yes.

It's like, it's a constant battle against gravity.

It sounds like it.

But we do it all the time.

I mean, the classic example is TCPIP.

The underlying internet protocol, IP, is designed to be unreliable.

It might drop packets, reorder them.

It makes no promises.

None at all.

But TCP, the transmission control protocol, sits right on top of it and works tirelessly to hide all that unreliability.

It handles retransmissions, gets rid of duplicates, puts packets back in order.

The software is the reliability layer.

Okay, that makes sense.

Let's zero in on that first massive source of uncertainty then, the network itself.

The big one.

In most internet services, you see what's called a shared nothing architecture.

And that means all communication happens over an asynchronous packet network.

What's the key implication of asynchronous here?

It just means we have no guarantee of timing.

Zero.

There's no firm upper bound on how long a message will take.

Could be a millisecond.

Or it could be an hour.

Or it might just never arrive at all.

And you, the sender, are just stuck in this state of profound doubt.

And the book lists, what, six catastrophic possibilities when you send a request and just get science back?

Yeah, and they're all equally likely.

Your request could have been lost or the network is just clogged up.

Maybe the other machine actually crashed.

Or maybe it's just paused for a second.

Like, it's doing a huge garbage collection run.

Yeah.

And then the really fun ones.

Maybe it got the request, processed it successfully, but then the response was lost.

Or the response was just delayed.

And here's the brutal truth.

All six of those outcomes are fundamentally indistinguishable to the sender.

You just can't know.

You can't tell if the other guy is dead, slow, paused, or if the network just ate your message.

You are completely in the dark.

That is horrifying if you're an engineer who needs a definitive answer.

So if you can't know the reality, what's the only tool you have to move forward?

The timeout.

It's a necessary evil.

Since you can't know the true state of the remote node, you just set an arbitrary timer.

If you don't hear back, you declare the operation failed.

Or, even worse, you assume the node is dead.

But that's just a guess.

The timeout is an engineered compromise, and it sounds inherently fragile.

Oh, it is.

If I set my timeout really short, I get more speed, but I risk, you know, prematurely declaring a perfectly healthy node dead just because it hit a little bit of congestion.

And that risk isn't just about wasted effort.

It can cause a cascading failure.

How so?

Well, imagine a node is briefly overloaded and it starts running slow.

Your monitoring system, with that aggressive timeout, declares it dead.

And immediately shifts its workload to its neighbors.

Right.

Now those healthy neighbors suddenly have double the traffic.

So they get overloaded, they slow down, and then they get declared dead.

A tiny momentary slowdown becomes an existential crisis for the entire cluster.

So the problem isn't just about raw network speed.

It's about the unbounded delay.

And that variability, it mostly comes down to queuing, right?

Absolutely.

Queuing is the enemy of predictable timing.

Packets queue up in network switches, they queue up in the operating system waiting for the CPU.

And in virtual environments, I imagine it's even worse.

Much worse.

VM pauses, context switching.

These can cause huge delays, buffering requests across multiple layers.

And we tolerate all of this because of a massive cost -benefit trade -off.

We use asynchronous packet switching, like Ethernet and IP, because it's cheap and it maximizes resource utilization.

We can handle bursty traffic.

Exactly.

You know, contrast that with the old traditional telephone network.

That was synchronous circuit switching.

It guaranteed a maximum delay because it would reserve a fixed, dedicated bandwidth slot for your call.

No queuing, no variability.

But it was super inefficient with resources, slow to set up, and very costly.

So we choose high utilization and low cost.

And the price we pay is all this complexity.

We have to handle these non -deterministic delays in our software.

Okay.

So if the network is lying to us about timing,

let's look at the second major source of confusion.

The local time inside the machine itself.

I always thought a computer clock was just a clock.

Why does the chapter break it down into two different types?

Because they serve fundamentally different purposes.

And if you confuse them, you get really serious distributed bugs.

Okay.

So first you have the time of day clock, the wall clock.

This is what returns the current date and time.

It's synchronized using NTP, the network time protocol.

But, and this is the key, it can be adjusted.

It can suddenly jump forward or even backward.

Backward.

Why would it jump backward?

Big corrections from the NTP server.

Or because of leap seconds.

So because it can jump, that makes it totally unreliable for measuring system behavior.

It's unsuitable for measuring durations.

Correct.

If you try to time something using the wall clock, your duration might come out negative if a big NTP correction happens in the middle.

Wow.

And that's why we need the second type, the monotonic clock.

Okay.

And the monotonic clock is?

It's guaranteed to always move forward.

It's perfect for measuring intervals, like timing that critical 50 millisecond service response, because it never jumps.

But, and here's the other catch, its absolute value is totally arbitrary and meaningless outside of that one machine.

You can't compare it to another machine's monotonic clock.

So even with the best tech like NTP, synchronization is limited.

Very limited.

The hardware clocks themselves suffer from clock drift.

They just naturally run a bit faster slow.

NTP tries to compensate, but its accuracy is limited by the network delay to the time server.

You're often left with an error of tens of milliseconds between machines.

So there's this inherent unavoidable error between any two synchronized machines.

But the really insidious problem, this connects back to the network section, is process pauses.

This is what's truly terrifying.

A node can suddenly just freeze for an unbounded amount of time, and the application thread has no idea it's stopped executing.

It's like a mini -crash that the node itself can't even detect.

Exactly.

And these pauses aren't exotic things.

We see them all the time.

A stop -the -world garbage collection, or GC event in Java or Go, can easily freeze a process for seconds, or longer.

Or if your machine is virtualized, a live migration to the VM, or even just the OS paging memory to disk, can pause your thread arbitrarily.

So walk me through why that's so dangerous.

Okay.

Imagine a node gets elected leader, and it gets a time -based lease, say, for 30 seconds.

Right.

Then that process gets hit with a GC pause for 45 seconds.

The rest of the world sees that its 30 -second lease expired, and they elect a new leader.

But when the original node wakes up?

When it wakes up, it still believes its lease is valid.

Its internal monotonic clock didn't run during the pause.

So it now exists outside the current reality of the system.

It's a time traveler.

That terrifying scenario, the node that wakes up and thinks it's still the boss, that brings us right to the fundamental philosophical crisis of this chapter.

How do we establish truth when every piece of information is unreliable?

You can't.

I mean, a node in a distributed system cannot know anything for sure.

It can only make educated guesses based on the unreliable messages it receives.

When a node is paused, it becomes that time traveler operating on stale information.

We simply cannot rely on any single machine's judgment.

So we need external arbitration.

This is where the concept of a quorum comes in, right?

Yes.

A quorum is typically a majority so, more than half of the nodes in a system.

By requiring a quorum to agree on something.

Like electing a leader or committing a right.

Or even declaring a node dead.

By doing that, we minimize the chance that one single paused or failed node can lead the entire system astray.

If a majority says a node is dead, that node must step down regardless of what it thinks internally.

Let's use a practical example to really show the danger here.

The book has this great description of the leader and the lock problem.

It's a classic.

So a client needs to update a record and it asks a lock server for a time limited lease.

Okay, it gets the lock for say 30 seconds and starts its work.

Mid -update, the client gets hit with a huge GC pause for 45 seconds.

A lock expires.

Meanwhile, a second healthy client comes along, requests the lock, gets it, and starts its own write on that same record.

And when the first client finally wakes up.

It has no idea it was paused for 45 seconds.

It just sees it got a lock 30 seconds ago.

It thinks the lock is still valid and it finishes its stale write operation.

And corrupts the correct data that was just written by the second client.

And just like that, your data integrity is shattered because of a timing violation caused by an unpredictable pause.

This is where we need a safety mechanism that isn't based on time.

Which brings us to fencing tokens.

Such an elegant solution to this time traveler problem.

Explain how they work.

It's actually pretty simple.

Instead of relying on time whenever the lock server grants a lock, it also returns a monotonically increasing token.

It's just a number that always goes up.

So client A gets the lock, it gets token one.

After client A pauses and its lease expires, client B gets the lock and gets token two.

Exactly.

And the critical step is that the client has to include that fencing token with every single write request it sends to the storage service.

The storage service then checks the token.

Ah, so it acts like a gatekeeper.

It does.

If it gets a request with a token that's older or lower than a token it has already processed, it just rejects the operation.

So when our paused client A wakes up and tries to write with its old token one.

The storage server has already seen token two from client B and it immediately says, note you're too old and rejects the stale request.

It prevents the corruption.

It moves the responsibility for checking dilidity from the unreliable client to the trusted storage layer.

That's how you build safety.

We should probably just briefly mention Byzantine faults here too.

Right, where nodes can actually lie.

Yeah,

everything we've discussed, drop packets, pause nodes, assumes the nodes are unreliable but honest.

They fail, but they don't deliberately send malicious data.

Dealing with true Byzantine faults where nodes might lie requires even more complex and expensive protocols.

Most commodity systems wisely avoid it.

And to structure all this chaos, we rely on what are called system models.

These are just formal ways to reason about failure.

Right, and on the timing side, the most useful and realistic model is called partially synchronous.

That means...

That means we assume the system is synchronous.

Most of the time messages arrive within some fixed bound.

But those timing guarantees can be shattered occasionally during network congestion, during long pauses.

And for node failures...

We generally assume the crash recovery model.

Nodes might crash, but we assume they preserve their state on stable storage and can come back online later.

So after this, this tour of technical horrors, what is the core lesson for engineers?

It seems to come down to distinguishing between two critical properties, safety and liveness.

That's the heart of it.

Safety means nothing bad ever happens.

This is all about data integrity.

You know, making sure two clients never hold the same lock or that data is never corrupted.

And if safety is violated, the consequences are permanent.

They are.

And your algorithms have to ensure safety holds even under the absolute worst fault conditions.

And then there's liveness.

Which means something good eventually happens, that a request eventually gets processed, or that the system is available.

But liveness properties usually come with caveats.

They often only hold if the network eventually recovers and stays stable for a little while.

So algorithms always have to prioritize safety over liveness.

Always.

Because we would much rather fail to respond than give a corrupt or incorrect response.

The big takeaway from this whole deep dive seems to be that the distributed world is messy by design.

It really is.

We choose cheap commodity hardware and high resource utilization, which forces us into this asynchronous world with its unpredictable delays.

The mountain of complexity we've talked about, the timeouts, the quorums, the fencing tokens.

That's the price we pay for that cheap, efficient infrastructure.

So if you can possibly use a single computer.

For goodness sake, use it.

But if you must build distributed systems, you now know the tools to embrace the paranoia.

We've established that safety is achievable even under these chaotic assumptions.

And since we know how to define truth and deal with failure, the next logical question to ask is, how do we achieve consensus?

How do all these nodes faced with unreliable networks and unreliable clocks actually manage to agree on a single linear history of events?

That sounds like the ultimate challenge.

It is the ultimate goal of distributed systems and is what we'll dive into next time.

A perfect foundation for that next challenge.

Thanks for joining us for the deep dive.

My pleasure.

We'll see you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Distributed systems present fundamental challenges that distinguish them sharply from single-machine software, primarily through the inevitability of partial failure where certain components malfunction while others remain operational. The network itself serves as a primary source of unpredictability, functioning as an asynchronous packet-switched medium that cannot guarantee message delivery or bound transmission delays, making it impossible for a sender to reliably determine whether a request was lost, a remote node is unresponsive, or a response failed to arrive. Fault detection mechanisms rely on timeouts, yet configuring these accurately proves difficult because network delays vary substantially due to congestion at switches, operating system scheduling behavior, and pauses introduced by virtualization layers. This variability stems from dynamic packet switching, designed to handle bursty traffic patterns efficiently, unlike circuit-switched networks that provide bounded latencies while sacrificing resource efficiency. Clock management introduces additional complications, as hardware clocks naturally drift and Network Time Protocol synchronization cannot overcome fundamental limitations imposed by network propagation delays. Monotonic clocks measure elapsed time reliably, while time-of-day clocks present hazards through discontinuous jumps, accumulated drift, and leap second anomalies. Ordering events using unsynchronized time-of-day clocks, particularly in last-write-wins strategies, risks silent data loss when causally subsequent writes receive earlier timestamps, violating causality assumptions. Processes further suffer unbounded pauses from garbage collection execution, virtual machine migration, or synchronous I/O operations, creating scenarios where nodes incorrectly believe expired resource locks remain valid, potentially enabling data corruption. Solving these challenges requires distributed algorithms explicitly designed for realistic system models, most notably the partially synchronous model assuming crash-recovery faults with honest but unreliable nodes rather than Byzantine faults. Because individual nodes cannot rely on isolated knowledge, critical system decisions such as declaring nodes failed must achieve agreement through quorum-based voting. Preventing corrupted writes from zombie nodes holding expired locks necessitates fencing tokens, monotonically increasing identifiers issued by lock services that resource servers validate to reject outdated requests. Formal correctness divides into safety properties, guaranteeing undesirable outcomes never occur, and liveness properties, ensuring systems eventually recover to functional states.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 8: The Trouble with Distributed Systems

Related Chapters