Chapter 8: Distributed Systems: Introduction and

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

You know, think about the last time you hit save on a document or maybe you processed a complex financial transaction.

We interact with these massive data systems every single day and we all operate under this this really comfortable assumption that it all just works.

It just works.

It's reliable.

It's fast.

And when you write data, it's, you know, instantly and permanently there.

So today we're going to strip away that assumption completely.

We're throwing it out the window.

We are taking the hardest possible look at what happens when systems grow beyond the comfortable confines of just one computer and that shift from one machine to many.

That's where the entire landscape of computing changes.

Every single guarantee you took for granted.

Gone.

Predictable outcomes, instantaneous communication, deterministic timing.

It all vanishes.

So what are we diving into today?

This deep dive is based on a foundational chapter in database internals.

And it's so, so essential because, you know, it defines the core vocabulary, the obstacles you just can't avoid and the fundamental trade -offs.

And every single topic that comes after replication,

consistency, consensus, it all has to overcome these problems.

So our mission today is what exactly?

Our mission is to dive into the, well, the hostile environment of distributed computing.

We're going to define communication reliability, timing assumptions, and how to handle failures

in systems that are built entirely on asynchronous message passing.

Okay.

So let's unpack this.

Where do we even start?

Maybe, maybe by starting small, right?

Where that predictability first breaks down.

Exactly.

Before we even talk about tables and network latency, let's just talk about concurrency.

We're still on one computer, but we've gone from one linear set of instructions to multiple threads sharing one variable.

What happens before we even cross a single node boundary?

Well, we have to begin in the comfort zone.

The single thread embedded program.

The safe space.

It's the safe space.

If you define a variable int i1 and then you execute steps sequentially, i plus a2, i or two, you get one execution history.

And the result is always six.

Always six.

It's deterministic.

It's simple.

It's happy math.

This is the world of single machine systems where the CPU scheduler and the operating system make sure everything happens one after the other.

Right.

This is the baseline of

world we're trying so desperately to mimic in the messy distributed world.

And now let's introduce the challenge.

So we have a shared variable.

Let's call it $6 and it's initialized to $6.

Okay.

And we introduce two separate threads running concurrently.

We got the adder, which does X plus equals two and the multiplier, which does X yields two.

Simple enough.

You'd think so.

But the key insight here is that neither of those operations is truly atomic.

They aren't a single instant action.

They are sequences of micro steps.

First, you have to read the shared variable $6.

Second, you perform a local calculation.

And third, you write the new value back to $6.

And because those three little steps from the adder can get mixed up with the three little steps from the multiplier in any order, the outcome becomes wildly non -deterministic.

This is where chaos enters the picture.

Precisely.

If those steps are not perfectly synchronized, if we let the CPU scheduler just arbitrarily interrupt the process between the read and the write,

we get a whole bunch of possible outcomes.

The source material actually outlines four potential ways they can interleave and it perfectly illustrates the problem of managing shared state.

Okay.

Let's walk through the dangerous scenarios first because they really highlight what we're trying to prevent.

Absolutely.

So scenario A, imagine both the adder and the multiplier threads start by reading the initial value, 7 by an all 0 1 1, before either one of them writes back.

They both have stale data from the get -go.

Exactly.

So the adder calculates its result.

One of those two is three and it writes by three three dollars.

Immediately after, the multiplier calculates its result.

One times two is two and it overwrites the adder's result.

So the final result is seven and two two.

The entire addition operation was just lost.

Completely lost.

Even though the adder thread ran perfectly and wrote its result, that result was based on stale data and its final write was immediately obliterated by the multiplier, which was also working from stale data.

Correct.

And scenario B is just the reverse form of that loss.

They both read 6 by 9 and 1.

The multiplier writes 6 by 8 and 2.

Then the adder calculates one plus two is three and overwrites the multiplier's result.

So now the outcome is six by an all four dollars.

The outcome is six by eight of six three.

And again, one operation is completely gone.

This is the classic lost update problem.

It's a core failure mode and database concurrency control.

That's fascinating.

Just the timing of the two read operations relative to the two write operations gives us two outcomes where the system thinks both operations completed, but only one is actually there in the end.

Now let's look at the third and fourth scenarios.

Here the interleaving is a little better structured.

It reflects a sequential history.

In scenario C, the multiplier reads the initial value six, seven, and one calculates one times two is two and writes six by two.

Okay.

Then the

The adder reads the newly updated value six by two dollars calculates two plus two is four and writes six by four dollars.

The result is six by four and four dollars.

This isn't the six by six C's we were hoping for from a perfect sequential execution, but at least both processes operated on values that were, you know, derived from the previous state, not just overwriting each other.

Right.

And finally, scenario D, which gives us the intended

The adder reads six by one one calculates one plus two is three and writes six by three.

Then the multiplier reads that updated value 60 by three calculates three times two is six and writes six by six.

And the final result is six by six a code.

So just by allowing two threads to access one variable without any kind of synchronized locking, we've generated four possible results, two, three, four, or six.

This non -determinism, this is the core structural challenge and it leads us to the critical takeaway.

It does.

The critical takeaway is that concurrency is the first distributed systems problem before you even have a network before a single network cables involved.

If you consider that shared variable six dollars as say a single record in a database and the two threads is two independent servers trying to update that record, you realize the exact same failure scenarios can happen.

They play out over the network via message passing threads access shared state.

They do local calculations and they propagate results perfectly mimicking independent processes communicating across a network.

And if we don't want four possible answers for simple arithmetic in our database, we need hard rules to constrain the system's behavior.

Exactly.

The solution is what we call consistency models.

These models define the legal execution histories.

They establish a required order in which operations have to be executed and their results made visible to everyone.

So by applying a strong consistency model like strict serializability, we aim to rule out those undesirable non -deterministic outcomes like Stu bias, who fell for six by three, three dollars.

We want to ensure the final state always reflects some total sequential order of operations.

Okay, before we move on to the network, let's just quickly clarify the semantic difference between concurrent and distributed because people use those terms interchangeably all the time.

Yeah, that's a good point.

The source material clarifies really nicely with a coffee analogy.

Concurrent operations overlap in time.

So think of it as two cues leading to a single coffee machine.

Both tasks are in progress.

They're waiting for turns on that shared resource, but only one step is actually executing at any given moment and parallel operation.

Parallel operations execute simultaneously.

So it's two cues leading to two coffee machines.

Oh, okay.

They happen truly at the same time.

But the more fundamental difference for us is the communication method.

Concurrent systems often share memory, meaning they can directly access the same physical location and memory like our variable six dollars.

Right.

Distributed systems, by definition, rely entirely on message passing between processors and each processor maintains its own local state.

That message passing is what introduces all the network uncertainty we're about to get into.

So we've learned that even local access to shared state is non -deterministic.

Now we try to coordinate state across a network using message passing.

What's the new layer of uncertainty that immediately paralyzes a distributed system designer?

It's the sheer ambiguity of failure.

What do you mean?

When a process sends a message like update record six dollars to six towns to a database node across the network and doesn't get a response, the basic question is why?

Right.

The message could be lost.

The database node could be overloaded and slow.

It could have crashed or it could have successfully processed the update, but the return acknowledgement got lost on the way back.

So we don't just not know the reason.

We don't even know if the operation succeeded or failed.

This is the huge difference between calling a local function and making a remote procedure call.

And it forces us to formalize our uncertainty.

We have to explicitly define our system in terms of synchrony.

Are there any timing assumptions?

Our message is guaranteed to be delivered within a bounded time.

And if we can assume bounds, then we can introduce timeouts and retries.

We also need a really rigorous failure model,

a precise description of how components fail.

Is it a crash stop?

Do they just omit messages?

Are they temporary failures?

And the entire point of defining all these models is to achieve fault tolerance.

Exactly.

Failures are an engineering certainty, not a possibility.

And fault tolerance is the property of a system that lets it keep operating correctly, delivering its service even in the presence of those inevitable failures.

And we usually achieve this through redundancy.

Usually through redundancy, yeah, meaning we add backup nodes.

But of course, as soon as you add redundancy, you are immediately thrown right back into the complexities of concurrency control we just talked about, only now it's magnified by network uncertainty.

So we really need to step back and describe the communication medium itself.

And that medium, as we're about to discuss, is basically built on a series of broken promises.

That's a good way to put it.

So if you try to build a distributed system, assuming everything behaves like a local program, you are destined for disaster.

Peter Deutsch, James Gosling, and Bill Joy famously cataloged these dangerous assumptions.

They called them the eight fallacies of distributed computing.

And these are the misconceptions that doom systems before a single line of code is even committed.

They're so powerful because they expose the huge gap between our idealized software models and the physical unreliable reality of networking hardware and physics.

Let's break down the first three.

They deal directly with the network link itself.

All right.

So fallacy one, the network is reliable.

This seems like a basic expectation, right?

But it's often wrong.

How so?

A successful PCP handshake only proves the connection can be established at that moment.

But the link itself is subject to constant interruption.

Messages get lost.

Cables get disconnected.

There are configuration errors.

Switches can fail.

And you mentioned the most insidious scenario is the partial failure.

It is.

The request reaches the server.

It's processed successfully.

But the acknowledgement, the response gets lost on the way back.

So from the sender's perspective, it looks like a total failure, even though the state change was successfully applied on the remote side.

And that forces the designer to make a really difficult choice.

It's the classic dilemma.

Do you retry and risk a duplicate operation, or do you assume failure and just halt the process?

Tough call.

Then we move to the cause of that communication.

Fallacy two.

Latency is zero.

Remote calls are drastically, drastically slower than local function calls.

When a message is sent, it travels through so many software layers.

The application, the OS kernel, the network card driver, and then it has to traverse the physical medium.

And even at the speed of light.

Even at

optical fiber cables introduce measurable delays.

Let's pause on the Flashboys example you mentioned because it so vividly demonstrates how expensive these microlatencies are in the real world.

It's a stunning illustration of physical constraints driving economic behavior.

I mean, high frequency trading firms invested hundreds of millions of dollars to build dedicated microwave links or lay entirely new fiber optic cables.

Just to shave off a few milliseconds.

A few milliseconds.

All for the singular purpose of shaving milliseconds off round trip times between, say, Chicago and New York.

The goal was to get market data or execute a trade just a fraction of a millisecond before a competitor.

Wow.

So assuming zero latency means you fundamentally misunderstand the physics and the economic weight of that delay.

It's not a technical footnote.

It's a huge operational cost.

And even if we accept that there's a delay, we might assume the capacity is And that's fallacy three.

Bandwidth is infinite.

The moment you scale up your processes or increase your message rates or handle bigger data payloads, resource contention becomes immediate.

Even if the network supports massive bandwidth, the actual effective bandwidth available to any single process is finite.

It's shared and it's subject to immediate congestion as demand goes up.

Which is why we have things like load balancers and traffic shaping.

You're essential.

So we've covered the link itself.

Now let's talk about the destination process.

We often forget that even the fastest server still has to do actual work.

Yeah, fallacy four.

Processing is instantaneous.

It's not just the network latency.

Once the packet arrives, the remote server has to dedicate CPU cycles to it.

It has to deserialize the message, perform the core business logic, which often involves expensive disk IO, memory allocation,

complex computation,

and then serialize and send a response back.

And all that takes time.

It's a substantial non -zero cost.

And the problem is amplified in parallel operations.

If your task requires coordinating a hundred servers,

say, aggregating a query across a hundred shards, your total performance is limited by the response time of the slowest remote server plus all that coordination overhead.

The tail latency problem.

That's the one.

Even if 99 servers respond in 10 milliseconds, if that hundred server takes a full second because of, I don't know, garbage collection pause or a disk seek.

Your whole operation takes one second.

One second.

And that immediately ties into the next fallacy about system buffering.

Fallacy five.

Queue capacity is infinite.

This is the back pressure dilemma.

Queues are wonderful tools.

They help decouple the sender from the receiver.

They allow for asynchronous processing.

They absorb short bursts of traffic.

But the fallacy is thinking you can just make the queue bigger and solve your throughput problem.

Exactly.

It doesn't work.

Increasing capacity does not increase the processing rate.

It just means messages sit there longer, waiting their turn, which drastically increases latency.

So if the consumer is slower than the producer,

the queue just gets deeper and deeper until the whole system stalls or crashes.

And the engineering solution here is what you call back pressure.

Back pressure is crucial.

It's the strategy of communicating upstream.

I am overwhelmed.

Slow down.

Instead of just letting requests pile up and eventually dropping them, the consumer actively signals the producer to reduce its sending rate.

Can you give me a tangible example of how a system uses back pressure instead of just letting the queue grow?

Sure.

Think of a continuous manufacturing line.

If the assembly station, the consumer, can only handle 10 parts per minute, but the fabrication station, the producer is sending 20, the parts are just going to pile up until the whole line stops.

Right.

Back pressure is when the assembly station sends a signal, or maybe a credit, back to the fabrication station saying, I have capacity for five more parts.

This makes sure the production rate is governed by the slowest necessary step.

It maintains stability and avoids those latency spikes.

And it has to be designed from day one.

From day one.

Absolutely.

Yeah.

For the most, I think, philosophical of the assumptions, the one that makes true coordination almost impossible time.

Fallacy six.

Clocks on remote machines run in sync.

Every process has its own local physical clock, and these clocks drift over time.

Temperature variations, hardware differences, inherent instability.

It all causes drift.

So relying on local timestamps to, say, order events across different servers is disastrous.

So if you see two logs stamped one millisecond apart on two different machines, you have no guarantee of which event actually happened first.

None whatsoever.

How do systems dealing with extremely sensitive ordering problems, like massive financial transactions or global data consistency, get around this basic physical flaw?

Well, they can't bypass the drift, but they can account for it.

The solution the source cites is Google Spanner's TrueTime API.

It doesn't just return a single timestamp.

What does it return?

It returns a timestamp plus an uncertainty bound.

An uncertainty bound.

What does that represent and why is that so useful?

The uncertainty bound defines a tiny window, maybe a few milliseconds, in which the clock is guaranteed to be accurate relative to all the other synchronized clocks in the cluster.

Because Spanner knows that the absolute time of an event at ALR happens sometime within, say, Teerliest or Littest, it can use this bound to enforce an order.

When it's coordinating transactions, it can force any subsequent transaction to wait until its start time is outside the previous transaction's uncertainty bound.

This tiny weight ensures that every transaction is ordered strictly globally, achieving this high level of consistency called external consistency or serializability, all despite the physical clock drift.

It's a marvelous application of physics to system design.

Okay, the final two fallacies.

They deal less with the raw mechanics and more with the mental shortcuts that system designers take.

Right.

Fallacy seven.

State is fully consistent.

In traditional databases, the default assumption is strong consistency.

When you write a value, every subsequent read sees that exact value immediately.

But many modern systems don't work that way.

No, many modern distributed systems, especially large -scale web services or NoSQL databases,

intentionally use relaxed consistency models.

They allow for temporary state divergence between replicas.

So we assume consistency, but we're building systems that are eventually consistent.

What are the consequences?

The consequence is you have to design for disagreement.

If you assume consistency, but your system allows divergence,

you introduce these subtle high -impact bugs.

We saw this with an infamous Apache Cassandra bug related to schema propagation.

How did that divergence lead to actual corruption?

So the schema, the structure of the data itself was replicated eventually, not instantly.

If one server got a schema change, schema B, but another server was still operating into the old schema, schema A, they could process data differently.

Oh, no.

A client reading a record might hit the schema A server, which interprets the bytes incorrectly because the data was written by a schema B server.

This leads to data corruption, misplacing records because of different views of the cluster map.

Silent, hard -to -debug inconsistencies.

You have to rely on mechanisms like conflict resolution to detect and fix these divergent states instead of just assuming agreement.

And the final fallacy addresses the developer experience itself.

Fallacy eight.

Local and remote execution are the same.

This is the abstraction trap.

It's really common to hide the immense cost of a remote procedure.

Call latency, serialization, network transport behind a simple local looking API.

It makes the code cleaner.

It does, but it encourages developers to treat remote calls casually.

If a developer interleaves lots of small blocking remote calls with high -speed local processing, the performance degradation can be massive and completely unintended.

The network physics just dominate the application logic.

It sounds like this entire set of fallacies boils down to one thing.

The network and the remote server are not part of your local program.

They are foreign, unreliable entities, and you have to design your application that way.

Exactly.

Failures are subject to Murphy's Law, and our mindset has to shift from if to when.

We must prepare for partial failures, where only a small part of the system becomes unavailable.

And the most critical partial failure is the network partition.

It is.

That's where the system splits into two or more isolated groups that can't communicate, but they keep operating, which leads inevitably to conflicting results.

Since it's impossible to think through every single failure scenario—I mean, the number of combinations is astronomical—how do engineers effectively design fault -tolerant systems?

They embrace aggressive adversarial testing.

The source really emphasizes this.

Since human imagination fails, we need testing harnesses that simulate real -world adversity.

We simulate increased latencies, clock divergence,

physical hardware errors like bit rot.

Random data corruption.

Right.

And network partitions.

We hear about famous examples of this kind of testing.

Can you elaborate on some of the tools people use?

Certainly.

Netflix's Chaos Monkey is probably the most famous example.

It takes the radical approach of randomly shutting down services and entire machines in production.

In production.

This forces engineers to design systems that are resilient enough to handle immediate failure without any human intervention.

Tools like ToxaProxy let developers introduce simulated network failures in controlled environments, simulating limited bandwidth, packet loss, long random delays,

and things like toribdifs simulate file system and hardware errors, making sure durability protocols hold up when the disk returns an error or corrupt data.

So the point is to make the testing environment as hostile as a real -world cluster will eventually become over its lifetime.

That's it.

And once a failure does occur, what are the primary engineering strategies to prevent it from snowballing into a cascading failure?

A cascade happens when a failure in one service increases the load or latency on its dependencies, which makes them fail, which in turn increases the load further downstream.

You have to break that cycle.

How do we stop that propagation?

First, with circuit breakers.

It's a concept borrowed from electrical engineering.

In software, a circuit breaker monitors failure rates for a dependency.

If the rate exceeds a certain threshold, it trips.

It immediately stops sending requests to the failing service and just returns a fast -fail error to the caller.

So that gives the failing service time to recover without being hammered by more requests.

Yes, and it prevents the overload from propagating backward.

So the circuit breaker manages the healthy part of the system's interaction with the broken part, but what about the clients trying to access that service?

They need a smart retry strategy, and that means using back -off and jitter.

If a server fails, the client is programmed to retry.

But if hundreds of clients all retry immediately at the same time, You get a thundering herd problem.

A massive thundering herd problem.

You instantly overwhelm the recovering server, causing it to crash all over again.

So back -off solves the immediate repetition, right?

Yes.

A back -off strategy schedules retries, usually increasing the time between subsequent requests from a single client.

It's often exponential back -off.

But if all clients use the exact same formula, say, wait one second, then two, then four, they will all wake up and retry at the same time, still causing these huge load bursts.

Which is where jitter steps in to add the necessary randomness.

Precisely.

Jitter adds a small random time period to that back -off strategy.

So instead of retrying exactly at the four -second mark, one client might retry at 4 .1 total seconds, and another at 3 .98 seconds.

This random distribution smooths out the load over time, giving the recovering system a genuine chance to stabilize.

And we can't overlook the need for just basic data integrity checks, which are essential when data is moving across these unreliable links.

Absolutely not.

Validation through mechanisms like check -summing is non -negotiable.

Hardware errors, transit network errors, simple bugs.

They can all cause bit rot or silent data corruption.

If we don't verify the integrity of the content exchanged between nodes, a corrupted copy could be replicated across the entire cluster.

Potentially overriding healthy data.

Validation ensures that the data we rely on has not been fundamentally altered during transit or storage.

That sets the stage beautifully.

We've accepted that the network is hostile.

It's prone to failure.

So the next engineering challenge is turning that chaos into a structured, reliable communication protocol.

We start with the unreliable base and layer guarantees on top.

This section moves us into the engineering abstractions.

The theoretical links that let us build things like TCPIP or CAFTA protocols.

It's all about creating an illusion of perfection.

Okay, where do we start?

We start with the simplest abstraction.

The fair loss link between a sender A and a recipient B.

This is the least reliable model you can imagine.

It's a lot like UDP semantics.

When sender A transmits a message M, A can be in one of three states.

The message hasn't arrived yet.

The message is irrecoverably lost or it was successfully delivered.

And crucially, the sender does not know which state the message is in.

And what are the very limited guarantees that this raw unreliable link provides?

It provides three minimal guarantees.

One, fair loss.

If A and B are both correct and A just keeps retransmitting M infinitely often, the message will eventually be delivered.

So it can get through?

It can get through.

Two, finite duplication.

The link won't deliver the message an infinite number of times.

And three, no creation.

The link will spontaneously invent messages that were never sent.

The major problem here, as you said, is that the sender has no way to eliminate that awful state of uncertainty.

Irrecoverably lost.

A just has to keep retrying and hoping.

So to solve that delivery uncertainty, we introduced message acknowledgments, ACKs, and unique sequence numbers.

Communication becomes bidirectional.

Sender A sends M with sequence number no.

Recipient B gets it and sends back ACK.

And if A receives the ACK,

A can be confident of delivery?

Confident, yes.

But the ACK itself is a message and it's traveling over that same unreliable link.

It can get lost.

And that keeps the sender in the dark.

To solve this, we abstract the link further into what's called the stubborn link.

This abstraction is achieved by the sender, indefinitely retransmitting the message after a defined timeout period.

So since the sender just keeps trying until it succeeds, the state, irrecoverably lost, is eliminated from the sender's perspective.

Exactly.

The message is either delivered or it eventually will be.

But the stubborn link guarantees delivery, but now we have guaranteed duplication because of all those indefinite retransmits.

Exactly.

And this forces us to address idempotence.

An operation is idempotent if executing it multiple times gives you the exact same result without any additional side effects.

So reading data is idempotent.

Writing an update like set 6 text is idempotent.

But the multiplication operation 6x equals 202 from our first example is not idempotent.

If we send that message twice, the result changes from 6 to 12.

Right.

And since we can't redesign fundamental operations like say, charging a credit card or appending to a log to be inherently idempotent, we need an equivalent guarantee at the system level.

We need to provide the guarantee that's equivalent to idempotence, which is called deduplication.

And this is the key component for the next stronger link abstraction.

The perfect link?

The perfect link.

This is the theoretical ideal of reliable communication.

And it's built by layering deduplication and ordering guarantees on top of the stubborn links reliable transmission.

Okay.

So how does the recipient node manage both deduplication and ordering when messages might arrive multiple times and out of sequence?

It relies heavily on those unique sequence numbers.

The recipient tracks two key values.

No consecutive and then processed.

Let's break those down.

What is consecutive used for?

No consecutive tracks the highest sequence number received where all messages leading up to it have also been received.

So if message five arrives before message four, the recipient notes that no consecutive is stuck at three.

And message five gets buffered.

It gets put into a reordering buffer.

This ensures first in first out FIFO ordering at the application level.

Only when message four finally arrives is the stream passed up to the application.

Four then five.

Okay.

So it maintains the correct historical sequence for processing.

And what about not processed?

Not processed tracks the highest sequence number that has been definitively executed by the recipient process.

Deduplication is then trivial.

Any incoming message whose sequence number is less than or equal to not processed is just discarded.

We know it's duplicate created by the stubborn links retransmissions.

That makes perfect sense.

So by managing these two counters, the recipient node takes the absolute chaos of the network and imposes a perfect sequential non -redundant order on the messages.

And that gives the perfect link its three powerful properties.

Reliable delivery, no duplication, and no creation.

This is the theoretical basis for practical protocols like TCP, right?

Which while it's more complex because of flow control gives us that session scope reliability we rely on every day.

That's the one.

So now we arrive at one of the great jokes in distributed computing.

There are only two hard problems in distributed systems.

Exactly once delivery, guaranteed order of messages, and exactly once delivery.

Why is this concept so elusive even with the perfect link?

Because we have to distinguish between delivery semantics and processing semantics.

You have at least once delivery where the sender retries until it gets an ACK, meaning duplicates are possible.

And at most once.

And at most once delivery where the sender sends once, no retry, so loss is possible.

The reason exactly once delivery is so hard at the application level is that the perfect link only guarantees delivery at the packet level within a session.

If the message is successfully processed and persisted by the receiver, but the final ACK back to the sender gets lost,

the sender's application layer, when his timer goes off, assumes failure and retries the entire operation.

And that retry results in a duplicate operation from the application's perspective.

Right, so if the perfect link handled the network side, the application layer is responsible for the processing side.

The real goal isn't delivery, it's exactly once processing.

Exactly once processing.

Precisely.

The goal is to make sure the core operation, the record update, the credit card charge, is applied and persisted once and only once.

We achieve this by combining the perfect link's delivery guarantee with our deduplication mechanism.

Even if the network delivered the packet three times, the receiver processes the message once,

and then the other two are just silently ignored as duplicates.

You mentioned that achieving true guaranteed exactly once processing requires something called common knowledge.

Can you quickly remind us why that's such a high bar?

Common knowledge is the state where everyone knows the state, and everyone knows that everyone else knows the state, and so on, infinitely.

It requires an absolute confirmed consensus.

And if a system is asynchronous and communication can fail, you are always one message away from achieving that common knowledge.

You can't safely proceed knowing the other party might not have received your final confirmation.

And as we're about to discuss, those limits on consensus aren't just practical engineering problems, they're mathematical impossibilities.

Yes, they are.

So we've built the perfect link.

We can send messages reliably.

Now let's try to use those perfect messages to agree on a state value or a course of action.

It turns out that achieving agreement across multiple independent processes is where the theoretical walls just go straight up.

Right.

And this brings us to the two generals problem, a thought experiment from 1975 that just beautifully illustrates the impossibility of achieving common knowledge under asynchronous, unreliable communication.

Let's set the scene.

We have two generals, A and B, on opposite sides of a fortified city.

They have to attack at the same time to succeed.

If only one attacks, they both fail.

And they can only communicate via messengers, and those messengers might be captured.

So general A decides on 4 a .m.

and sends a messenger with the plane M, and then A has to wait.

A cannot safely attack until A knows that B received the message.

Okay, let's say the messenger arrives.

General B now knows the plan, but B knows that A is waiting for confirmation.

So B has to send an acknowledgement, an ACK.

If B attacks without sending that ACK and A didn't receive it, A won't attack and B will be destroyed.

So B sends the messenger with the ACK.

But now D has the uncertainty.

B doesn't know if the messenger carrying that ACK made it back to A.

Right.

If A didn't get the ACK, A will assume B never got the original plan and will abort the attack.

So B can't safely proceed either.

So A gets the ACK and is now ready, but now A has to confirm to B that the ACK was received.

So A sends an ACK ACK, but that messenger might be captured.

This is the infinite regression.

They are perpetually trapped in this loop.

No matter how many confirmations they exchange, the last messenger in the chain is always subject to failure.

They can never reach that required state of common knowledge to safely agree to attack.

If they were a database system, they could never agree on a transaction commit.

That is profound.

It shows that coordination, even with guaranteed message content, is fundamentally bounded by the possibility of the link failing.

But what about when the process itself fails?

Right.

So the two generals' problem deals with the link failure.

The FLP paper from Fisher, Lynch, and Patterson in 1985 deals with process failure, specifically the inability to guarantee agreement when a process crashes.

Okay, before we get to the conclusion, let's define the goal of a robust consensus protocol.

A consensus protocol aims to bring multiple processes from some initial state to a unanimous decided state.

To be correct, it has to satisfy three essential properties.

Which are?

One, agreement.

All non -faulty processes have to eventually decide on the same value.

Unanimity is required.

Two, validity.

The agreed value has to have been one of the values initially proposed by a participant.

No inventing new values.

And three.

And three, termination.

Here.

All non -faulty processes must reach a decision state in a bounded amount of time.

And the crucial assumption the FLP paper makes is that the system is completely asynchronous.

This is the linchpin.

A truly asynchronous system has no shared notion of time, no bounds on message delivery latency, and no bounds on processing speed.

And critically, because there are no bounds, timeouts cannot be used.

And without timeouts, we cannot reliably detect failures.

Exactly.

If process A sends a proposal to process B, and gets no response after five seconds.

A has no way to distinguish between two fatal possibilities.

B has crashed entirely, which is a failure.

Or B is just running extremely slowly, which is not a failure, but a performance issue.

A slow process is indistinguishable from a crashed process in a purely asynchronous world.

And what is the inescapable conclusion that comes from that indistinguishability?

The FLP conclusion is this.

In a completely asynchronous system, no algorithm can guarantee consensus in a bounded time if even a single process crashes.

The property that breaks is termination.

The system might run forever, waiting for that crashed or infinitely slow process.

And so it can never guarantee it will reach a decision.

That sounds like a mathematical death sentence for any robust distributed database.

I mean, if we can't guarantee consensus when one server fails, how can systems like Paxos or Raft even exist?

This is the essential nuance.

And it leads us directly into our final section.

FLP proves consensus is impossible only under the strictest, most adversarial asynchronous rules.

The solution that every real -world consensus algorithm uses is simple.

We move away from the purely asynchronous model.

We cheat.

Since pure asynchronous consensus is mathematically impossible if we demand termination, all practical workable algorithms introduce assumptions about timing and failures.

We have to sacrifice theoretical purity to achieve practical performance.

So let's formalize the three models of timing assumptions designers use to evade that FLP barrier.

Okay.

First, you have the asynchronous model.

This is the FLP environment.

No timing guarantees at all.

Failure detection via timeouts is impossible.

Second, you have the synchronous model.

This is the ideal world.

It assumes comparable process execution rates, strictly bounded transmission delays, and synchronized clocks or at least known bounded drift.

A synchronous model seems too good to be true for the internet.

But what does assuming synchrony allow us to do?

It allows us to use timeouts with absolute confidence.

If I send a message and I don't get an expected response within twice the maximum network latency, I can declare the remote process failed and take decisive action.

Like electing a new leader.

Exactly.

Or abandoning the transaction.

This is necessary for highly coordinated real -time systems.

But there's a danger.

If those timing assumptions are grossly violated, if the network is partitioned and latency spikes,

multiple non -faulty processes might incorrectly declare each other dead, which could lead to simultaneous leader elections or conflicting commitments.

And what's the practical middle ground model that most modern consensus algorithms rely on?

That is the partially synchronous model.

This is the practical evasion of FLP.

It assumes that while strict timing bounds might not hold at all times, they hold true most of the time.

The system might be asynchronous for some periods, but for sufficiently long periods it behaves synchronously.

This model, proposed by Dwork, Lynch, and Stockmeier in 1988, is what allows us to design robust consensus algorithms like Raft and Paxos.

They use timeouts and re -elections, accepting that during brief periods of asynchrony, liveness or termination might be temporarily sacrificed, but consistency agreement is always preserved.

Now that we have the timing context, we can define the types of failures the system has to tolerate.

Let's start with the simplest and most common.

Crash faults.

The process just stops executing any further steps required by the algorithm.

And we distinguish between two variations of this.

You have crash stop, where the process stops and stays stopped.

Algorithms designed for this are simpler because they don't have to account for a process rejoining the cluster.

And the other one?

Crash recovery.

The process stops but is expected to restart later and try to resume its duties.

This is the model most modern database systems use.

It requires processes to store durable state on disk so they can recover their exact state, what they knew, what they committed, right before the crash.

Crash recovery introduces a lot of complexity because the recovery protocol itself has to be robust against failure.

Indeed.

It has to manage every possible recovery point, making sure the process doesn't resume with a state that contradicts the rest of the cluster.

Okay, next.

Let's consider failures that aren't a total shutdown.

Omission faults.

This is where the process skips some algorithm steps or fails to send or receive messages.

This failure model captures real -world issues like network congestion, full queues, our Fallacy 5 problem, or processes running too slow to respond in time.

And crucially, a slow node, by failing to send timely responses, appears to the rest of the system as though it's omitting steps, which blurs the line between an omission fault and a performance issue.

And finally, the most difficult and expensive fault to design for, the arbitrary or malicious fault.

Arbitrary Byzantine faults.

This is where the process behaves maliciously.

It might supply incorrect values, lie about its internal state, or even cooperate with other faulty nodes to subvert the system.

Byzantine failure tolerance is the highest bar.

And why would a system need to account for this level of adversarial behavior?

Primarily in environments where trust cannot be assumed, or where the stakes are just incredibly high.

Like what?

For instance, in aerospace systems controlling aircraft, or in decentralized trustless environments like public blockchain networks.

These systems have to be designed so that even if a minority of nodes are actively trying to sabotage the outcome, the majority of non -faulty nodes can detect the deception and still reach a valid consensus.

So given this whole spectrum, from simple crashes to malicious intent, where do most distributed database systems draw the line for practicality?

They assume simplicity and they prioritize performance.

The vast majority of distributed database algorithms, including popular replication and consensus systems, assume the simpler crash failure model, either stop or recovery.

They rely on redundancy to mask these common failures, trading off the theoretical ability to handle malicious Byzantine faults for much easier implementation, better performance, and conceptual simplicity.

And this design choice is fundamental.

It means they can achieve consensus faster and more efficiently, but only if they're operating in an environment where all component failures are due to technical malfunctions, not malice.

So today, we took a deep dive into the very roots of distributed systems design.

We saw how the simple act of sharing state created non -determinism, forcing us to invent consistency models.

And we cataloged the inevitable failures inherent in network communication through the eight fallacies, recognizing that latency, bandwidth, and consistency,

they're not guarantees, they're trade -offs.

Exactly.

We learned how to engineer reliability back into the link by layering abstractions, moving from the chaotic fair loss link to the perfectly sequential perfect link, using sequence numbers and deduplication.

And then the theoretical limits of coordination, the two generals, can never achieve common knowledge because of that unreliable link.

And the FLP theorem proves that guaranteed consensus is impossible in a purely asynchronous world if a single process crashes.

Right.

The entire structure of modern distributed databases is built on recognizing these impossibilities and engineering clever workarounds, primarily by abandoning the purely asynchronous model for the partially synchronous model.

So the crucial insight here is that you now understand the cost of every high -level database guarantee when a database promises durability or strong consistency.

It's promising that its consensus algorithm can successfully navigate these fallacies and these impossibility results.

It does that using explicit rigorous failure models and timing assumptions.

And this knowledge is the key to understanding why one system performs better than another, or why a system breaks when network conditions degrade.

This has been an incredibly fundamental look at the obstacles that define the entire landscape of distributed computing.

It really has.

So for our final provocative thought, let's revisit our escape hatch from FLP.

We use timing assumptions, the partially synchronous model, to bypass the impossibility result and guarantee consensus.

If these timing bounds are massively violated, if network latency spikes far beyond the expected maximum, or clocks drift widely and the system effectively reverts to being truly asynchronous, what is the single most critical operational failure mode that results from this breach of assumption?

A hint.

Think about what happens when the timeout mechanism, which is our only way to distinguish a slow process from a crashed one, fails.

The answer is that the system can suffer a split -brain event, where multiple non -faulty nodes incorrectly time out the legitimate leader and elect new, conflicting leaders, violating the critical consensus property of agreement.

Something for you to ponder.

Until next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Transitioning from single-machine to distributed database systems introduces fundamental challenges rooted in the absence of a shared execution timeline, creating complexity that single-node architectures never face. Understanding the distinction between concurrency, where operations interleave but may not execute simultaneously, and parallelism, where genuinely simultaneous execution occurs across multiple processors, is essential for reasoning about system behavior. Distributed computing rests on several widely held but dangerously incorrect assumptions: that networks reliably deliver all messages, operate with negligible delay, or provide unlimited bandwidth. These fallacies create serious vulnerabilities; realistic systems must account for transmission delays, implement message queuing strategies, and deploy backpressure mechanisms to prevent resource exhaustion under load. Coordinating events across geographically dispersed nodes becomes problematic when local clocks drift unpredictably, making timestamp-based ordering unreliable without specialized synchronization hardware. Link abstractions ranging from fair-loss communication, which may silently drop packets, to perfect links that guarantee delivery using sequence numbers and acknowledgments, provide a framework for reasoning about message guarantees. Foundational theoretical results establish hard limits on what distributed systems can achieve: the Two Generals' Problem demonstrates the impossibility of perfect coordination over unreliable channels, while the FLP Impossibility theorem proves that asynchronous consensus cannot be guaranteed when even a single process might crash. Different failure modes present escalating challenges, from crash-stop failures where nodes simply halt, to message omission failures, to Byzantine faults where compromised or malicious nodes exhibit arbitrary behavior. Building resilient systems requires practical techniques including circuit breaker patterns that prevent cascading failures, exponential backoff with randomized jitter to avoid thundering herds, and careful management of retry logic. Developers must internalize that partial failures are inevitable in distributed databases and design accordingly.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 8: Distributed Systems: Introduction and

Related Chapters