Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome back to the Deep Dive.
Today we are, well, we're taking a massive swing at one of the most fundamental challenges in modern computing.
A really big one.
Yeah, designing a key value store.
This is the engine room,
the absolute core of almost every large -scale non -relational database you interact with daily.
Think Dynamo, Memcached, Redis.
So our mission today is to really understand the engineering choices you have to make to build a system that can store petabytes of data, handle millions of requests, and crucially stay online when things inevitably break.
And that's a huge challenge.
When we talk about a key value store, we're really talking about the simplest database possible.
You're just mapping a unique identifier, the key to some chunk of data, the value.
And that key could be plain text, something like user profile 123.
Right, or it might be a hash string like 7EF3DCC9, something more compact for better performance.
And the value itself can be anything, a string, a whole JSON object, a list.
Whatever you want to throw at it.
But the key thing,
the really important functional rule here, is that the database usually treats that value as totally opaque.
Meaning it doesn't.
It doesn't care what's inside.
Exactly.
It doesn't try to parse it, it doesn't try to understand it.
It just stores it and retrieves it whole.
So for our design, the goals are pretty ambitious.
We need simple put -and -get functions for data under 10kb, but at a normal scale.
And we need, what, high availability, automatic scaling.
Yep.
Automatic scaling, low latency, and this thing you hear all the time, tunable consistency.
Right.
And that immediately kicks us out of the realm of a single server.
Why can't we just stick an optimized hash table on one giant machine?
Because eventually that single machine runs out of memory, or I -O capacity, but mostly it just runs out of space.
To hit that big data requirement, you have to go distributed.
A distributed key value, or… Or distributed hash table, yeah.
And the moment you have multiple servers connected by a network,
well, that's when you run straight into the most famous and most challenging constraint in all of distributed systems.
Which brings us to the CAP Theorem.
For anyone designing these big systems, this is basically the impossibility theorem.
It says in a distributed network, you can only ever guarantee two out of three properties.
Those being consistency, availability, and partition tolerance.
Let's define those, because the terms can get a little fuzzy.
Sure.
Consistency is the easy one to grasp.
It just means all clients see the same data at the same time, no matter which server they're talking to.
It's like checking your bank balance.
You want the exact, up -to -the -millisecond truth.
Perfect example.
Then you have availability.
This means any client request gets a timely response.
It might not be the most up -to -date answer, but you get a response.
The system is always on.
Even if some of the nodes are down.
Exactly.
And finally, partition tolerance.
This is just acknowledging that networks are unreliable.
Communication links between servers will break.
A partition?
The system has to keep operating anyway.
So if you accept that partitions are unavoidable, and in the real world they absolutely are, then partition tolerance isn't really a choice.
It's mandatory.
It is.
Which forces you into a really hard choice.
You have to sacrifice either consistency or availability.
You can't have both.
Not perfectly.
Not in a system that asks to work across a shaky network.
So that's the foundational choice.
Let's make it concrete.
Imagine data is replicated across three nodes.
N1, N2, and N3.
And then poof, N3 gets cut off by a network partition.
What happens if our system prioritizes consistency?
If you choose a CP system consistency plus partition tolerance, which is typical for, say, a bank where data integrity is everything, you have to block operations.
Mean air.
If a client tries to write new data, the system sees it can't guarantee the write has reached N3.
So to prevent any chance of inconsistency, N1 and N2 just refuse the request.
They just stop working until N3 comes back.
They stop writing, yeah.
You get data integrity, but you sacrifice availability, the user gets an error message.
And the alternative is an AP system.
That's where you choose availability, over that strict consistency.
This is the model that most web scale systems, like Amazon's Dynamo, are built on.
When N3 goes offline, N1 and N2 just keep going, they keep accepting reads and writes.
Even if the data might be a little stale.
Exactly.
The system is always responsive.
It embraces what we call eventual consistency.
Okay, so once we've made that choice, we face the next huge problem.
How do we actually spread all this data across hundreds or thousands of servers?
Right, the data partitioning problem.
You can't just use a simple module hash, because if you add or remove one server, you have to move almost all your data, a total non -starter.
So we need something smarter.
We need consistent hashing.
Imagine a huge numbered clock face,
a hash ring.
We take all our servers, let's say 0 through S7,
we hash their names or IPs, and we place them at different points around this ring.
Okay.
Then a new key comes in, we hash that key, and it lands somewhere on the same ring.
To figure out where to store it, you just move clockwise from the key's position until you hit the first server.
That server owns that key.
Ugh, that's clever.
Because if, say, server S1 crashes, you don't have to rehash the entire system.
Its keys just naturally fall to the next server on the ring, S2, so only a small amount of data needs to be moved around.
Exactly.
It makes scaling automatic and graceful.
But what about the fact that some servers are more powerful than others?
How do you give them more of the load?
Ah, yeah, you use the concept of virtual nodes.
Instead of putting one big server on the ring once, you hash it multiple times.
So a powerful machine might appear as, say, 10 virtual nodes all around the ring.
Which guarantees it gets a proportional share of the data.
Precisely.
It handles that heterogeneity automatically.
Now, for high availability, one copy is never enough.
We have to replicate.
How does the ring help with that?
So we define a replication factor.
We'll call it NOLARs.
It's usually three.
When a key is placed on a server, say, S1, that server is responsible for replicating the data to the next NOLAR unique servers you find moving clockwise on the ring.
So if N is 3, key A0 goes to S1 and then gets copied to S2 and S3.
Exactly.
But there's a critical real -world rule here for placing those replicas.
They can't be in the same physical location.
They absolutely must be spread geographically.
You want those three replicas in different availability zones or even better, different data centers.
So if a whole data center goes down, you're still fine.
OK.
So data is replicated.
But we chose eventual consistency, which means the copies aren't always identical.
We need a way to manage that, to let engineers tune the level of consistency.
And that's where quorum consensus comes in with those three magic letters, $1 and $2.
NOLARs is our replica count, so three in our example.
Right.
NOLAR is the right quorum.
It's the minimum number of replicas that have to acknowledge a right before we can tell the client success.
And $ is the read quorum, the minimum number of replicas you have to hear back from to return a result to the client.
Yep.
And there's a coordinator node that acts as a proxy.
It takes the client's request, sends it to the replicas, and then just waits.
So if we set $1, 1, the coordinator only has to wait for the fastest of the three replicas to respond.
That's it.
As soon as one says, I got it, the write is considered successful.
But that sounds risky.
What if that one fast node goes down before the other two ever get the update?
That is the precise trade -off.
If $2, 1, 1, your system is blazing fast, your write latency is super low, but your consistency is also very low.
A read right after could hit one of the old nodes.
How do you guarantee strong consistency?
The secret sauce is making sure your write set and your read set always overlap.
And that's guaranteed when $TABA, plus RN, and ENORE.
Let's walk through that.
If on May 3, and we set both $1 and $3 to $1, a successful write has to reach at least two nodes.
A successful read has to query at least two nodes.
So any read must overlap with at least one of the nodes that got the latest successful write.
Exactly.
There's always an overlap of at least one node.
You get strong consistency, but you pay for it in performance.
Your latency is now dictated by the second slowest node, not the fastest.
So if you want high availability and speed, you might choose something like W11 and $1 and just accept that you'll have eventual consistency.
That's the choice many systems like Dynamo and Cassandra make, yes.
Okay, so we've established that these systems will have concurrent updates.
This leads to conflicting versions.
I update John's city to San Francisco on node A.
You update it to New York on node B at the same time.
How does the system even detect that divergence?
This is where a brilliant little mechanism called a vector clock comes in.
Think of it as a chain of custody for a piece of data.
It's not just one version number.
It's a list of server version counter pairs.
And this list helps it figure out if one version is an ancestor of another or if they're conflicting siblings.
Precisely.
Let's walk through it.
We start with data D1 written by server SX.
The clock is simple.
D1, SX1.
Okay, easy enough.
Then D2 is written by SX again.
The clock is now D2, SX2.
Still easy.
D2 is clearly an ancestor of D1.
No conflict.
But now it gets messy.
Right.
Now, client A reads D2, modifies it, and the right goes to a different server, Psi.
This creates a new version, D3, and its clock is D3, SX2, Psi1.
It keeps the history from SX and adds its own new version.
Exactly.
But at the same time, client B also read D2, modified it, and its right went to a third server, SD.
This creates D4, with a clock of D4, SX2, SDN.
So when the system sees both D3 and D4, how does it know there's a conflict?
It compares their clocks.
Neither clock fully dominates the other.
D3 has information about server Psi that D4 knows nothing about, and D4 has info about C that D3 doesn't have.
They both came from the same point, D2, but they branched off independently.
They are definitively conflicting siblings, and the system has successfully detected the divergence.
So what happens next?
Does the database fix it?
This is the key choice in systems like Dynamo.
The conflict gets punted back to the client.
The system doesn't try to be smart.
It presents both versions, San Francisco and New York, and tells the application, you figure it out.
Wow.
That seems like a major downside.
You're pushing a lot of complexity onto the application developers.
It is a downside.
Another one is that the vector clock itself, that list of server version pairs, can get really long if an item is updated a lot.
It's a scaling challenge.
OK.
So failures.
They're just a fact of life at this scale.
How do we figure out if a peer node has gone offline without some central server monitoring everything?
This is a perfect use case for the gossip protocol.
It's just like how rumors spread.
Every node keeps a membership list of all the other nodes and their current heartbeat counter.
A heartbeat counter?
Yeah.
Just a number that it increments every so often.
Periodically, each node sends its updated counter to a few random peers.
And those peers then, well, they gossip about it to other peers.
And if a node's heartbeat hasn't been updated for a while?
Everyone in the network quickly learns that the node is probably offline.
It's totally decentralized and really efficient.
So once we detect a failure, say a temporary one, how do we keep serving requests?
We use a couple of tricks.
First is sloppy quorum.
If the official owner nodes for a key are down and we can't meet our dollar or toto value, the coordinator doesn't just fail.
It temporarily uses the next available healthy nodes on the ring.
It's sloppy because it's not using the designated nodes.
Right.
And connected to that is hinted handoff.
Let's say node B steps in for a downed node A.
Node B stores the data, but it also saves a hint that says this data actually belongs to A.
So when A comes back online?
Node B sees that and hands the data back to A.
It resolves the temporary inconsistency.
That's for temporary failures.
What if a replica is just gone forever?
How do we keep the other copies in sync?
That's called anti entropy, right?
Right.
The anti entropy protocol.
And for this, we use something really clever for efficiency.
Merkle trees or hash trees.
Instead of comparing every single key value pair between two replicas.
Which would be a ton of data transfer.
Instead you divide the key space into buckets.
You hash the keys in each bucket.
Then you hash those hashes and you keep hashing upwards until you have a single root hash for the entire data set.
So to check for inconsistencies, two replicas just have to compare their single root hash.
That's the first step.
If the roots match, they're identical.
Done.
Zero data transfer.
If the roots don't match, they don't send the whole database.
They just traverse down the tree comparing the children's hashes.
To pinpoint the exact bucket of keys that's different.
Exactly.
It's like comparing the table of contents of two books instead of reading them word for word.
You only transfer the data that's actually different.
It's a huge efficiency win.
And what about the worst case scenario?
A whole data center goes out.
The solution there is simpler but more expensive.
You just have to replicate your data across multiple geographically distinct data centers.
No way around it.
So pulling all this together, the architecture we've built is completely decentralized.
Totally.
Any client can talk to any node.
Every node in this system has the exact same set of responsibilities.
There is no single point of failure.
And each node is a real multitasker.
What are those core jobs?
Oh, they do everything.
They handle the client API.
They run the failure detection with gossip.
They manage conflict resolution with vector clocks.
They do failure repair with hinted handoff and Merkle trees.
And of course they run the local storage engine and replication.
Let's follow a write request through a single node using a model like Cassandra's.
What's the very first thing it does?
The first goal is durability.
So the request is immediately appended to a commit log file on disk.
That's your insurance policy.
If the server crashes right then, you can recover the write from that log.
Okay, so durability is covered.
What about speed?
For speed, the data is then saved into an in -memory cache, the memory cache, fast access.
It's only when that cache gets full that the data gets flushed to its final home on disk.
Which is an SS stable, a sorted string table.
Right.
It's a sorted list of key value pairs.
Storing it sorted makes reads much, much faster later on.
Okay, so now for the read path.
We want low latency.
What's the process?
First, always check the memory cache.
If the data is there, you're done.
Return it immediately.
And if it's not, we have to go to disk and there could be hundreds of these sables.
Right.
And you can't afford to check every one.
So before you do that, you check a Bloom filter.
Ah, the probabilistic bouncer at the door.
That's a great way to put it.
It can tell you with certainty, this key is definitely not in this file.
Or it can say, this key might be in this file.
And its one quirk is?
It can have false positives.
It might tell you to check a file that doesn't have the key.
But I'll never have a false negative.
It will never tell you a key isn't there when it actually is.
So you accept a tiny chance of a wasted disk read for a massive speed up in finding the right files to check?
Exactly.
The system uses the Bloom filters to find the right SS stables, reads the data from disk and sends the result back to the client.
This has been quite a journey.
This dive really shows the engineering compromises needed to build this kind of massive infrastructure.
So to recap the pillars here, we use consistent hashing for scalability and data distribution.
We use quorum consensus, Nundollar, WW, Arlar to let the application tune its own consistency level.
And then we have this whole suite of tools, gossip, sloppy quorum, vector clocks, Merkle trees to handle failure and manage the chaos that comes from network partitions.
I think the biggest takeaway really goes back to that philosophical choice we started with.
Strong consistency is so tempting because it's simple to reason about.
But to get the high availability you need in the real world, you're almost forced to embrace the complexities of eventual consistency.
And the real art is in kicking the right values for W and R.
That's it.
The true craft of the system designer is figuring out that perfect balance, that tradeoff, between performance and integrity for their specific application.
It's a powerful reminder that so much of engineering is just finding the optimal compromise.
Thank you for joining us for this deep dive into scalable system architecture.
We'll see you next time.