Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Dive.
Today we are cracking open chapter 6 of the classic designing data intensive applications and we are going straight to the core of scale,
partitioning.
Right.
I mean, if you're building anything that needs to handle massive user loads, billions of records, global traffic, that kind of thing.
Exactly.
You hit the limit of a single database server almost immediately.
And that's where partitioning or sharding, as a lot of people call it, that's the engineering answer.
It's about taking a giant database and intentionally breaking it into smaller pieces.
Absolutely.
And our mission for this Deep Dive is to get into the nuts and bolts of that.
We're going to distill the core concepts, the strategies, and the really high stakes trade -offs involved in distributing all that data.
Because that's what it comes down to.
This is what determines whether your application just works or if it can actually scale globally.
And we should probably get the terminology straight first, right?
We're going to see partitioning.
We will, yeah.
It's the most established term.
But you'll see this everywhere under different names.
MongoDB calls them shards.
HBase uses regions.
Cassandra has nodes.
The idea, though, is always the same.
It's all about achieving scalability by spreading the load across a whole cluster of machines.
A shared nothing cluster, specifically.
And when we talk about scalability, what we're really trying to do is avoid two fatal flaws.
Okay.
What are they?
First is skew, where your data isn't spread out evenly.
And second is hotspots, where one of your partitions just gets hammered with way too much of the query load.
And if you fail at either of those, you haven't really scaled anything.
Not at all.
You've just replaced one giant bottleneck with a new, slightly smaller bottleneck.
Okay.
So let's untack this.
I mean, you can't just randomly assign records to partitions.
Sure, that solves skew.
But then to find any single piece of data, you'd have to, what, query every single machine?
An impossible read scenario.
So the very first challenge is figuring out a reliable, repeatable way to decide which piece of data goes where.
Which leads us to the two fundamental strategies for key value partitioning.
The first one is probably the most intuitive.
It's partitioning by key range.
Like a paper encyclopedia.
That's the analogy from the book.
Exactly.
Volume one gets A and B, volume two gets C and D, and so on.
You assign a continuous range of keys to each partition.
The huge advantage there has to be the sorted order, right?
If your keys are sequential, you can do these incredibly efficient range scans.
Oh, absolutely.
If you're building a social media feed and your keys are, say, user ID plus a timestamp, you can pull a user's entire history for one week just by hitting a single partition.
It's critical for any kind of time series data.
But the drawback is the hotspot risk.
That's the killer.
It is.
I mean, imagine a database partitioned purely by time.
All your new writes for today land on one partition.
That single machine gets 100 % of the write traffic, and the rest of your expensive cluster is just sitting there idle.
So if your key is something that's always increasing, like a timestamp, you're basically guaranteeing a hotspot.
You'd have to prefix the key with something else, like a user ID, to spread those writes out.
You'd have to, yeah.
You deliberately break up that nice continuous key to get distribution.
But sometimes you just need a better way to distribute the load immediately.
And that brings us to the second strategy.
Partitioning by hash of key.
Right.
So here we stop caring about the key's alphabetical or numerical order.
We just run the key through a good hash function like MD5 or FNV.
And that function turns the key into what looks like a totally random uniform number.
And because that hash output is uniform, you can assign different ranges of hash values to your partitions.
This is fantastic for load distribution.
Keys that would have been right next to each other and caused a hotspot are now scattered all over the cluster.
But the trade -off is, it's brutal.
You completely lose the sorted order.
It's gone.
So if you want to ask a question like, find all customers between Smith and Wesson, you can't.
Smith and Wesson are hashed to completely different random places.
You have to query all the partitions.
That's that scatter -gather process.
Exactly.
And a quick note on terms here.
The source material mentions consistent hashing, which can be a bit confusing.
In the database world, it's really clearer to just stick with term hash partitioning.
The original academic idea doesn't quite map to how these systems handle rebalancing today.
So we're stuck.
Key range gives us sorted reads but hotspots.
Hashing gives us great distribution but kills those range reads.
Is there a way to get the best of both?
There is.
And it's a kind of hybrid approach.
You see it in systems like Cassandra with a compound primary key.
How does that work?
So you define a key with multiple parts.
Let's say it's user, timestamp.
The key part is that only the first part of that key gets hashed.
So you hash the user read to figure out which partition it goes to.
Exactly.
So user A's data and user B's data land on different evenly spread nodes.
But, and this is the clever bit, the rest of the key, the timestamp, is used to keep the data sorted within that partition.
Ah, so that's the sweet spot.
Get the load distribution across users, but you can still do efficient range scans to get, say, all of one user's posts ordered by time.
With a single request to a single node, it's a very elegant solution.
But it still doesn't solve the absolute worst case scenario, does it?
The celebrity user, the one ID that gets 90 % of all the traffic.
Hashing doesn't help there.
No, because that one user's ID always hashes to the same single partition.
That's a truly skewed workload, and the solution isn't really in the database anymore.
It's in the application.
So what do you do?
The common fix is to artificially break up that hot key.
You might append a random number to the end of it, say, from 0 to 99.
Okay, so instead of 10 million requests hitting user 0 .42, they're now spread across user 0 .42 .01, user 0 .42 .02, and so on.
You solved the right distribution.
You have.
But you've just created a massive headache for reads.
Now, to get all of user 42's data, your application has to go and query all 100 of those derived keys, and then stitch the results back together.
So you've traded a right bottleneck for a guaranteed complex read.
It's the classic trade -off.
Distribute the pain, or centralize it.
Okay, so we've covered primary keys, but now for the inevitable complication.
Secondary indexes.
If I've partitioned all my users by their ID, how on earth do I efficiently search for all the cars whose color is red?
Right.
The car's color has nothing to do with the user ID partition.
And secondary indexes
are essential for almost any database.
The problem is they really challenge the whole idea of partition isolation.
So how do databases handle this?
There are two main approaches, right?
Two primary approaches, yeah.
Let's start with the one that prioritizes fast and simple rates.
That's document partition indexes, or what people often call local indexes.
Local indexes.
So each partition is its own little world.
Exactly.
Each partition maintains its own index that only covers the documents stored on that machine.
If a car record is written to partition A, only partition A's local index gets updated.
It's simple and fast for writes.
But to find all the red cars, since they could be on any partition, I have to send my query everywhere.
Everywhere.
That's the scatter -gather operation we mentioned.
You have to ask all partitions and then combine the results.
And that's where you get that tail latency amplification problem.
Yes.
Your whole query is only as fast as the absolute slowest machine to respond.
If one note is busy, the user waits.
The average speed doesn't matter, only the slowest one.
That's a pretty dangerous situation for any critical read path.
So what's the alternative?
How do we prioritize read speed?
That brings us to term partition indexes, or global indexes.
Here, you have one big logical index that covers all the data.
But that index is itself partitioned.
But it's partitioned by the term being indexed, not the primary key.
Precisely.
So in our car example, all the entries for color .red would live together in a single index partition, no matter where the actual car documents are stored.
And the read advantage is huge.
If you want all red cars, you make one single targeted request to the red index partition.
No scatter -gather.
It's incredibly efficient.
But if we solved the read problem, we must have made the write problem worse.
That's what I was thinking.
If I update one car and change its color and its make, I could be hitting two or three different index partitions now.
Doesn't that make writes incredibly complex?
It absolutely does.
A single write can turn into a distributed transaction.
You have to update the primary data, delete from the old red index, add to the new blue index.
Managing that atomically is very, very difficult.
So how do real systems handle that?
Often they just don't.
They skip the complexity of distributed transactions and instead make the index updates asynchronous.
Ah.
So the primary data gets written and then the global index just catches up a few moments later.
Right.
Which means you have to accept a period of potential inconsistency.
A client might read the new color from the primary record, but if they immediately search the index, the old color might still be listed.
So there's the trade -off.
Local indexes give you simple, consistent writes, but slow reads.
Global indexes give you fast reads, but complex or potentially asynchronous writes.
Every system has to pick its poison.
That's the crux of it.
Okay.
So we've distributed the data, but clusters are not static.
Nodes fail.
We need to add more capacity.
We have to be able to move data around without taking the whole system offline.
We need to talk about rebalancing.
And rebalancing is deceptively hard.
You have to maintain availability.
You want to move the absolute minimum amount of data and you have to keep the load even the entire time.
And before we get to the good strategies, we have to mention the big mistake again.
Hash mod N.
The cardinal sin of partitioning.
If your logic is based on hash key, mod N, and you add one node to make it N plus one, basically every key's calculation changes.
You end up having to move almost 100 % of your data.
It's operationally impossible.
So what's the first good approach?
It's called the fixed number of partitions strategy.
The idea is simple.
You create way more partitions than you think you'll ever need nodes, say a thousand partitions for a 10 node cluster.
So each node starts with a hundred of these little partitions.
And the key is a record is always assigned to the same partition number forever.
It never changes.
So when you add a new node, node 11, it doesn't recalculate everything.
It just steals a few of those whole partitions from the other nodes.
Only the assignment of the container moves.
This is what systems like react and Elasticsearch do.
That's simple, but it does mean you have to guess your maximum scale at the very beginning.
You do.
That number of partitions is a hard ceiling.
If you don't want that limit, you need dynamic partitioning.
Which you see in key range systems like HBase.
Right.
Here, the partitions just split themselves when they get too big, say over 10 gigabytes, and they can merge if they shrink.
The number of partitions adapts to your data volume.
But the source notes a critical problem there.
An empty database starts with just one partition.
Which is a guaranteed hotspot until enough data flows in to force that first split.
So operationally, you often have to do what's called pre -splitting, manually defining the initial boundaries to avoid that startup bottleneck.
Okay.
Let's talk operations.
Automatic versus manual rebalancing.
Automatic sounds great.
It just handles things.
It sounds great, but it carries a genuinely alarming risk.
It's the risk of a cascading failure.
How does that happen?
Imagine a node is just overloaded.
It's slow, but it's not dead.
If your automatic failure detector misinterprets that slowness as a dead node, it will immediately trigger rebalancing to move load away from it.
But rebalancing itself, copying all that data, creates a massive new load on the cluster.
Precisely.
So you take an already unstable cluster and you flood it with the heavy work rebalancing, which can easily overwhelm the remaining healthy machines and take the entire system down.
That is horrifying.
It really sounds like having a human in the loop, even if it's slower, is often the much safer choice.
Often, yes.
Manual is safer.
Automatic is convenient, but risky.
So we have our data distributed.
We know how to move it.
The final piece is request routing.
When my app wants to read key foo, how does it know which specific machine to talk to?
This is the service discovery problem.
You need an up -to -date, authoritative map of partitions to nodes, and there are three main ways to do it.
The first is where the client can just talk to any node, and that node forwards the request internally.
Right.
That's what Cassandra and React do.
The second uses a centralized routing tier, like a smart load balancer.
MongoDB's mongos process is a perfect example.
And the third is where the client itself is smart enough to know where to go.
The client is partition aware.
But all of these approaches, they need one single source of truth for that partition map.
And that's where something like ZooKeeper comes in.
Exactly.
A lot of systems like HBase and SoulCloud rely on an external coordination service like ZooKeeper to maintain that authoritative map.
All the routers and clients just subscribe to ZooKeeper for updates.
Which is a contrast to the systems like Cassandra that use a gossip protocol.
Right.
They avoid that external dependency by having the nodes just share the cluster state among themselves.
It's another one of those fundamental architectural choices.
Do you want the strong consistency of a central coordinator, or the fault tolerance of a decentralized protocol?
So let's try to pull all this together.
It feels like there are three major decision points here.
I think so.
First is your primary partitioning strategy.
Key range for sorted scans, or hashing for load distribution.
Second, your secondary index strategy.
Local indexes for simple writes, or global indexes for fast reads.
And third, your operational strategy for rebalancing and routing.
How you move data, and whether you'll risk automatic rebalancing, or stick with manual oversight.
At the end of the day, it's all about meticulously managing these tradeoffs.
Speed versus consistency, read efficiency versus write complexity.
That's the design of any scalable system, Partitioning works because it allows each partition to operate independently.
But here's where it gets really interesting.
What happens when an operation must write data to several partitions at once?
Maybe you're updating a user's account and logging the transaction, and that distributed operation fails part way through.
That incomplete failure.
That's the nightmare of distributed consistency, and it's a problem that partitioning makes possible by creating all those independent domains in the first place.
That definitely sounds like a deep dive for another time.
Thank you for sharing this incredible insight into how these massive systems actually work.
This was a necessary look at the mechanics that really keep the digital world running.