Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Dive, where we extract the most critical insights from the dense world of engineering and research, giving you the expert shortcut.
Today we are undertaking a really foundational deep dive.
We're looking at the design concerns that govern pretty much every modern application, the ones that handle massive amounts of data.
To kind of frame this, I want to start with a quote from the computing pioneer Alan Kay.
Oh, I think I know the one you're thinking of, about the internet.
That's the one.
He said, the internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made.
It's a perfect quote for this, because that's the goal, right?
An abstraction so good it just feels invisible.
Exactly.
And we're looking at the core data systems that are supposed to work so reliably that when they do fail, it feels like a natural disaster.
And when we say modern apps, we're talking specifically about data-intensive applications.
Right.
And that's a key distinction.
The bottleneck here isn't your raw CPU speed anymore.
It's the sheer volume of data, its complexity and how fast that data is constantly changing.
So what are the building blocks we're talking about?
You've got databases to store data permanently, caches to speed up reads, search indexes to, well, search and filter things.
And then the more asynchronous stuff.
Exactly.
Stream processing for messaging and queues, and then batch processing, you know, for crunching huge amounts of data that's piled up over time.
Most developers I know treat these as totally separate tools, but you're saying we should lump them all together as data systems.
Why is that?
It's because the lines are just blurring so much.
A tool that was designed for one thing is now doing another.
Like what?
For instance, you see Redis, which is a cache, being used as a pretty effective message queue.
And on the flip side, a message queue like Apache Kafka now has durability guarantees that make it feel almost like a database.
And the second reason is even more important.
Almost every major application today is a composite data system.
Composite meaning you're stitching things together.
Precisely.
You're using your own application code to glue together a database, maybe a Memcached layer for caching, plus an Elasticsearch index for search.
No single tool does it all.
And that right there is where the developer's job totally changes.
It does.
Because if you're the one stitching these things together, you're not just a feature developer anymore.
You're basically becoming a data system designer.
You are.
And you're suddenly responsible for these huge guarantees, like making sure that what you write to the database actually shows up correctly and consistently in your cache or your search index.
That's not trivial.
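A minimal sketch of that glue code, assuming plain in-memory dictionaries as stand-ins for the database, the cache, and the search index. The class name `CompositeStore` and its methods are hypothetical, chosen only for illustration. The point is that the application, not any single tool, has to keep the three copies consistent on every write.

```python
class CompositeStore:
    """Hypothetical application-level glue around three in-memory stand-ins."""

    def __init__(self):
        self.database = {}      # stand-in for the system of record
        self.cache = {}         # stand-in for a caching layer
        self.search_index = {}  # stand-in for a search index: word -> set of keys

    def write(self, key, text):
        old = self.database.get(key)
        self.database[key] = text        # 1. update the system of record
        self.cache.pop(key, None)        # 2. invalidate any cached copy
        if old is not None:              # 3. drop stale index entries
            for word in old.lower().split():
                self.search_index.get(word, set()).discard(key)
        for word in text.lower().split():  # 4. index the new text
            self.search_index.setdefault(word, set()).add(key)

    def read(self, key):
        if key in self.cache:            # cache hit
            return self.cache[key]
        value = self.database[key]       # cache miss: read through and fill
        self.cache[key] = value
        return value

    def search(self, word):
        return sorted(self.search_index.get(word.lower(), set()))


store = CompositeStore()
store.write("doc1", "hello world")
store.read("doc1")                       # fills the cache
store.write("doc1", "goodbye world")     # must not leave "hello world" cached
print(store.read("doc1"), store.search("hello"), store.search("world"))
```

In a real deployment each step can fail independently, so the sketch shows the responsibility rather than a complete solution.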
So to navigate all that complexity, you have to master these three essential requirements.
The three pillars of system design.
We're going to guide you through them.
Reliability, scalability, and maintainability.
RSM.
RSM.
Let's start with the first one.
Reliability.
At its core, it's pretty simple.
It just means continuing to work correctly even when things go wrong.
But to talk about that, we first have to define what wrong even means.
Okay, yeah.
There's a critical distinction here.
We need to separate a fault from a failure.
So a fault is when one single component messes up: a disk returns bad data, a network cable gets unplugged.
And a failure is when that fault actually causes the whole system to stop providing service to the user.
The whole thing goes down.
So the goal is to be fault tolerant.
Exactly.
Or resilient.
You build systems that can anticipate specific kinds of faults and then you prevent them from cascading into a full-blown failure.
But I guess you have to accept you can't be tolerant of everything.
Oh, absolutely not.
That would be impossibly expensive.
I mean, you're not designing your system to survive a black hole swallowing the planet.
You design for the faults that are likely to happen.
And here's where it gets kind of paradoxical.
To make sure your system is reliable, you actually have to break it on purpose.
Yes.
You're talking about things like the Netflix Chaos Monkey.
Exactly.
A system that just goes around randomly killing processes and virtual machines in your production environment.
It sounds completely insane, but it's brilliant.
Because the most dangerous bugs aren't usually in your main logic.
They're in your error handling code, your recovery code.
By constantly causing faults, you force that resilience machinery to be exercised and tested all the time.
You prove your system is reliable, even when its parts are not.
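As a toy, in-process analogue of that idea (not the Netflix tool itself), here is a sketch that wraps a call so it fails at a configurable rate, which forces the retry path to run. The names `flaky`, `call_with_retry`, and `InjectedFault` are hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised by the fault injector, never by the real operation."""


def flaky(func, failure_rate, rng):
    """Wrap func so that it raises InjectedFault with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """The recovery code under test: retry a zero-argument call on injected faults."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


rng = random.Random(42)  # seeded so a test run is repeatable
ping = flaky(lambda: "pong", failure_rate=0.3, rng=rng)
try:
    print(call_with_retry(ping, attempts=5))
except InjectedFault:
    print("all attempts failed")
```

Running the error-handling path routinely, rather than only during a real incident, is what surfaces bugs in it.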
Okay.
So let's break down the types of faults we're dealing with.
Let's start with the obvious one.
Hardware faults.
Right.
Hard drives crashing, RAM going bad.
You look at the stats for a hard disk, the mean time to failure is something like 10 to 50 years.
Which sounds pretty safe.
For one disk, yeah.
But if you have a cluster with 10,000 disks, statistically you should expect one of them to fail every single day.
Wow.
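The arithmetic behind that estimate can be sketched as follows, assuming disks fail independently at a constant rate of one failure per MTTF. That is a simplification; real failure rates vary with disk age. The function name is hypothetical.

```python
def expected_failures_per_day(num_disks, mttf_years):
    # Assumption: independent disks, constant failure rate of 1 / MTTF each.
    return num_disks / (mttf_years * 365)


for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: about {rate:.2f} failures per day")
# MTTF 10 years: about 2.74 failures per day
# MTTF 50 years: about 0.55 failures per day
```

Across the quoted 10 to 50 year range the result brackets roughly one failure per day.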
So the classic solution was redundancy.
You know, RAID arrays, dual power supplies, that sort of thing.
But that's not really how we think anymore.
What's changed?
The sheer scale of everything.
On cloud platforms like AWS, we now just accept that entire virtual machines can just vanish without warning.
So the responsibility has shifted away from hardware redundancy and towards software fault tolerance.
And that has other benefits too, right?
Like upgrades.
A huge one.
If your software can tolerate a whole node failing, you can do rolling upgrades.
You take one machine down, patch it, bring it back up, and move to the next one.
All with zero system downtime.
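A rough sketch of that loop, assuming each node is a plain dictionary and that `upgrade` and `is_healthy` are hypothetical callbacks supplied by the caller. It returns the smallest number of nodes that were serving at any point, to show that only one machine is ever out of rotation.

```python
def rolling_upgrade(nodes, upgrade, is_healthy):
    """Upgrade nodes one at a time; nodes maps name -> {"version", "in_service"}."""
    min_in_service = len(nodes)
    for name, node in nodes.items():
        node["in_service"] = False                 # take one machine down
        serving = sum(1 for n in nodes.values() if n["in_service"])
        min_in_service = min(min_in_service, serving)
        upgrade(node)                              # patch it
        if not is_healthy(node):
            # Stop here so a bad release cannot spread to the whole fleet.
            raise RuntimeError(f"{name} unhealthy after upgrade; halting rollout")
        node["in_service"] = True                  # bring it back up
    return min_in_service


fleet = {f"node{i}": {"version": 1, "in_service": True} for i in range(3)}


def patch(node):
    node["version"] = 2


print(rolling_upgrade(fleet, patch, lambda node: node["version"] == 2))
```

This only works if the rest of the system tolerates a node being absent, which is the software fault tolerance described above.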
Okay.
So that's hardware.
What about software errors?
The bugs in our own code.
Those are in some ways much scarier.
Because unlike a hardware fault, which is usually random and uncorrelated, a software bug is systematic.
Meaning if it happens on one machine?
It's going to happen on every machine that gets the same bad input.
A single bug can take down your entire fleet at once.
Think of things like the leap second bug in the Linux kernel a few years back.
Or cascading failures, where one slow service brings down another, which brings down another.
A domino effect?
A total collapse.
And then there's the biggest cause of outages, which might surprise some people.
Let me guess.
It's us.
Human error.
It's always us.
Engineers with the best intentions making a configuration mistake.
It's the number one cause of major outages.
So how do you mitigate us?
You make it hard to do the wrong thing and very easy to recover.
That means building really good abstractions and APIs that guide people.
It means having great sandbox environments where you can test things safely.
And when things do go wrong?
You need to be fast.
Fast rollbacks are key.
But most important is visibility.
Great monitoring, great telemetry: metrics, error rates.
You need to see the smoke before it becomes a five-alarm fire.
Okay, so that's reliability.
Let's move to pillar number two.
Scalability.
This is all about handling increased load.
Right.
And you should never, ever ask, is this system scalable?
That's a meaningless question.
The right question is?
The right question is, if our system load doubles in this specific way, what's our plan to cope with that growth?
It has to be specific.
And to be specific, you first have to describe the load itself.
You need load parameters.
You have to quantify it.
Is it requests per second?
Is it the read-to-write ratio in your database?
The number of active users?
Your architecture depends entirely on what you're optimizing for.
There's a great example of this from Twitter's home timeline design back in, I think, 2012.
Classic.
The raw number of tweets, something like 12,000 writes per second at peak, wasn't actually the big problem.
The real challenge was something called fan-out.
The fan-out, yeah.
So their first attempt, let's call it method one, was a read-heavy approach.
When you wanted to see your timeline, their system would go find everyone you follow, grab all their recent tweets, and merge them together for you on the fly.
Sounds simple enough.
Simple on the write side, sure.
But the read side was a disaster.
At 300,000 timeline requests per second, the database just completely fell over.
You can't do that much complex querying and merging constantly.
So they flipped the model completely, method two.
They flipped it.
It became a write-heavy approach.
When someone posted a tweet, the system would immediately go and push that tweet into a dedicated cache like a little mailbox for every single one of their followers.
So reading my timeline just means pulling from my personal mailbox.
Super cheap.
Super cheap to read.
But the write side became a scaling nightmare of its own.
That 4,600 tweets per second average suddenly became 345,000 writes per second into all those caches.
And then a celebrity tweets.
And it explodes.
One tweet from someone with 30 million followers meant 30 million write operations.
Instantly.
The cost of a single write became astronomical.
So what was the solution?
A hybrid.
For most people, they used method two, the fan-out on write.
But for the celebrities, they didn't.
Tweets from those huge accounts were handled the old way: fetched and merged in when someone read their timeline.
It was a pragmatic mix.
It's a perfect example of how load parameters define everything.
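The hybrid can be sketched in a few lines, assuming in-memory structures and a hypothetical follower threshold for who counts as a celebrity. The class `Timelines` and the constant `CELEBRITY_THRESHOLD` are illustrative names, not anything from Twitter's actual system. The two quoted rates also imply the average fan-out per tweet, which the last lines compute.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real one would be far larger


class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.tweets = defaultdict(list)     # author -> [(seq, text)]
        self.mailboxes = defaultdict(list)  # user -> [(seq, author, text)]
        self.seq = 0                        # global ordering of tweets

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self.seq += 1
        self.tweets[author].append((self.seq, text))
        if not self.is_celebrity(author):
            # Method two: fan out on write, one mailbox write per follower.
            for user in self.followers[author]:
                self.mailboxes[user].append((self.seq, author, text))

    def home_timeline(self, user):
        # Cheap part: everything already delivered to this user's mailbox.
        entries = {seq: (author, text) for seq, author, text in self.mailboxes[user]}
        # Method one, for celebrities only: fetch and merge at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.tweets[author]:
                    entries[seq] = (author, text)
        return [entries[seq] for seq in sorted(entries, reverse=True)]


# Average deliveries per tweet implied by the two rates quoted above.
average_fan_out = 345_000 / 4_600
print(average_fan_out)  # 75.0
```

Keying the merged entries by sequence number avoids duplicates if an account crosses the threshold after some of its tweets were already fanned out.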
Now, once we've defined load, we need to talk about performance.
Right.
For batch jobs, we usually care about throughput: records per second, that sort of thing.
For online systems, it's all about response time.
And we should probably clarify some terms here.
Response time isn't the same as latency.
Good point.
Latency is just the time a request spends waiting to be handled, while it's latent, awaiting service.
Response time is what the client actually sees.
It's the time your service spends actively working on the request plus all the network travel time, queuing delays, everything.
And it's never a single number, right?
It's a distribution.
Using just the average response time is a terrible idea.
It's basically lying to yourself.
This is where you have to use percentiles.
So instead of the average, we look at things like the 95th percentile or P95.
Or P99 or even P99.9.
These are your tail latencies.
If your P99 is 500 milliseconds, it means 99% of requests are faster than that.
But one out of every hundred is slower.
And that one user is having a bad experience.
And often, that one user is your most important customer.
Amazon found that out.
They focus heavily on the 99.9th percentile.
Because the users hitting those slowest response times are often their highest-volume, most valuable customers.
They even quantified it.
What was the number?
A hundred-millisecond increase in response time cuts sales by 1%.
It's huge.
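To make the gap between the average and the percentiles concrete, here is a small sketch using the nearest-rank definition of a percentile (one of several common definitions) and made-up sample data. The function name and the numbers are assumptions for illustration only.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 95 fast requests, 4 slow ones, and 1 very slow one.
response_times_ms = [20] * 95 + [200] * 4 + [1500]

mean_ms = sum(response_times_ms) / len(response_times_ms)
print("mean :", mean_ms)                                # 42.0
print("p50  :", percentile(response_times_ms, 50))      # 20
print("p95  :", percentile(response_times_ms, 95))      # 20
print("p99  :", percentile(response_times_ms, 99))      # 200
print("p99.9:", percentile(response_times_ms, 99.9))    # 1500
```

The mean of 42 ms describes no request that actually happened, while the high percentiles expose the slow ones directly.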
And those tail latencies can get amplified, right?
Oh, massively.
It's called tail latency amplification.
If your user's request has to make, say, 10 parallel calls to different backend services, your user has to wait for the absolute slowest of those 10 calls to finish.
So even if each service is individually fast most of the time, the chance of at least one of them being slow on any given request goes way up.
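The effect can be estimated with one line of probability, assuming the backend calls are slow independently of each other, which is an idealization. The function name is hypothetical.

```python
def chance_of_slow_request(per_call_slow_probability, parallel_calls):
    # Assumption: each call is slow independently of the others.
    # The request is fast only if every one of the parallel calls is fast.
    return 1 - (1 - per_call_slow_probability) ** parallel_calls


# Each backend is slower than its own P99 on 1% of calls.
for calls in (1, 10, 100):
    print(calls, round(chance_of_slow_request(0.01, calls), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

With 10 parallel calls, almost one user request in ten waits on at least one slow backend, even though each backend is slow only 1% of the time.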
So how do we actually implement scaling?
We talk about scaling up versus scaling out.
Right.
Scaling up is vertical scaling.
Yeah.
You just buy a bigger, more powerful machine.
Scaling out is horizontal.
You distribute the load across lots of smaller, cheaper machines.
And most big systems have to scale out.
They do.
And you want to do it elastically.
Elasticity is where you automatically add or remove resources based on the current load, not just manually scaling up for a big event.
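A minimal sketch of an elastic sizing rule, assuming load is measured in requests per second and every node has the same capacity. The function, its parameters, and the limits are hypothetical, and a real autoscaler would also smooth the signal to avoid flapping.

```python
import math


def desired_nodes(current_load_rps, capacity_per_node_rps, min_nodes=1, max_nodes=100):
    """How many nodes the current load calls for, clamped to configured bounds."""
    needed = math.ceil(current_load_rps / capacity_per_node_rps)
    return max(min_nodes, min(max_nodes, needed))


print(desired_nodes(2_500, 1_000))   # 3: scale out as load grows
print(desired_nodes(300, 1_000))     # 1: scale back in when load drops
```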
So what's the big takeaway on scalability?
That there is no magic scaling sauce.
There's no one-size-fits-all answer.
Your architecture has to be custom built for your specific load parameters.
Which brings us to our final pillar.
Maintainability.
Making our own lives easier in the future.
And this one is so often overlooked.
The vast majority of the cost of software isn't in the initial build.
It's in the years of maintenance, bug fixing, and adapting it to new requirements.
So the goal here is to avoid creating a legacy system that everyone is afraid to touch.
And we do that with three design principles.
The first is operability.
Basically making life easy for the operations team.
What does that look like?
It means the system gives you great visibility through monitoring.
It's easy to automate.
It doesn't depend on any one specific machine being alive.
It behaves predictably and has good documentation.
You shouldn't need to page the original developer at 3 a.m.
Okay.
Second principle.
Simplicity.
Which is all about managing complexity so your project doesn't just decay into a big ball of mud.
And there's a key distinction here between two types of complexity.
Yes.
There's inherent complexity, which is just part of the problem you're trying to solve.
But then there's accidental complexity, which is the complexity we create ourselves through bad implementation choices.
And the best weapon against that is?
Abstraction.
A good abstraction hides an enormous amount of implementation detail behind a clean, simple interface.
Think about SQL.
It hides all the complexity of how data is stored on disk, how concurrency is managed, how crash recovery works.
It just lets you focus on your data.
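As a small illustration of that abstraction, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The table and its contents are made up. The query says what result is wanted and nothing about pages on disk, locking, or recovery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
conn.execute("CREATE TABLE tweets (author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("alice", "hello"), ("bob", "hi"), ("alice", "again")],
)

# Declarative: describe the result, let the engine decide how to compute it.
rows = conn.execute(
    "SELECT author, COUNT(*) FROM tweets GROUP BY author ORDER BY author"
).fetchall()
conn.close()

print(rows)  # [('alice', 2), ('bob', 1)]
```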
And the final principle is evolvability.
Which is just a fancy word for agility at the system level.
How easy is it to change the system when requirements change?
How easy is it to adapt to things you never anticipated?
And this ties right back to simplicity.
Simple systems are just easier to evolve.
Okay.
Let's wrap this up.
A quick recap.
The three pillars you have to get right for any data system.
Reliability, which is about coping with faults.
Scalability, your plan for coping with increased load.
And maintainability, how you cope with constant change and make life easier for your future self and your team.
And remember, as soon as you start stitching together a database, a cache, a search index, you're not just an application developer anymore.
You are a system designer.
You're the one responsible for making sure RSM is guaranteed across that entire composite system.
Which leaves us with a final provocative thought.
If we know that good abstractions are the key to simplicity, but they are so incredibly hard to find in distributed systems, what is the trade-off we're making?
What do we lose when we constantly choose to stitch together many general-purpose tools instead of maybe trying to build one single solution that's perfectly tailored to our very specific application needs?
Something to think about.