Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Guive.
You know that feeling when you save a file on your desktop and like milliseconds later?
It's just there.
On your phone, your tablet, your work computer.
That magic.
Yeah, that magic.
Well,
that seamless experience hides one of the most complex system design challenges out there.
Building a reliable, highly available cloud storage and synchronization service.
We're talking about the backbone of something like Google Drive.
Exactly.
And that's our mission today, to really tear down the architectural decisions that go into a platform like this.
We're talking about supporting, you know, all the basics.
File upload, download, and that real -time sync.
And notifications, too, for both web and mobile.
Right.
And the constraints are pretty intense.
Any file type up to 10 gigabytes with mandatory encryption for everything at rest.
But the real kicker is the scale, isn't it?
We're designing this for 10 million daily active users, DAU.
And that DAU figure, that immediately dictates what we call the non -functional requirements.
I mean, this can't just be a place to dump files.
We need absolute reliability.
And by that, we mean zero acceptable data loss.
It needs fast sync speed, super efficient bandwidth usage, which is so important for mobile users.
Oh, for sure.
And, of course, massive scalability and high availability.
So let's try to ground this in reality.
Can you give us some quick math to visualize what we're up against?
Yeah, the back of the envelope calculations are, well, they're daunting.
If you assume maybe 50 million total signed up users and you give each one just 10 gigabytes of free space.
Which is pretty modest, really.
Right.
You're already looking at provisioning a staggering 500 petabytes of total storage.
500 petabytes.
And what about the traffic, the queries per second?
Well, if each of those users upload, say, two files a day on average, then during peak hours our system has to handle something like 480 queries per second, just for the upload API alone.
So that level of constant traffic means you need a distributed architecture from day one.
You have to.
You can't fake it at that scale.
OK, so how do you even begin to build something that manages half an exabyte of data?
I guess you don't start distributed.
No, you never do.
Every system starts small.
So picture this.
One single server.
It's running Apache.
It's got a basic MySQL database for all the metadata, user info, file names, where things are stored.
And maybe a one terabyte hard drive bolted on.
Exactly.
One terabyte of local storage for the actual files.
The structure is super simple.
You have a root directory, maybe called drive.
And under that, each user gets their own little folder, their namespace.
But even that tiny scale, you need some pretty sophisticated APIs, right?
Oh, absolutely.
You need an upload API, and that immediately splits into two types.
A simple one for small files and a resumable upload for big ones.
That resumable one is critical, I'd imagine, for big files or if your network drops out.
It's essential.
You don't want a nine gigabyte upload to fail at 99%.
You also need a download API and a get file revisions API to handle version history.
So the single server is humming along until the inevitable happens.
The alert goes off.
You've used 99 % of your terabyte.
You have 10 megabytes left.
You're cooked.
The initial sort of tactical fix is data sharding.
You just spread the storage load across a few more servers based on, say, the user ID.
But that just buys you time.
Yeah.
It creates a much bigger problem.
And much worse problem.
Reliability.
If one of those servers hard drives dies, and it will, you lose all the data for thousands of users, poof, gone.
And that violates our number one rule.
Zero data loss.
Precisely.
So the foundational architectural move here is to decouple the file storage completely.
You move it to a commercial object storage service like Amazon S3.
That's total game changer.
It is, because S3 handles the scalability and, more importantly, the durability for you.
You're basically outsourcing the very hard problem of making sure the bits don't disappear.
And that durability comes from replication.
Can you break down the two types of replication we'd need?
Yeah.
There are two tiers.
First is same -region replication.
This means copies of your data live in different data centers, but all within the same geographic region like the US East.
It protects you from, say, a local power outage.
What about a real catastrophe?
Like an earthquake.
For that, you need cross -region replication.
This puts a redundant copy of your files in a completely separate geographical region like Europe.
It costs more, but it's your ultimate disaster recovery plan.
Okay, so that journey takes us from one little server to this globally replicated, decoupled, high -level design.
We have load balancers, API servers,
a dedicated metadata database, and this separate file storage layer.
And this is where things get really interesting, because the component roles start to get very specialized.
Next, we introduce a new critical layer to handle the file content itself.
The block servers.
Okay.
Tell me about these block servers.
What's their job?
They're not just passing files along to S3.
Not at all.
They are the heavy lifters.
When a client sends a file, a block server intercepts it and immediately splits it into smaller manageable pieces we call blocks.
A good size to think about is around 4 megabytes per block.
And why 4 megabytes?
That sounds very specific.
It's a key trade -off.
If the block size is too big, like 64 megs, a tiny edit forces you to re -upload a huge chunk.
If it's too small, say 64 kilobytes, the metadata overhead of tracking millions of tiny blocks would crush your database.
So 4 megs is the sweet spot.
It's the sweet spot for balancing network use and metadata load.
After splitting, the block server gives each block a unique hash and then uploads those blocks to cloud storage.
And while that's happening, the API servers are handling everything else.
User authentication, profile stuff, and, critically, updating the file metadata in the metadata DB, which itself is backed by a metadata cache for speed.
You also mentioned a notification service.
A pub subsystem.
Yeah.
It's designed to instantly push alerts to clients when a file they care about changes.
We also need an offline backup queue for users who aren't connected, and cold storage like S3 Glacier to archive old data cheaply.
Let's dig into those block server optimizations.
You mentioned bandwidth is a huge concern.
It's a major cost driver.
This is where Delta Sync is so powerful.
If you're editing a 100 megabyte presentation, but you only change the title on one slide.
You don't want to resend the whole hundred megs.
Exactly.
The client just sends the new hashes for the blocks that changed.
The server sees that only say block two and block five are different, and you only transfer those two 4 megabyte blocks.
It makes the sync feel instant.
And the block servers also handle things like compression and encryption before anything hits the cloud, right?
Yes.
There are gatekeepers for efficiency and security.
This brings us to a huge topic, strong consistency.
Why is this so critical for a file sync service?
Because if user A saves a file, user B has to see that update instantly.
It's completely unacceptable for different people to see different versions of the same file at the same time.
That's a catastrophic failure for collaboration.
So how do you guarantee that, especially when you've got high speed caches involved?
This is why we choose a relational database, an RDBMS for our metadata.
It's because they natively support ACD properties, atomicity, consistency, isolation, and durability.
So why is a standard SQL database so much better here than, say, a big NoSQL store like Cassandra?
It comes down to isolation and consistency.
A single save operation might need to update the file version, the content pointers, maybe user permissions all at once.
It has to be a single atomic transaction.
A transaction that either fully succeeds or fully fails.
Exactly.
Getting that kind of transactional integrity from a NoSQL database is incredibly difficult and error -prone to program yourself.
For something as vital as file metadata, you stick with the battle -tested RDBMS.
But even with strong consistency, you can still have sync conflicts, right?
Two users hit save at the exact same moment.
It's inevitable.
We handle it with a simple, reliable rule.
The first version to be fully processed by our system wins.
And the user who lost the race.
Their client gets a conflict notification.
We don't try to automerge.
That's a recipe for disaster.
Instead, the client is told to save the user's local version as a new conflicted copy.
You've probably seen this.
A file named something like Mito -conflicted copy.
The user then has to resolve it manually.
Let's walk a file through this system.
The upload flow is interesting because you mentioned it happens in parallel to feel faster.
Right.
It's split into two paths happening at the same time.
First, the metadata path.
The client immediately tells the API server, hey, I'm uploading this file.
The database marks the file as pending.
And at that exact moment.
The notification service tells all other relevant clients an upload for this file has started.
So they see that little syncing icon right away.
And while that's going on, the actual file content is moving along the content path.
Exactly.
The blocks are going to the block servers, getting processed, and uploaded to cloud storage.
Once the very last block is confirmed safe in storage, cloud storage sends a callback to our API servers.
Which then flips the status in the database from pending to uploaded.
And that triggers the final notification.
The file is now ready for download.
So the download flow is basically that in reverse.
A client get the notification or reconnects and pulls from the offline queue.
And it first requests the new metadata from the API servers.
That metadata tells it exactly which blocks it needs to download.
Then it just asks the block servers for those specific blocks, reconstructs the file locally, and it's done.
Let's talk about that notification service.
You have technologies like WebSockets that offer this persistent two -way connection.
Why did you say the system opts for long polling instead?
It's a resource management decision.
WebSockets are fantastic for things like chat.
Or you have constant, high -volume, bi -directional traffic.
But file sync notifications aren't like that.
Not at all.
They're relatively infrequent.
And the communication is mostly one way from the server to the client.
Maintaining millions of persistent, always -on WebSocket connections is actually a very resource intensive on the server side, even if they're idle.
So long polling is more efficient for this use case.
It is.
The client opens a connection and the server just holds it open, waiting.
As soon as there's a file change, the server sends the response, closing the connection.
The client gets the update and immediately opens a new connection to wait for the next event.
It's much lighter on the system.
OK, let's talk about saving money.
With 500 petabytes of storage plus versioning, optimization must be a top priority.
It's paramount.
First, data deduplication.
Since every 4 -megabyte block has a unique hash, if 10 ,000 users upload the exact same picture, we only store each of its blocks once.
That saves a tremendous amount of physical storage.
That's huge.
What's next?
An intelligent data backup strategy.
You can't keep infinite versions of every file forever.
You have to set limits, maybe keep daily backups for a week, weekly for a month, and so on.
You prioritize recent versions.
And the third strategy is cold storage.
Data that hasn't been touched in months or years gets automatically moved to a cheaper storage tier like S3 Glacier.
It's still there if you need it, but it costs a fraction as much to store.
Last big topic.
Resilience.
In a system this complex, things will fail.
How does it recover?
Every component has a failover plan.
The load balancer has a hot standby.
If a block server dies mid -upload, another one just picks up the job.
And cloud storage is handled by that cross -region replication we talked about.
What about the metadata database?
That seems like a critical point of failure.
It is.
The master database has several slave replicas that are always in sync.
If the master fails, an automated process promotes one of the slaves to become the new master, and all right traffic is redirected to it.
What's the trickiest failure to handle?
It might actually be the notification service.
If one of those long polling servers goes down, you could have millions of open connections just severed.
And all those clients try to reconnect at once.
Exactly.
You get this thundering herd problem.
The system can handle it, but the sheer volume of simultaneous reconnects can definitely slow things down for a few moments.
So if we take a step back, the core achievement here is a system that balances that strict, strong consistency with really efficient bandwidth use through delta sync and fast notifications.
It's a complex balancing act.
And here's a final trade -off for you to think about.
We chose this block server approach.
But an alternative would be to let the client upload files directly to cloud storage.
That sounds faster.
It's only one hop.
It is faster, but it's a huge risk.
You lose all that centralized control.
You mean all the logic for chunking, compression, deduplication?
And most importantly, encryption.
All of that complex, fragile logic would have to be implemented perfectly on every single client, iOS, Android, the web.
It's a recipe for bugs and security holes.
And I imagine client -side encryption is just fundamentally less secure than doing it in a controlled server environment.
Much less secure.
Centralizing that work in the block server is safer and way more reliable.
So what's a potential next step?
How could this system evolve?
A really smart evolution would be to decouple the online -offline logic into its own dedicated service.
You could call it a presence service.
A presence service.
Just to track who's online.
Exactly.
It's only job is to know which clients are currently active and can receive a push notification.
By making that a separate, reusable building block, any other service in the company, a chat app, a metrics tool, could just query it.
Instead of everyone building their own online -offline tracking logic from scratch.
Precisely.
It makes the entire platform more modular and robust.
That's a fascinating place to leave it.
It really shows that designing for this scale means thinking about every detail, from the 4 megabyte block size all the way up to platform -wide service architecture.
Thank you for guiding us through it all.