Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome back to The Deep Dive.
Today, we're really getting into the weeds on something fundamental.
We're designing a massive scale chat system from the ground up.
That's right.
And when we say massive, we're talking about 50 million daily active users.
Think WhatsApp, think Instagram DMs.
That is a staggering number.
So our source material today,
it gives us a blueprint for supporting that kind of scale.
It does.
The mission is pretty clear.
We need to support one -on -one chats, small group chats, up to a hundred people.
And we have to keep that chat history forever.
Wow.
And on top of that, we need real -time online presence and full support for multiple devices.
And a key constraint here, right, is that we're dealing mostly with text.
Short messages, less than a hundred thousand characters, but the delivery has to be lightning fast.
Exactly.
That combination of low latency, permanent storage, and, you know, massive horizontal scale.
That's what's going to drive every single decision we talk about today.
Okay.
So let's start at the very beginning.
How do two people, say user A and user B, actually talk to each other?
The first rule is clients never talk directly to each other, ever.
Right.
They both connect to a central chat service, and that service is the middleman.
It receives the message from the sender.
It figures out where the recipient is, and it gets the message to them.
So the sender side, user A sending the message, that feels, well, that feels pretty simple.
It is, relatively speaking.
Because the client is initiating that connection, you can just use good old HTTP.
Okay.
You'd probably use a keep -alive header to maintain the TCP connection, which, you know, just cuts down the overhead of setting up a new connection for every single message.
Makes it pretty efficient.
But the other side of that coin, the receiver side,
that's the hard part.
The server needs to tell user B, hey, you have a message.
But HTTP doesn't really work that way.
No, it's request response.
So historically, we've had a few stabs at solving this.
The first and most, let's say, primitive way was polling.
Where the client just asks the server over and over, anything for me, anything for me, anything for me.
Exactly.
And you can imagine with 50 million users, most of whom are not getting a message every second, it's a colossal waste of resources.
Just millions and millions of no replies every minute.
Precisely.
So the industry evolved.
The next step was long polling.
Okay.
So a bit smarter.
The client asks, but the server doesn't answer right away.
Right.
The server holds that connection open until one of two things happens.
Either a message actually arrives or a timeout is hit, maybe 30 seconds or so.
Then the client immediately sends a new request and the whole cycle starts again.
That sounds better, but I'm guessing at our scale, 50 million DAU, there's still some major drawbacks.
Oh, huge ones.
And they're architectural.
Think about your load balancers.
You've got, say, a hundred stateless servers.
Yep.
Your load balancer is just using round robin, spreading the load.
So the server that gets the message from the sender, let's call it server X, might be totally different from the server holding the receiver's long polling connection, server Y.
And server X has no idea where user B is.
The chain is broken.
The chain is broken instantly.
You have to start adding all this extra complexity to make connections sticky, which kind of defeats the point of having stateless servers in the first place.
Plus the server has a really hard time knowing if a user is truly disconnected, right?
A terrible time.
If your phone goes into a tunnel, the server only finds out when the 30 -second timeout finally hits.
It's just, it's clunky.
So this is where the modern solution comes in.
The thing that solves all these headaches,
WebSocket.
WebSocket is the answer.
It's the standard now for this kind of thing.
It starts out as a normal HTTP request.
Okay.
But it includes this special upgrade header and a handshake.
And that handshake transforms the connection into a persistent bidirectional link.
Bidirectional being the keyword.
The server can push data to the client, can send data to the server.
Exactly.
And that simplifies the whole design.
We use that one single WebSocket connection for both sending and receiving.
And because it runs over standard ports, like 80 or 443, firewalls almost never block it.
It's just robust.
Okay.
So we've got our communication protocol locked in.
Now let's zoom out.
What does the high -level architecture look like?
At this scale, you have to break everything apart.
You can basically group the components into three categories.
Let's hear them.
First, you have your stateless services.
These are your traditional web servers sitting behind a big load balancer.
They handle things like login, user profiles, creating a group.
The normal request response stuff that doesn't need to remember anything session to session.
Precisely.
And a really, really important part of this group is a service called service discovery.
That sounds like an air traffic controller.
That's a great way to think about it.
Its only job is to tell a connecting client, okay, based on your location and our current server load, here is the DNS name of the best chat server for you to connect to.
Got it.
So that brings us to the second category, which must be the stateful part.
Correct.
The chat servers.
This is the only truly stateful piece of the core system because these servers are the ones that actually maintain those persistent WebSocket connections.
And you'd have thousands of these.
Thousands.
And the idea of stickiness.
Once your client connects to a chat server, you stay with that server for as long as possible.
I want to circle back to something you mentioned.
The single server trap.
I mean, technically couldn't a modern server with enough RAM hold say a million connections?
It could.
The math works out to maybe 10 gigs of RAM for a million concurrent connections, which is totally feasible on modern hardware.
But you would never ever do it because it's a catastrophic single point of failure.
If that one super server goes down, you've just knocked a million users offline in an instant.
Reliability is everything.
You have to design for failure, which means lots of smaller redundant servers.
Horizontal scaling from day one.
So stateless services, stateful chat servers.
What else do we need?
A few other key pieces.
You need dedicated presence servers just to manage who was online and who was offline.
You need third party integration with push notification services like Apples and Googles to wake up offline phones.
And of course, the database to store all this history.
The key values soar.
Yes.
And that choice is absolutely critical.
Let's dig into that.
Data persistence.
We're talking potentially 60 billion messages a day.
Why not just use a giant sharded relational database like MySQL?
Relational is great for some things.
Your user profiles, your friends list data with strong relationships.
Perfect.
But for a chat history,
the access patterns are completely different.
How so?
The volume is insane and you have this long tail effect.
Recent chats are accessed constantly, but messages from three years ago are almost never touched.
Sharding a relational database to handle that kind of right volume while also providing low latency access is a nightmare.
It gets incredibly complex and expensive.
So the answer is a NoSQL approach, a key value store like Cassandra or HBase.
Exactly.
They are built for one thing.
Massive, easy horizontal scaling.
You need more capacity.
You just have more nodes.
They provide super low latency for looking up data by a key, which is exactly what we're doing.
We're saying give me the messages for this specific chat.
It's a specialized tool for a specialized job.
Perfectly put.
Okay.
So if we're using a key value store, let's talk about the key itself.
The message ID.
This is surprisingly tricky.
It has to be unique, but it also has to be sortable by time.
Yes.
And you can't just rely on a timestamp.
Why not?
Because in a distributed system, you can easily have two messages created at the exact same millisecond in different parts of the world.
There's no way to guarantee their order if you just use a timestamp, and that breaks the user experience.
And we can't use something like MySQL's auto increment because we're not in SQL.
The source material also says to avoid a big global ID generator like Twitter's Snowflake.
Wow.
Yeah, it's overkill.
It introduces a central bottleneck and a lot of complexity to ensure an ID is unique across the entire system.
And the insight here is you don't need that.
A message ID only needs to be unique and sortable within a single chat.
Ah, I see.
So the solution is a local sequence number generator.
The ID is only unique within its own channel.
Exactly.
It's a pragmatic choice that simplifies everything.
So for a one -on -one chat, the message is the key.
For a group chat, it's a composite key.
The channel plus the message ID.
That's all you need to keep messages in order for the user.
Brilliant.
Okay.
Let's put this all together.
Let's trace a user connecting and sending a message.
What's the first step?
First step is login.
User A's login request hits the load balancer, gets routed to a stateless API server for authentication.
Standard stuff.
Once they're authenticated, that API server makes a call to service discovery.
Service discovery, maybe something like Zookeeper, looks at its list of available chat servers and picks the best one for user A based on geography and load.
And it sends that server's address back to the client.
Exactly.
Then user A's client opens that direct persistent WebSocket connection to its assigned chat server.
Okay.
Connection established.
Now A sends a message to B.
What happens?
A's message travels over the WebSocket to their chat server.
Let's call it server one.
Server one immediately gets a new local message ID.
Then it does two things in parallel.
It writes the message to the key value store for permanent history.
And it also pushes the message into something called the message sync queue.
You can think of this like user B's personal inbox queue.
And then it has to figure out if user B is online.
Right.
It checks.
If B is online, the system knows B is connected to, say, server two.
So server one just forwards the message to server two, which sends it down B's WebSocket instant delivery.
And if B is offline.
Then the system triggers a call to the push notification servers, which sends that little alert to B's phone.
That makes sense.
What about multi -device sync?
I've got my laptop and my phone.
How do they both stay up to date?
This is where that sortable message ID is so elegant.
Each of your devices, your laptop, your phone, they each keep track of one number.
Kermax message ID.
The ID of the last message it's seen.
Exactly.
So when you open the app on your laptop, it just makes a simple request.
Hey, server, give me all messages for me that have an ID greater than my current max ID.
That's it.
Wow.
That is clean.
The client is responsible for knowing what it's missed.
Yep.
It delegates that state to the client, which makes the server's job much, much simpler.
Okay.
Let's scale this up to a small group chat, say 50 people.
How does that change the flow?
The core difference is a concept called fan out on write.
Fan out on write.
The beginning is the same.
User A sends a message.
It gets an ID.
It's stored in the KV store.
But instead of putting that message into one person's sync queue.
We copy it.
You copy it into the sync queue of every single member of that group.
So 50 copies for a 50 person group.
That sounds expensive.
It is a trade off.
It's more storage for sure, but the benefit is huge.
It makes the client side logic incredibly simple.
Your phone only ever has to check one place, its own inbox for new messages, whether they're from a one -on -one chat or a group chat.
And for a group of up to a hundred, that's an acceptable cost.
It's acceptable.
But this is also why the source is clear.
This approach doesn't work for, say, a 100 ,000 person group.
The right amplification would be crippling.
You'd have to switch to a fan out on read model for that.
Makes total sense.
Okay.
Let's shift gears to that little green dot next to a name.
Online presence.
That's handled by the presence servers.
Right.
And tracking login and log out is easy.
The hard part is handling unexpected disconnections.
Like when your phone goes through a tunnel.
Exactly.
If we were naive about it and marked you offline, the second your web socket connection dropped, the user experience would be awful.
Your status would be flickering on and off constantly for your friends.
So you need a grace period.
This is the heartbeat mechanism.
It is.
The client is constantly sending a tiny little heartbeat event to the presence server, maybe every five seconds.
Just a little, I'm still here, ping.
And the server only marks you as offline if it stops seeing those pings for a while.
Exactly.
If it doesn't get a heartbeat for a longer threshold, say 30 seconds, then it decides, okay, this user is actually offline.
It smooths out all the noise from temporary network blips.
So when my status does legitimately change, I log out how do all my friends find out in real time?
This is the online status fan out, right?
And this uses a classic publish -subscribe model, or Pub -Sub.
When your status changes, the presence server publishes that event to a bunch of different channels.
One for each of my friends.
One for each friend.
So if you have friends B and C, there's a channel for you in B and a channel for you in C, B and C.
B and C are subscribed to their respective channels, and they get that update instantly over their web socket.
And again, I'm sensing a theme here.
This works for a typical friends list, but not for a massive group.
Precisely.
You would not fan out a status update to 10 ,000 people.
It's too expensive.
For large groups, you switch the model.
The client has to explicitly ask for the status of members when, for example, the user opens the member list.
It's always about balancing real -time data push versus scalability.
This has been an incredibly detailed breakdown.
We've hit on using web sockets, splitting services into stateless and stateful, why key value stores are essential at this scale, and these clever little tricks like local message IDs and heartbeats.
I think the key takeaway really is understanding those trade -offs.
At 50 million users, you're always choosing horizontal scaling.
You're choosing simple, fast reads and writes over complex relational queries.
And sometimes you're choosing to duplicate data, like with fan out on write, if it makes the whole system simpler and more robust.
We've built a pretty amazing system for text messages.
So here's a final thought to leave you with.
What happens when this system has to handle photos and videos?
Oh, that's a whole other level.
Think about it.
You suddenly need large file upload services, complex compression pipelines,
integration with cloud storage like S3, and a whole new layer of caching on the client side just to handle that volume.
It's a fascinating next step.
It really is.
A deep dive for another day, for sure.
Thanks for digging into these sources with me.
This was great.
Always a pleasure.
We'll catch you next time on the deep dive.