Chapter 2: Nearby Friends System Design

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today we're architecting something really big, a massive real -time system from the ground up.

Our mission is to build the backend for a nearby friends feature.

You know the kind where you can opt in and see which of your friends are currently close by on your phone.

And if you're thinking, oh, that's just a proximity service like finding a nearby coffee shop,

you have to stop right there.

This is a completely different beast because the one key concept,

data dynamism.

Data dynamism.

What do you mean by that exactly?

Well, a coffee shop's location is static.

It doesn't move, but user locations.

They're constantly moving targets, updating at a really high frequency.

We're dealing with a real -time problem at an immense scale.

And that dynamism is the core challenge.

It changes everything.

Okay, so let's unpack this and maybe establish some ground rules first.

We need a few core assumptions to even make this solvable.

Right.

First, what does nearby even mean?

For this, we'll define it as five miles, calculated using a simple straight line distance.

But importantly, that number has to be configurable.

Okay, five miles.

Got it.

We also have a requirement to store location history, probably for future machine learning analysis, which means we'll need a separate right path for that.

And maybe the most critical rule for the real -time part,

inactivity.

If a user hasn't reported their location for more than 10 minutes, they're considered inactive.

They had to disappear from everyone's list.

And how do we handle that?

We can use Redis' native time to live or TTL.

We set a 10 -minute timer on their location data.

If they don't update, Redis just poof, auto -purchase it.

It simplifies our cleanup processes immensely.

That's clever.

So those constraints really define the performance envelope.

All right, let's jump straight to the numbers, because the scale is going to dictate every single decision we make here.

We are designing for one billion total users, 100 million daily active users.

And we're assuming about 10 % of those, a massive 10 million people, are concurrent users, all reporting their location.

And to balance, you know, freshness with battery life, each user updates their location every 30 seconds.

Okay, so that's the setup.

And this is where the math starts to get, frankly, a little terrifying.

We have 10 million concurrent users updating every 30 seconds.

That works out to roughly 334 ,000 location updates per second, QPS, just coming into our servers.

That's a huge initial load.

But the system has to do more than just receive the data, right?

Exactly.

It doesn't just stop there.

The conservative assumptions, say an average user has 400 friends, and maybe 10 % of those friends are online and nearby at any given time.

You take those 334 ,000 initial updates, you multiply that by 400 friends, and then by that 10 % who are online.

The result is that our back end needs to forward an astonishing 14 million location updates per second.

14 million.

Wow.

Just to put that number in perspective for everyone, that's like routing a message for every single person in New York and London combined every single second.

That's the monumental challenge.

And that number defines our requirements.

We need extremely low latency, of course.

But while reliability is important, the sources say occasional data loss is actually acceptable.

And thankfully, with location data, eventual consistency is fine.

A delay of a few seconds isn't a catastrophe.

That flexibility is a huge help.

It lets us prioritize high throughput.

So this leads us right into the high level architecture.

You mentioned the tight power budget on mobile devices.

Why don't we just use a peer -to -peer system for this?

Peer -to -peer is a total non -starter in the mobile world.

Mobile connections are just too flaky, and the coordination it requires would absolutely crush a device's battery life.

No, we have to use a shared back end approach.

We centralize all the connection management and the routing.

Okay.

Shared back end it is.

So conceptually, I'm picturing a central load balancer that distributes all the incoming traffic.

And it splits that traffic between two different clusters, right?

Our standard stateless restful API servers, and then the really crucial part, the stateful WebSocket servers.

Exactly.

The restful API servers handle all the usual stuff authentication, managing friendships, user profiles, all the easy -to -scale components.

The real complexity is in the WebSocket or WS servers.

This is a cluster of stateful nodes.

Each one maintains a single, persistent, long -running WebSocket connection for each client.

So that's the dedicated pipe for all the real -time updates.

That's the pipe, and it also handles the client initialization.

And supporting all this is our data layer.

So we'd have a location history database, something write -optimized, maybe Cassandra, sharded by user ID for all that ML data, but the real -time core that relies on the Redis location cache.

And that cache only stores the most recent location for active users.

Its killer feature is that time -to -live, the TTL.

Every time a user reports in, we update the cache, we refresh their 10 -minute TTL.

If they go silent, Redis purges the data.

It's a self -cleaning system.

So we have connection managers, the WS servers, and we have data storage, the cache, and the DB.

But that 14 million update problem, that requires a dedicated routing layer, and that's where the Redis PubSub server comes in.

Precisely.

This cluster acts as a lightweight message bus.

The beauty of Redis PubSub is that channels or topics are incredibly cheap to create.

We can literally afford to give every single active user their own unique channel, which is perfect for this kind of high -volume routing task.

Okay.

The architecture is in place, so let's walk through the exact mechanics of a single location update, the eight -step workflow.

A user's phone sends an update, it hits the load balancer, and gets routed to their dedicated WebSocket server connection.

Right.

The WS server then kicks off a few processes in parallel.

It saves the location to the history DB for the ML folks.

It updates the Redis location cache, which refreshes that critical TTL.

And it also saves the location locally, right in the connection handler variable.

And finally, it publishes that update to the user's unique channel on the Redis PubSub server.

And that publish action, that's the fan -out trigger.

Redis PubSub broadcasts the message to all subscribers.

And in this case, the subscribers are the WebSocket connection handlers for all of that user's online friends.

Yes.

So each friend's WS handler gets the message.

And this right here is the core architectural trade -off that saves us.

The friend's WS handler is the one that computes the straight line distance using the location data it is stored locally.

Wait, hold on.

Why do the calculation there on the friend server, why wouldn't the sender's server do that check before the message even hits PubSub?

That's the critical insight.

Think about it.

If the sender's WS server did the check, it would have to query the location of all 400 friends and do 400 calculations every 30 seconds.

That would create enormous complexity and database pressure on the sending side.

Ah, I see.

So by doing it on the subscriber side, after the fan -out, we avoid all that.

We achieve monumental efficiency.

We let Redis PubSub do its simple, fast job fanning out the message copies.

Then, only after the message is already at the friend's server, we use that cheap local computation to filter the data.

So we're dropping, what, 90 % of the messages right there at the WS server because only about...

Exactly.

And the final step is, only if the distance is within that five -mile radius is the update actually sent down the WebSocket to the friend's mobile device.

This saves a tremendous amount of outbound network and, most importantly, it protects the client's battery.

That filtering step is the real efficiency engine here.

That makes perfect sense.

The cost of a little CPU calculation is nothing compared to waking up the phone's radio.

Okay, moving into scaling challenges.

Let's talk about those stateful components.

The WS servers are stateful, holding millions of connections.

What happens when we need to scale down or replace a node?

Right, because they're stateful, you can't just kill them.

If you want to remove a node, you have to mark it as draining on the load balancer first.

The LB stops sending it new connections, which lets the existing connections close gracefully as users disconnect and reconnect naturally.

It's a non -negotiable process for stateful clusters.

Okay, so that's planned maintenance.

But what about when a user first opens the app?

How does the WS server get them the initial list of nearby friends?

Right, client initialization.

It's a seven -step process handled by the WS server.

It loads the full friend list from the database, checks the Redis cache for the latest location of each friend.

And remember, the TTL means inactive friends are already filtered out.

It does the initial distance checks and then seeds the client with that first list.

And we make an important trade -off here, right?

The server subscribes the user to the PubSub channels for all of their friends, even the ones who are offline.

Correct.

And that might sound wasteful, but since creating a channel in Redis is so lightweight, we accept a tiny bit of extra memory use.

Why?

To avoid the huge architectural complexity of constantly subscribing and unsubscribing every time a friend comes online or goes offline, we just front -load the subscriptions for simplicity.

That's a smart trade -off.

Now, let's talk about the data cache.

With 334 ,000 updates per second, a single Redis instance just won't cut it.

Not even close.

But since each user's location data is independent, the solution is pretty straightforward, sharding the location cache by user ID.

This spreads that massive update load evenly across dozens of Redis servers.

The trade -off is availability.

If one shard goes down, those users might miss updates for a minute or two, but as we said, that's acceptable here.

Okay, and that brings us to the ultimate bottleneck, the one we identified with our math at the very beginning.

The Redis PubSub cluster, tasked with handling that 14 million updates per second forwarding load.

This is the system's Achilles heel.

It is.

If we conservatively estimate a modern Redis server can handle about 100 ,000 pushes per second, we're talking about needing 140 Redis PubSub servers, just for routing.

And our analysis shows the bottleneck here is overwhelmingly CPU usage from all the context switching and network IO.

It's not memory.

Managing 140 servers is hard enough, but these are stateful because of the subscriber list.

That sounds like an operational nightmare.

How do you distribute the user channels effectively?

You have to use consistent hashing, sharding the cluster based on the publisher's user ID.

This maps a user's channel to a specific PubSub server.

And this is so crucial because traditional modulus sharding would be a catastrophe.

When you add or remove a server,

almost everything has to move.

Consistent hashing ensures only a minimal number of channels are impacted.

Right.

And to coordinate all this, so the WS servers know which of the 140 PubSub servers to talk to, you need a service discovery component.

You do.

Something like Etcetera or ZooKeeper to maintain that hashing configuration.

It's the authoritative map.

The WS servers check that map to know where to publish or subscribe.

And this stateful nature is what creates the big operational risks.

If you resize the cluster, the hashing changes and that forces channels to move, which triggers what you call a mass resubscription event.

And that's the danger zone.

Every affected online friend's WS handler has to suddenly check the new config and resubscribe to the channel's new home.

It creates a massive transient load spike.

This is why we almost always over -provision these clusters and only resize during the quietest periods.

Replacing a single failed node is much safer than a full resize.

Let's quickly hit some cases.

Adding or removing a friend sounds simple.

Just a client callback tells the WS server to subscribe or unsubscribe from a channel.

But what about the whale users?

The people with, like,

5 ,000 friends?

Do they create a hot spot on your PubSub server?

Thankfully, no.

It's a great question.

But the load distributes really well.

While the publisher's channel lives on one PubSub server, the subscribers, all 5 ,000 of them, are scattered across hundreds of different WebSocket servers all over the world.

The load is naturally distributed.

Okay, for extra credit, then.

What if we wanted to extend the feature to show nearby random people who aren't friends a discovery feature?

For that, we'd need to use a geospatial indexing system.

The solution here is GeoHash.

GeoHash basically divides the entire Earth's surface into a grid of little boxes, each represented by a short string.

We can then assign a unique PubSub channel to each of these grid boxes.

The workflow changes a bit.

A user updates their location, and their WS handler calculates their current GeoHash ID, and publishes the update to the grid's channel, not their personal user channel.

Precisely.

And the crucial detail for accuracy is that clients subscribe not just to their current GeoHash channel, but also to the eight surrounding GeoHash grids.

This handles all the border cases, making sure you see people right on the edge of your grid.

It guarantees full coverage.

Very clever.

So finally, before we wrap up, is there an alternative to this massive Redis PubSub cluster for routing?

There is.

A powerful, if somewhat niche one.

Erlang OTP.

Erlang SuperPower is its incredibly lightweight process model.

An Erlang process uses only about 300 bytes of memory, and crucially, zero CPU when it's idle.

This means we could actually model all 10 million active users as individual Erlang processes, right inside the application layer.

So wait, instead of an external Redis cluster with all its network latency, Erlang lets the application layer become the message broker itself.

That's the idea.

Subscription and routing become native features of the Erlang runtime.

It could potentially consolidate those 140 Redis servers into a much smaller, more specialized Erlang cluster.

A very different approach.

That is a fascinating alternative.

Okay, to recap this massive design, we solved the 14 million updates per second problem using stateful websockets for connections, a sharded Redis cache with TTL for active users, and a massive distributed Redis PubSub cluster as the core routing engine.

And we managed that complexity with consistent hatching and service discovery, and made the whole thing efficient by doing the distance calculation after the fan out, but before sending anything to the mobile client.

A perfect summary.

So if we connect this back to the bigger picture, what's the key takeaway for someone designing a system like this?

I think the biggest lesson in designing high -scale real -time systems is understanding where the load ultimately shifts.

In this design, we moved the burden almost completely away from slow database reads and concentrated it into an ultra -fast, massive message routing system.

In the end, solving the PubSub bottleneck was solving the entire problem.

A perfect demonstration of how your architectural choices dictate both performance and your biggest operational challenges.

Thank you for joining us for this deep dive into real -time location routing.

We'll see you next time.

β“˜ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Designing a backend system for real-time location sharing among friends requires fundamentally different architectural decisions than handling static geographic data. A nearby friends feature must process location updates from millions of concurrent users, forward proximity notifications instantly to relevant peers, and tolerate occasional data loss while maintaining eventual consistency. At a 10 million concurrent user scale, the system must absorb roughly 334,000 incoming location updates per second while broadcasting approximately 14 million outbound notifications. A stateless layer handles auxiliary operations like profile management and friend relationships through standard REST endpoints, while persistent WebSocket connections enable bidirectional communication between mobile clients and backend infrastructure. Location data flows through multiple storage tiers: a Redis cache with automatic expiration purges inactive users in real-time, while a Cassandra database optimized for high write throughput maintains historical location records partitioned by user identifier. The core routing mechanism leverages Redis Pub/Sub as a lightweight message broker where each user owns a dedicated channel. When a user's location changes, the publishing server announces this update to the corresponding channel, and all online friends subscribed to that channel receive the notification, compute Euclidean distance locally, and deliver only updates within their specified search radius to their clients. CPU utilization, not memory consumption, becomes the critical bottleneck at this message volume, necessitating a distributed Redis cluster orchestrated through service discovery tools that maintain ring-based sharding organized by publisher identity. An alternative architecture employs Erlang on the BEAM virtual machine to model each active user as an independent lightweight process, better suited for handling extreme concurrency than traditional threaded models. The system can extend to support serendipitous discovery features by organizing Pub/Sub channels using geohash-based geographic grids. Load balancers distribute incoming connections across multiple WebSocket servers, while a consistent hash ring ensures scalability by mapping users deterministically across the cluster without requiring complete rebalancing during node changes.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML β™₯