Chapter 10: Real-time Gaming Leaderboard Design

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive, the shortcut you need to grasp the most complex system design challenges.

We take a stack of dense technical documents and distill the crucial decisions, trade -offs, and the surprising facts so you are instantly well informed.

Today we're embarking on a pretty fascinating deep dive into latency and scale.

We're going to be designing a system from the ground up, a real -time gaming leaderboard for a mobile game with millions of active users.

And when you first hear leaderboard, it sounds straightforward, just a sorted list.

But the moment you introduce those constraints, like real -time updates and massive scale, that simple sorting problem becomes a full -fledged distributed system headache.

It really does.

You need to process score updates instantly, rank competitors globally, and then serve those results back with just minimal latency.

So our mission today is to walk you through the entire blueprint.

We'll start with the fundamental requirements, move through the architecture, and then confront why traditional databases just fail at this.

And ultimately,

we'll tackle the scaling strategies needed to support a global player base.

Let's start with the basics, then, because every design begins with defining the rules.

What are we actually building?

Well, our core rules are simple.

Players earn points for winning a match, and here's the immediate challenge.

We are including all players in the ranking.

All of them.

So if we have 25 million monthly active users, we have to rank all 25 million of them, not a small subset.

OK, so what does the system functionally have to deliver?

You have three primary functional requirements.

The first is the baseline, display the top 10 players globally, simple enough.

The second is crucial for player engagement,

show a specific user's current global rank.

And third, a bonus requirement that complicates things a bit later on.

Now, what's that?

We need to display the players four places above and four places below that user to give them some immediate context.

Ah, OK.

The functional requirements set the stage, but it's the non -functional ones that really dictate the technology.

And we've already mentioned the most important one.

Yes.

The constraint that breaks traditional design.

Real -time updates.

Scores have to update in real -time, and that change has to be reflected instantly on the leaderboard.

Instantly.

We also need high availability, reliability, and immense scalability.

I mean, if the ranking is delayed by even 10 seconds, players will feel cheated, or they'll just think the system is broken.

To really understand the pressure here, we need to run the numbers.

Let's size this thing up.

We're starting with, what, five million daily active users?

Five million DAU, yeah, and 25 million monthly active users, MAU.

And if we average the score updates over 24 hours, it seems kind of manageable.

Maybe 50 users scoring points per second.

But that's totally unrealistic for gaming.

Of course.

Games have massive peaks, evenings, weekends, holidays.

We have to design for the worst -case scenario.

Exactly.

We assume peak load is five times the average.

So that pushes our peak user load scoring points to 250 users per second.

Now let's calculate the QPS, the queries per second, for those score updates.

Right.

Assuming a player averages, say, 10 matches per day, we take those 50 average users per second, multiply by 10 matches, and then multiply by that 5x peak factor.

And that gives us a peak write load of 2 ,500 score updates per second.

Yep, that's the number.

That is a target number the entire architecture has to handle consistently, without failure or latency.

In contrast, fetching the leaderboard is, well, it's pretty lightweight.

We estimate maybe 50 QPS for fetching the top 10.

Because users only check it periodically.

Maybe once when they open the game, so the real pressure is on the writes and that subsequent sorting.

Okay, 2 ,500 score updates per second.

That high -pressure target leads us directly into step two, proposing the architecture and defining the APIs.

We need three fundamental APIs.

First, an internal one used by our servers to post a score.

Right.

POSD V1 schools, which takes the user to the points.

Then two external get APIs for the client.

Get V1 scores to fish the top 10.

And get dv1scores .usered to fetch a specific user's rank.

Let's trace the flow.

A player wins a match.

What happens next?

So the client, the player's device, sends the match results to the game service.

This service running on our server validates the win.

And this is a really crucial security step.

You can't trust the client.

Never.

Once it's validated, the game service calls the leaderboard service to update the score.

So the game service is the authority.

That decision ensures security, right?

The client is never, ever allowed to dictate its own score.

Precisely.

If we allowed the client to set the score, any malicious user could just intercept the network traffic with a man in the middle attack and inflate their points.

And that would destroy the integrity of the whole leaderboard.

Instantly.

Server -side validation is just non -negotiable.

Did we think about putting a message queue, something of Kafka, between the game service and the leaderboard service?

You know, decoupling services is often a best practice.

It was definitely considered.

A message queue would be perfect if we had other downstream consumers, like an analytics platform that needs real -time score data.

Or a push notification service or something?

Right, to alert friends when you climb the ranks.

But since those weren't explicitly in scope for the initial design, we decided to keep the architecture simpler.

A direct sync call.

It introduces some technical debt if we expand, but it simplifies things for now.

Exactly.

It gets the core requirements done.

Okay, that makes sense.

Now we hit the absolute core decision.

The beta store.

This is where most traditional designs just completely fall apart.

Let's start with the simplest approach.

A relational database.

Like MySQL.

Yeah, the initial thought is always to create a simple table,

user -id map to score.

You just update the score when a player wins.

Simple inserts, simple updates, it seems fine.

But what is the fatal flaw when you have 25 million users and you need those real -time rankings?

The flaw is the rank calculation itself.

To find the top 10, or specific users' rank, the database has to execute an order -by -score DSC query across the entire table.

Across all 25 million rows.

Exactly.

And that sort of operation can take tens of seconds, which is way too long to meet our real -time requirement.

Relational databases are just not optimized for this.

Not for maintaining an ordered view during high -volume writes.

So we throw out the generalized tool and bring in the specialized one.

The one designed for this high -performance task, Redis Sorted Sets.

Right.

Redis is an in -memory store, so it delivers that low -latency performance we need.

And the Sorted Set, or ZSAT set, is just tailor -made for leaderboards.

It uniquely associates a member or user ID with a score.

And crucially, it automatically keeps the entire set sorted.

Automatically.

That's the key.

So what's the technical mechanism inside the Sorted Set that lets it maintain that order so efficiently compared to, say, a traditional SQL index?

Well, the ZSAT uses two structures together.

First, a simple hash table maps the user ID directly to the score.

That allows for lightning -fast score lookups.

But the real magic is the second structure, the skip list.

The skip list.

It sounds like that's what's doing the heavy lifting of the real -time sorting.

How does it achieve that level of performance?

So imagine a standard linked list.

To find an entry, you have to look at every single node, one by one.

That's linear time complexity.

Right.

A skip list adds multi -level indexes on top of that.

Think of them like express lanes on a highway.

The topmost index might only contain one node for every 16 nodes at the base level.

So instead of manually traversing every single player, you can use those index levels to jump.

To jump quickly to the correct rank vicinity.

Precisely.

This structure means insertion, removal, and searching, including our rank calculation, all achieve logarithmic time complexity.

Oh, lot.

Which is incredibly fast and predictable no matter if you have one million or one hundred million users.

That's the power of it.

So let's look at the commands.

If I win a point, how does Redis handle that?

We use the ZNCRBY command.

For example, ZNCRBY leaderboard FEVS 2021 win user alpha.

This increments the user score and the skip list instantaneously and efficiently shuffles that entry to its new correct rank.

And fetching the data we need.

The top 10 or a specific user's rank.

Fetching the top 10 is a rev range.

The reverse range command asking for indexes is 0 to 9.

And for that critical requirement of fetching a user's specific rank,

we use 0 of rank.

And both of those are also O login.

Both are O login.

We meet the real time constraint easily.

What about that bonus requirement, fetching the four players above and four below?

That's actually handled pretty simply by just combining the two.

We first use 0 of rank to find the user's position.

Let's say it's index 500.

Okay.

Then we just use 0 of range to pull the range of data from index 496 to 504.

The data structure supports this kind of slicing operation perfectly.

Before we jump into massive scaling, let's just address feasibility.

Our 25 million monthly active users, is that a financial burden on memory?

Not at all, actually.

We estimate each entry requires roughly 26 bytes of storage.

For 25 million users, that's only about 650 megabytes.

That's it.

Yeah.

Even if we factor in the overhead of the hash table and the skip list structures, doubling it, we're still only talking about 1 .3 gigabytes of memory.

So a single modern Redis server can handle our initial scale.

More than enough to handle both the storage and our 2 ,500 QPS load.

That makes the initial design really robust.

Now, step three, the big leagues.

What happens when our game explodes and we need to scale Redis to 500 million DAU, a hundred times the original scale?

Okay.

Now we have a major problem.

We're looking at 65 gigabytes of storage and needing to handle 250 ,000 QPS.

This absolutely necessitates data sharding.

Breaking that single Redis set across multiple servers?

No other way.

And we have two primary sharding strategies for this.

Let's look at fixed partitioning first.

Fixed partitioning shards, the data based on score range.

So for example, server one handles players with scores from one to 100, server two handles one to one to 200, and so on.

Okay.

So the application layer has to handle all the logic.

All of it.

It has to know which shard a user is in and has to move the user to a new shard when their score crosses a boundary.

That sounds like a lot of administrative overhead.

What's the benefit that makes it worth it?

The benefit is huge for retrieval.

The top 10 players will almost certainly live in the highest score range shard.

That means you only have to query one shard to fetch the global top.

Ah, I see.

And fetching a specific user's global rank is also easier.

Relatively straightforward.

Yeah.

You find their local rank within their shard, then you just add the total player counts of all the higher scoring shards above it.

Okay.

Now for the alternative.

Hash partitioning, which is typically implemented using a Redis cluster.

Right.

Redis cluster automates this process using something called hash slots.

It maps the user ID key to one of over 16 ,000 slots and then distributes those slots across the cluster nodes.

It's great for scaling reads and writes linearly.

Because the load is evenly distributed.

Exactly.

But what is the major operational trade -off for leaderboards?

The problem comes right back to that global rank.

Since the data is distributed based on user ID, which is random, the top 10 players are scattered across maybe hundreds of different shards.

To get the global top 10, you have to use a scatter -gather approach.

Okay.

Describe that scatter -gather process and the latency hit we take.

You send a request to every single shard in the cluster.

That's the scatter asking for its local top 10.

Then the application server receives all those lists back, potentially hundreds of them, and has to sort the combined results to find the true global ranking.

That's the gather.

That's manageable for the top 10, maybe.

But determining a specific user's global rank sounds like a nightmare.

You'd have to do that whole scatter -gather every time.

Every single time.

It adds significant read latency and a whole lot of complexity.

It sounds like for a leaderboard where specific rank retrieval is a core requirement, the administrative headache of fixed partitioning might actually be better than the latency of constant scatter -gather in a hash partition cluster.

That is the crucial trade -off.

For performance, fixed partitioning is often the preferred path for leaderboards, despite the complexity of managing those score boundaries as the game evolves.

We also looked at a complete alternative to Redis, a NoSQL solution using DynamoDB as the example.

How would we manage real -time ranking there?

Well, since DynamoDB is optimized for high writes, we'd need a specialized indexing technique.

We'd use a global secondary index, a GSI.

You could think of a GSI as a completely separate, pre -sorted table that DynamoDB maintains for you, optimized just for querying based on score.

And to handle our 2 ,500 QPS write load, how do we stop that single monthly leaderboard from becoming a hot partition that just overwhelms one server?

We'd have to implement write sharding.

We don't just use the month as the partition key.

We'd engineer a composite key, something like game name hashtag your month hashtag partition number.

We'd deliberately split the data across end partitions.

By appending a random partition number to the key?

Right, forcing parallel writes across multiple nodes.

So we've successfully prevented the hot partition, but now we're back to the exact same problem we had with Redis Cluster.

Exactly.

Since the data is split across end partitions, finding the global top 10 still requires a scatter -gather operation across all of them.

It just shows that scaling a global rank is fundamentally a scatter -gather problem unless you manually manage the score ranges, like in fixed partitioning.

Before we wrap, let's cover two essential topics, deployment and failure recovery.

For deployment, we had the choice between managing our own setup or going serverless.

If we were building this today, the serverless approach using Amazon API Gateway triggering AWS Lambda functions is highly recommended.

Because it auto scales.

The Lambda functions automatically scale to meet our 2500 QPS peak traffic without any manual provisioning.

The Lambda just calls Redis for rank data and MySQL for the persistent user details.

And now the part everyone overlooks until disaster strikes, system failure recovery.

If our Redis Cluster crashes or loses its in -memory data, how do we rebuild?

This is exactly why we didn't completely get rid of a traditional database.

We maintain a persistent MySQL or PostgreSQL database, which records every single score update event with a timestamp.

This is our source of truth.

Our audit log.

If Redis fails, we perform an offline recovery.

We read through all those persistent game win entries in order and sequentially call the zincrby command for every single win to perfectly recreate the state of the sorted set.

That's such a critical link.

The slower persistent database ensures the system is recoverable, even if the fast ephemeral Redis cache is lost.

Okay, let's finish with two final advanced topics.

Optimization for display and tie breaking.

For optimization, right, displaying the top 10 requires more than just IDs and scores.

We need profile pictures, names, all that.

To avoid constant slow lookups to the MySQL database for this display data, we should maintain a separate, simple Redis hash.

This just maps the user ID to the full user profile object, ensuring minimal latency when you're rendering the leaderboard screen.

And finally, what about ties?

How do we break a tie when two users have the exact same score?

We enforce the rule that the user who achieved that score first ranks higher.

We accomplish this by storing the timestamp of the most recent win alongside the score.

In Redis sorted sets, the tiebreaker is often based on the member name.

But using timestamps ensures the user with the older or earlier timestamp ranks higher.

It rewards chronology.

So we've covered the complete journey, starting from a high pressure 2500 score updates per second requirement.

We rejected MySQL because it just can't scale for real time ranking.

We adopted Redis sorted sets thanks to the olog and efficiency of the skip list.

And ultimately, we grappled with those complex tradeoffs between fixed partitioning and hash partitioning when scaling to hundreds of millions of users.

The high level takeaway is really that the real time requirement forces you away from general purpose tools and toward these specialized data structures.

The power of Redis sorted sets to maintain order during writes while giving you fast rank lookups makes this entire design feasible.

But when you shard, you're trading off administrative simplicity for retrieval complexity.

Especially when checking a specific user's global rank.

That's the main pain point.

It really makes you appreciate that even the simplest feature in a mobile game is backed by a highly optimized distributed system built on some very carefully chosen tradeoffs.

Now, here's something for you to think about.

The efficiency of the Redis sorted set is perfect for ranking based on a single score.

Think about what other common game features like dynamically tracking achievements or ranking friend lists could leverage that same low latency ordered structure without forcing you to manually sort data every time.

It's a fascinating area of design and a really solid foundation for tackling any low latency system.

Thank you for diving in with us.

We'll get you next time on the deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Designing a real-time gaming leaderboard for millions of concurrent users requires careful architectural decisions that balance scalability, latency, and consistency. The system must satisfy multiple functional demands including displaying top-ranked players, retrieving an individual user's rank position, and processing continuous score updates without delay. At scale, with five million daily active users generating approximately 2,500 score update requests per second at peak hours, traditional relational database approaches quickly become untenable because rank computation demands expensive sorting operations across massive datasets, resulting in unacceptable query latencies. Redis emerges as the optimal solution, specifically through its sorted set data structure which maintains leaderboard information in sorted order while providing logarithmic-time insertion and ranking operations. Internally, Redis sorted sets combine hash tables for element lookup with skip lists that enable efficient range queries, allowing operations like ZINCRBY for score increments, ZREVRANGE for top-player retrieval, and ZREVRANK for rank lookups to execute with consistent performance. The architectural design separates concerns between a Game Service responsible for validating wins and a Leaderboard Service that applies updates through asynchronous internal API calls, preventing client-side score manipulation attacks. For infrastructure deployment, choices range from self-managed MySQL and Redis instances to cloud-native serverless options using AWS Lambda and API Gateway, which provide automatic scaling without operational overhead. As systems grow beyond initial scale constraints, data sharding becomes necessary, introducing two partition strategies with distinct tradeoffs: fixed partition by score range enables efficient rank retrieval but requires careful rebalancing, while hash partition using Redis Cluster distributes load uniformly across 16384 slots at the cost of requiring scatter-gather operations to compute global top rankings. Alternative approaches employ NoSQL databases like Amazon DynamoDB with Global Secondary Indexes and write sharding techniques to avoid partition hotspots on frequently updated leaderboard entries. Secondary concerns including tie-breaking and failure recovery demand additional mechanisms such as storing recent win timestamps to establish consistent ordering and maintaining historical snapshots in persistent storage for offline leaderboard reconstruction.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 10: Real-time Gaming Leaderboard Design

Related Chapters