Chapter 4: Design a Rate Limiter

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are undertaking a really critical deep dive into something that's at the very foundation of system stability, especially now in the API economy.

Rate limiting.

I mean, if you're building any application that has to interact with users or other services at scale,

it's just not optional anymore, it's mandatory.

We're looking at how these massive distributed systems, the Twitters, the Googles, the Amazons of the world, how they manage this just huge, unpredictable flood of requests coming in every single second.

That's right.

And when you strip away all the complexity, a rate limiter is really just a sophisticated traffic controller.

Its whole mission is to define and enforce a specific threshold, say five requests per second for a user or maybe 10 login attempts per minute from one IP.

And if the flow exceeds that threshold, the extra calls just get blocked.

So let's establish that mission right away.

You outlined, I think, three core reasons why a rate limiter is basically the first line of defense in any modern system design.

What are they?

Yeah, they really boil down to defense, cost, and stability.

Okay, defense first.

So first, defense.

We have to prevent resource starvation, and that can be caused by a malicious denial of service or DOS attack.

Or it could just be an accidental flood of traffic from, say, a buggy client.

Rate limits, just make sure those core services stay available for everyone else.

And that second point, it really hits the business side, which I think a lot of people forget in these technical discussions.

Absolutely.

Cost management is huge.

I mean, if your app relies on paid third -party APIs, like payment processing or complex AI services, you get charged per call.

Per call, yeah.

An unrestricted client could easily rack up thousands, even hundreds of thousands of dollars in fees.

The rate limiter is basically your budget protector.

And finally, it's about preventing the system from just collapsing on itself.

That's the stability goal.

Precisely.

It keeps your own servers from buckling under the load, from misbehaving bots or just weird user behavior.

It ensures that the really high -priority APIs, like, say, placing an order, always have the resources they need.

Even if that same user is just spamming the Like button 100 times.

Exactly.

So for this deep dive,

our mission is to design a highly available server -side API rate limiter.

It's got to work in a distributed environment for a massive user base.

And the rules need to be super flexible, IP address, user ID, you name it.

And the technical demands are?

Yeah.

They're really stringent.

I mean, this is where the difficulty lies.

The system has to have super low latency.

The check itself can't slow down the HTTP response time.

Can't be the bottleneck.

Can't be the bottleneck.

Yeah.

It has to be memory -efficient, fault -tolerant.

Yeah.

So if your cache server fails, the whole system doesn't just grind through a halt.

And crucially, it must work across hundreds of servers.

Okay, so let's talk placement.

Where do we actually put this gatekeeper?

Well, we can discard the client -side immediately.

Right.

Any implementation on the client is just.

It's inherently unreliable.

Requests can be forged.

They can be manipulated.

It has to be server -side.

On the server -side.

The standard pattern is to place it as a middleware component, or more commonly these days with microservices, you'd find it inside an API gateway.

So the moment of truth happens right there at the gate.

Walk us through that request flow.

So the middleware intercepts every single incoming request.

It checks the counter.

And if the client is within their limit, let's say it's two requests per second, the request just proceeds to the API servers.

But what if they send a third?

That third request gets intercepted and rejected immediately.

It returns the standard HTTP status code 429, which just means too many requests.

Okay, so this brings up a huge architectural trade -off.

Do we build this ourselves from scratch, or do we use an existing feature in an API gateway, like something from AWS or Kong?

Yeah, that really depends on your existing tech stack and your engineering resources, frankly.

If you're already using an API gateway for routing and authentication, just adding rate limiting there is often the fastest way to go.

But if your business needs some really complex proprietary logic -like, maybe you prioritize paying users over free users.

Building your own custom solution gives you that granular control.

Just remember that custom solutions mean a lot of ongoing maintenance.

Okay, now let's get to the heart of it, the algorithms.

These are the engines that do the counting.

We've got five key players starting with the token bucket, which is favored by Amazon and Stripe, right?

Yeah, the token bucket is pretty intuitive.

Just picture a bucket with a fixed capacity.

Tokens are added to this bucket at a fixed refill rate.

Okay.

Every request that comes in needs to grab one token to proceed.

If the bucket's empty, well, the request is dropped.

And what makes it so popular is its ability to handle bursts of traffic, right?

So if the system's been quiet, the bucket is full and a user can suddenly just blast through all those saved up tokens at once.

Exactly.

It manages smoothness with the refill rate, but it gracefully handles short -term bursts, which is super common during a log -in or a flash sale.

But there's a catch.

There is.

The challenge is parameter tuning.

You have two variables, the bucket size and the refill rate.

And tuning them can be extremely difficult.

If the bucket's too big, you risk overwhelming your system.

If it's too small, you start dropping valid traffic.

And we also have to remember that a single user can have multiple different limits on them at the same time.

Oh, absolutely.

That's the implementation reality.

One user might have a limit of one post per second and 150 connection requests per day and five likes per second.

That one user needs three totally separate independent token buckets.

Wow.

Okay, so moving on from that, we have the leaking bucket.

Similar name, but it serves a very different purpose.

It's almost the conceptual opposite.

The leaking bucket prioritizes smoothing out traffic over allowing bursts.

It acts like a fixed size queue.

Requests are added to the bucket.

And if that queue is full, new requests are just dropped.

So the leak is the outflow rate.

Exactly.

Requests are pulled out of that queue and processed at a constant fixed interval.

The output rate is totally stable, which is great for backend systems that prefer a steady diet of requests.

But it sounds like there's a downside.

A pretty significant one.

A burst of traffic can fill that queue with older stale requests.

So a new, maybe critical request might get stuck behind all that old traffic or just dropped entirely.

It can feel pretty unfair.

Okay, let's pivot to the counter -based methods.

These are more about accuracy over time.

And first up is the simple, but I think highly flawed fixed window counter.

Yeah, it's the simplest idea.

You just divide time into fixed windows, say one minute, and you count all the requests in that window.

If the counter hits the threshold, you just drop requests until the next minute starts.

It's super easy to implement, uses almost no memory.

But this is where production reality bites.

What is the critical flaw here that makes this so risky to use at scale?

It's the edge spike problem.

The edge spike.

Imagine your limit is five requests per minute.

A user could send five requests right at, say, 2 .00 .59.

At the very, very end of the window.

Right at the end.

Then the counter resets at exactly 2 .01 .00, and they can immediately send five more requests.

So over that rolling 60 -segment period, they've actually pushed through 10 requests,

double their quota.

I see.

And that could just hammer your servers right at that window boundary.

It absolutely can.

So that problem leads us directly to the fix, the sliding window log.

This is the gold standard for accuracy.

Instead of just incrementing a counter, we store a timestamp for every single request that comes in.

Every single one.

Every single one.

Typically in a high -speed cache, using something like red assorted sets, when a new request arrives, the system first looks at the log.

It removes any timestamp that's now outside the current rolling window, and then it counts what's left.

So it guarantees perfect accuracy across any rolling window, but if you're logging every single request, the memory consumption has to be terrifying.

It is.

That's the primary trade -off.

For an API getting millions of requests per minute, storing a timestamp for every one of them becomes this massive memory burden.

It's expensive, and it's slower.

So system architects needed something that was accurate, like the log, but efficient, like the counter.

And that's where the hybrid comes in, the sliding window counter.

This one is a really beautiful compromise.

It uses fixed windows, but it approximates the count to smooth out that edge spike,

all while saving memory.

Okay, so how does it do that?

Let's say your limit is 100 requests per minute.

A request arrives 30 seconds into the current minute.

So we're 50 % of the way through the window.

Now we need to figure out how much traffic from the previous minute is still relevant to this current 60 -second rolling window.

Precisely.

The calculation approximates this.

You take the full count of requests in the current minute, and then you add a percentage of the count from the previous minute.

The overlap percentage.

The overlap percentage, exactly.

The formula is basically current window count plus the previous window count times that overlap percentage.

It sounds a little complex, but it delivers the efficiency of the counter.

Is that approximation reliable in the real world?

Yeah, studies, including some from Cloudflare, have shown it works really well, because traffic is usually distributed smoothly enough that the error rate is tiny.

You get the stability you need without the catastrophic memory cost.

This raises a big structural question.

We're talking about all these counters and logs.

Where do we actually store this data?

Why not just use a traditional database?

A database is just fundamentally too slow for this.

You need something that can respond in milliseconds for every single request.

That's why an in -memory cache, like Redis, is the industry standard.

Yes, it's fast.

It's fast, and crucially, it supports time -based expiration.

We use two key commands,

incr to increment the counter, and expire to automatically set a time to live.

That ensures the counter gets deleted at the exact moment the window rolls over.

And the rules for these counters can get incredibly detailed, right?

We're not just limiting by user ID.

Oh, no.

The rule set defines the granularity.

You might have a rule for a domain, like messaging, and then a key, like message -type .marketing, with a limit of five per day.

Or you could limit attempts on an endpoint, like off .login, to five per minute.

These rules are usually stored in a config file or a high -speed cache so you can update them easily.

Now, when a client is blocked, they need to know what to do.

The response headers become absolutely vital here, right?

It's more than just that 429 status code.

Yeah, this is so important for good client behavior.

The 429 response has to include specific headers.

There's X rate limit limit, which is the total quota, X rate limit remaining, how many calls they have left, and most importantly, X rate limit retry after.

And that last one tells them how long to wait.

Exactly.

It tells the client exactly how many seconds they need to wait before trying again.

If a client ignores this, they can create what's called the thundering herd, just hammering the server over and over, making everything worse.

Okay, let's talk about the big picture challenge.

All this is fine for one server, but when you have a distributed environment with hundreds of servers, you run into two massive problems, race conditions and synchronization.

It's fascinating how microseconds can just break your logic.

The race condition happens when you do that three -step operation.

One, read the counter.

Two, check the threshold.

Three, increment the counter.

So if the counter is at, say, three, and two concurrent requests read three at the exact same instant, they both decide to proceed and then they both write four back.

You've just lost a count and let an extra request through.

Exactly.

And if you try to use traditional locks to prevent this, you'd slow the whole system down.

It would kill your latency.

That's why we rely on atomic operations.

How does that work?

We use tools like Redis Lua scripts.

Lua lets you package all three steps, the read, the check and the write, into a single unbreakable transaction that the Redis server executes all at once, without any interruption.

It completely solves the race condition.

And the synchronization issue.

That's about making sure all the different rate limiter servers have the same up -to -date count for every user, right?

Yeah, if a client talks to server X and then their very next request goes to server Y, both X and Y need to know the combined total.

Solutions like sticky sessions, which try to force a user to always hit the same server, just don't scale.

So you need a central source of truth.

You must have a centralized data store.

Redis, again, is the answer.

It acts as that single source of truth for all the counter data, ensuring you get consistent decisions across the entire fleet.

Okay, but consistency is one thing.

What about raw performance for a global audience?

How do you optimize latency for users who are on the other side of the world from your main data center?

We rely on multi -data center setups and edge servers.

Think of Cloudflare's infrastructure.

By distributing the rate limiter itself to edge servers that are geographically closer to the users, you route traffic to the nearest point and drastically cut down the latency of the check.

But then you have to synchronize data between those data centers.

Right, and for that we typically rely on an eventual consistency model.

Absolute consistency would just be too slow.

It does introduce a small trade -off though.

Which is?

There might be a tiny window where a user's counter is slightly out of sync between continents, meaning they might be slightly over or under -throttled for a moment.

But that small delay is usually considered an acceptable price for the massive latency improvement.

We've built it, we've solved the distributed problems, we've optimized it, now we have to monitor it.

What are we measuring to make sure it's actually working?

You're monitoring two things, rule effectiveness and algorithm effectiveness.

If your monitoring shows a ton of legitimate requests are getting dropped, your rules are probably too strict and you need to relax them.

And for the algorithm?

If you see unexpected spikes still overwhelming the system, your algorithm might not be handling bursts well.

That might force you to switch, maybe from a sliding window counter to a token bucket or just to tune the bucket size and refill rate.

Monitoring is that essential feedback loop.

This has been a really, really technical deep dive into system stability.

We covered the five big algorithms.

The bursty token bucket, the steady leaking bucket, the simple but flawed fixed window counter and then the two fixes for that.

The super accurate but memory heavy sliding window log and the efficient hybrid, the sliding window counter.

And remember, rate limiting goes even deeper than what we talked about.

You can look into hard versus soft rate limiting.

One is a strict cap.

The other allows temporary overages.

You can also do it at different layers of the network stack.

We focus on the application layer, but it's often done at the IP layer with firewall rules.

Right.

And clients have a role to play too.

Using client side caching and proper back off logic is so important to avoid making the problem worse when you do get throttled.

So what does this all mean for you?

If you were designing a system for a site known for highly unpredictable traffic spikes, like say ticket sales for a huge concert,

which algorithms trade -offs that balance of accuracy versus memory and speed would you ultimately prioritize?

The answer involves some pretty serious trade -off evaluation.

Indeed.

Thank you for joining us for the deep dive.

We'll catch you next time.

β“˜ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Rate limiting serves as a critical traffic control mechanism in distributed systems, protecting backend infrastructure from resource exhaustion caused by malicious denial of service attacks, accidental traffic surges, or bot activity while simultaneously reducing operational expenses through selective request filtering. Server-side implementations, typically deployed as middleware components within API gateways in microservice architectures, intercept incoming requests before they reach application servers, making them far more reliable than client-side approaches. The core challenge in rate limiter design involves managing high request volumes across geographically distributed systems while maintaining minimal latency and memory footprint. Five distinct algorithmic approaches address this challenge with varying tradeoffs. The Token Bucket algorithm provides flexibility by accumulating permission tokens at a fixed rate, permitting traffic bursts when the token pool remains full, making it ideal for variable workload patterns. The Leaking Bucket algorithm enforces strict request processing rates by maintaining a FIFO queue that drains at constant speed regardless of input volume. The Fixed Window Counter approach tracks request counts within discrete time intervals but suffers from edge-case vulnerabilities where traffic concentrated at window boundaries can violate intended quotas. The Sliding Window Log algorithm delivers accuracy by recording precise timestamps for each request but demands substantial memory resources proportional to traffic volume. The Sliding Window Counter algorithm combines efficiency with reasonable precision by calculating rates using the current window's count plus a weighted overlap from the previous window, effectively smoothing traffic patterns. Implementation in production systems typically leverages Redis or similar in-memory stores for counter persistence and expiration management, avoiding disk I/O penalties. Distributed deployments introduce synchronization challenges across multiple rate limiter instances, solved through centralized data stores, while concurrent counter increments require atomic operations via techniques like Lua scripting to prevent race conditions. Rejected requests receive HTTP 429 status codes alongside informative headers such as X-Ratelimit-Remaining and X-Ratelimit-Retry-After, guiding client retry behavior. Performance optimization strategies include deploying edge servers across multiple data centers to minimize latency for geographically dispersed users and adopting eventual consistency models for state propagation across distributed infrastructure.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML β™₯