Chapter 4: Distributed Message Queue Design
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
All right, let's dive right in.
So,
if you're building any kind of modern large -scale application, you know, an e -commerce platform, a social media feed, you pretty quickly find out having all your components tightly connected is a recipe for disaster.
It really is.
One little failure can bring the whole system down.
So, today we're tackling the system that fixes that exact problem.
We're going to try and design a high -performing distributed message queue,
the sort of communication backbone of modern software.
It's basically the architectural glue.
When you break systems down into all these little microservices or independent blocks, the message queue is that crucial layer that lets them all coordinate.
It lets them talk to each other without even knowing the other exists.
Exactly.
So, what are the, say, four big architectural wins you get from putting an MQ in your system?
Okay, they're really powerful, but also pretty straightforward.
First up is decoupling.
Your services are totally independent.
You can update one without breaking anything up.
Okay, that's huge.
What's number two?
Improved scalability.
Producers and consumers can scale on their own.
So, if you get a big traffic surge, you just spin up more consumers to handle the load.
Got it.
Third is increased availability.
If one of your consumers crashes, it's fine.
The producer can just keep writing messages to the queue.
The system stays up.
And the last one is performance.
Better performance, yeah.
And this is the big one.
It makes asynchronous communication possible.
It means your producers don't have to sit around and wait for slow consumers.
They just drop the message and move on.
That really sets the mission for us then, because traditionally you had message queues over here and then event platforms over there, but now the lines are blurring.
They're totally converging.
So, our design has to be a modern, robust,
distributed MQ that also has some of those streaming features like high durability and letting you consume messages more than once, and crucially, guaranteed ordering.
So, let's nail down those requirements.
Functionally, what are we looking at?
We're talking text messages, probably in the kilobyte range.
We need to support repeated consumption,
keep the data around for at least two weeks, and this is critical.
We have to guarantee stripped message ordering inside our structure.
And what about that really complex one, the configurable data delivery semantics?
Right.
We have to support at least once delivery at most once, and ideally the really tough one, exactly once.
And non -functionally.
It's got to be distributed and scalable so it can handle sudden traffic spikes.
It needs to be persistent, durable across multiple machines, and this is a cool feature.
Clients should be able to tune it for either high throughput or for low latency.
Okay.
That really sets the stage.
So, let's move on to step two.
The high -level design, I'm picturing the basic flow.
Producers send messages, the MQ is in the middle, and consumers subscribe and pull messages out.
Yep.
That's the basic loop.
And to organize that flow, we're definitely going with the
Why PubSub specifically?
Well, because our requirements demand that two -week persistence and repeated consumption.
PubSub is perfect for that.
Messages get sent to something called a topic, and then any client that subscribed can consume them.
The topic is just, you know, a category name.
Okay.
So, a single topic is great, but how do you scale that beyond what one server can handle?
How do you shard it?
We use partitioning.
We just shard the topic into these smaller units called partitions.
And a partition is the key to parallelism here.
It's the smallest unit of parallelism, yes.
And critically,
inside each partition, it maintains strict FIFO first -in, first -out order.
So, if you're a consumer, how do you keep track of where you are in that stream of messages?
That's done with an offset.
Every single message inside a partition has its own unique sequential offset number.
And these partitions, they have to live on physical servers, which we call brokers.
Okay.
So, let's say a producer has, I don't know, a thousand messages all for the same user.
You don't want those scattered across different partitions, right?
That would break the ordering for that user's activity.
Exactly.
You want them all to land together.
So, the producer can use an optional message key, like the user ID.
You hash that key.
You hash that key, and the hash value determines which specific partition the message goes to.
That ensures all related data sticks together.
This is starting to sound a lot like a pure streaming platform, but you said we also need to support the old school point -to -point queue model.
How do you do that with topics and partitions?
This is the really clever part.
It's the concept of a consumer group.
A consumer group is just a set of consumers that are all working together on the same job.
So, how does that simulate both models?
If you have multiple different consumer groups all subscribing to the same topic, that's publish -subscribe.
But if you put all your consumers for a specific task into a single group, it acts like a point -to -point queue.
The system makes sure only one consumer in that group processes any given message.
Okay, but there's a really critical rule here, right?
A constraint you have to enforce to make sure that ordering is actually guaranteed inside the group.
Yes, and this is non -negotiable.
To guarantee that FIFO order, we enforce that a single partition can only be consumed by one consumer within the same consumer group.
Why is that so important?
Because if you had two consumers from the same group reading the same partition, they'd be fighting over who gets to commit the offset, and your message order would be destroyed instantly.
You sacrifice a bit of parallelism for that strict ordering guarantee.
So, looking at the architecture now, we have our clients, producers, and consumer groups.
They're talking to the brokers, and the brokers are doing all the heavy lifting, right?
Data storage, state storage for the offsets, and metadata storage for topic configs.
And you need a brain to manage all these distributed brokers.
That's where the coordination service comes in.
It's usually an external system, something highly consistent like ZooKeeper, etc.
And what does it do?
It handles service discovery so clients know which brokers are online, and it runs the critical leader election process to decide which broker is in charge of which partition.
Okay, let's get into the deep dive.
The big challenge seems to be getting high throughput and high data persistence.
How do we do that?
What are the key design choices?
There are three core decisions we make.
One, we use on -disk baby structures that are built for sequential access.
Two, we design our message format in a way that gets rid of expensive data copying.
And third, we use batching, aggressive batching everywhere we can to fight the overhead of lots of small IO operations.
Let's start with storage.
Our traffic is write heavy, read heavy, and it's all append only, so traditional databases are out.
Oh yeah, completely out.
They're optimized for random lookups and indexing, which is just massive overhead for us.
Instead, we lean on something called a write ahead log or wall.
And a wall is basically just a file you keep appending to.
That's all it is.
It's a series of append only files.
And the big insight here is that sequential disk IO, even on old school spinning hard drives, is incredibly fast.
We're talking hundreds of megabytes per second.
We choose sequential access over random access every time.
And you mentioned this log file is broken into segments.
How does that help with our two -week retention rule?
It makes management so much simpler.
New messages are only ever appended to the one active segment.
Once it fills up, it gets closed and becomes inactive.
And you can just delete the old ones.
Exactly.
The inactive segments are only for reading historical data.
So once a segment is older than two weeks, the broker can just instantly delete the entire file.
No complicated database cleanup needed.
Okay.
Second decision.
Eliminating data copying.
This means a strict contract for the message format.
What are the key fields in that structure?
Right.
The schema is everything.
You've got the message key for routing, the unique offset for tracking its position, and importantly, a CRC, a cyclic redundancy check for data integrity.
And by defining this structure, you can pass the raw bytes around without copying them into different memory buffers.
Precisely.
It can go straight from the producer's buffer to the network, and then from the network straight to the consumer's buffer.
It's super efficient.
Which leads right into batching.
You mentioned we build the routing logic and a buffer right into the producer -client library.
What's the trade -off a producer makes with its batch size?
It's the classic throughput versus latency battle.
If you use a large batch size, you can send a ton of data in one go.
That gives you really high throughput.
But the messages have to wait longer in the buffer.
Right.
So you get high latency.
A small batch size gives you low latency, but it's way less efficient overall.
The producer has to tune this for their specific use case.
Okay, let's flip over to the consumer side.
Yeah.
That brings up a classic question.
Yeah.
Do the brokers push data out to the consumers, or do the consumers pull it?
And most modern high throughput systems, including ours, land on the pull model.
Why is that?
Push seems like it would be lower latency.
It can be, marginally.
But pull is just so much better for scalability and being resilient.
The consumer controls its own rate, which prevents a slow consumer from getting overwhelmed by the broker.
But the downside of pull is that the
constantly asking, got anything for me?
And the broker says no.
Which is a problem, but it's solved neatly by using long pulling.
The pull request doesn't return right away.
It actually hangs open for a bit, waiting for new messages to arrive or for a timeout.
And the consumer just specifies the offset it wants to start from.
Yep.
It says, give me a batch of messages starting at offset X and the broker delivers.
Okay.
So once you have these consumers in a group, you need fault tolerance.
Let's talk about consumer rebalancing.
What happens when a consumer crashes or a new one joins?
This whole process is managed by a component called the group coordinator.
When a new consumer joins, the coordinator sees it, pauses everything, and triggers a rebalance.
It tells everyone to rejoin the group.
Exactly.
Once they're all back, the coordinator picks a leader from the consumers, and that leader creates a new partition dispatch plan.
So an explicit map of which consumer gets which partitions.
A perfect map.
Then consumption resumes.
The same thing happens if a consumer crashes and misses a heartbeat.
It gets marked as dead, and a rebalance is triggered instantly to reassign its partitions.
Let's talk about state metadata.
You need high consistency for things like partition assignments and, especially, the last consumed offset for each group.
You can't just have the brokers manage that themselves, can you?
No, you want to keep the brokers simple.
So we unify all of that, the coordination, the state storage, the metadata, into that external service we mentioned, like ZooKeeper.
Because ZooKeepers gives you those strong transactional guarantees.
It does.
It's perfect for safely storing the partition maps and the offsets, ensuring we never, ever lose a consumer's progress.
All right, on to durability.
Disks fail, brokers crash.
How do we make sure messages aren't lost?
Replication.
Simple as that.
Every partition has multiple replicas spread across different brokers.
You have one leader and several followers.
And producers only ever write to the leader.
Only to the leader.
The followers are constantly pulling data from the leader just to stay synchronized.
This brings us to a really interesting real -time trade -off, the in -sync replicas, or ISR.
Right.
The ISR is kind of like an exclusive club.
It's the set of replicas that are actually keeping up with the leader within some configured time limit.
If a follower gets too slow, maybe its disk is lagging, it gets temporarily kicked out of the ISR.
Why?
To prevent that one slow replica from bottlenecking the right performance for the entire partition, you're prioritizing speed, even if it means a tiny, temporary reduction in redundancy.
And this ISR concept directly leads to the most important trade -off for the producer, the Acknowledgement Setting, or ACK.
Let's walk through the three levels, starting with the safest, ACK or SA.
So with A -call, the producer has to wait.
It waits for its message to be saved by the leader and replicated to all the other members of the ISR before it gets a confirmation.
Super durable, but I'm guessing super slow.
The slowest, yeah.
You'd use this for, like, critical financial transactions where you absolutely cannot lose a single piece of data.
Okay, what's the middle ground?
ACK 1.
ACK 1 is much faster.
The producer gets its confirmation as soon as the leader alone has saved the message to its disk.
It doesn't wait for the followers.
So what's the risk there?
The risk is a small window where if the leader crashes right after it gets the message but before the followers can copy it, that message could be lost.
It's acceptable for a lot of low -latency apps where losing a rare event isn't the end of the world.
And finally, the speed demon.
ACK 0.
What use case is that for?
ACK 0, what use case is that for?
ACK 0 is just fire and forget.
The producer sends the message and moves on
No waiting for an ACK, no retries, latency is minimal.
This is perfect for things like high -volume metrics or logging where if you miss one beta point, the overall trend is still visible.
Okay, let's talk about scaling capacity.
Increasing partitions seems easy enough, but what about decreasing the number of partitions?
That is deceptively complicated.
You can decommission a partition, which means you stop sending new messages to it, but you cannot delete its data.
Why not?
Because that data has to stay available for the full two -week retention period, just in case some consumer group is still reading through that historical log.
So decreasing partitions is a very, very slow process.
We have to circle back to those data delivery semantics.
We know ACK 1 gives you at most once and ACK 1 or all gives you at least once, but with at least once, you can get duplicates, right?
You can.
If a consumer processes a message but then crashes before it can commit its new offset, when it restarts, it's going to fetch and process that same message again.
That's why your downstream systems often have to be idempotent.
Which leaves the big one.
Exactly.
Once delivery.
Why is this so hard in distributed systems?
Because it requires basically a distributed transaction across all the components, the producer, the queue itself, and the consumer's processing logic.
It has the highest complexity and performance cost, usually needing things like two -phase commits to make sure a message is processed and recorded at exactly one time.
You only use it when you absolutely have to, like for banking.
Exactly.
Okay, before we wrap up, two quick advanced features.
Message filtering.
Is it better to have the consumer pull everything and filter on its side?
No, that's a huge waste of network bandwidth.
It's much, much better to implement filtering on the broker itself, using tags or metadata, so the broker only sends the relevant messages to begin with.
And what about delayed or scheduled messages?
For that, you need some dedicated temporary storage on the broker and a timer.
The message gets held in that temp store until its scheduled time arrives, and only then does it get pushed to the actual topic for consumers to see.
Wow.
Okay, so we went from a simple idea decoupling all the way to engineering a system that has to manage disk access, network latency, distributed state, and fault tolerance at every single layer.
I think what really stands out is that every we made from using a wall to defining the ISR is a deliberate architectural trade -off.
We were constantly balancing durability versus latency and parallelism versus that strict message ordering.
Especially with that consumer group constraint.
Especially with that.
So wrapping up, what does this tell us about the systems we use every day?
It seems like this level of complexity requires specialized solutions for every little edge case.
Like what if a message just keeps failing to be processed?
Exactly.
You can't let that one bad message block the entire stream.
You have to design for that.
You create a dedicated retry topic.
Failed messages get moved off the main topic and onto the retry topic so they can be dealt with later without grinding the whole system to a halt.
It's that final layer of resilience.
A fantastic deep dive into the architecture of distributed systems.
Thank you so much for wonking us through the complexities of a modern message cube.
It was my pleasure.
And thank you for tuning in to the deep dive.
We hope you now feel thoroughly well informed on the inner workings of this foundational piece of modern software.
We'll catch you next time.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML ♥Related Chapters
- Design a Chat SystemSystem Design Interview - An Insider's Guide (Volume 1)
- Distributed Email Service DesignSystem Design Interview - An Insider's Guide (Volume 2)
- Ad Click Event Aggregation DesignSystem Design Interview - An Insider's Guide (Volume 2)
- Attribute-Driven Design – Creating ArchitectureSoftware Architecture in Practice
- Data at Scale in Interaction DesignInteraction Design: Beyond Human-Computer Interaction
- Data Gathering for Interaction DesignInteraction Design: Beyond Human-Computer Interaction