Chapter 13: Design a Search Autocomplete System

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to Deep Dive.

Today we are taking a scalpel to something you use every single day, probably without even thinking about it, the Search Autocomplete system.

Yeah, that little typehead box.

It seems so simple,

but designing one that works for millions of users is a massive challenge.

It really is.

So our mission today is to walk you through that blueprint step by step.

We're going to focus on performance, on scalability, and really on how to make this thing just incredibly fast.

Exactly.

And to do that, we have to start with some ground rules.

We had a few key requirements we locked down.

First, we're only doing prefix matching.

So if you type car, you get carpet, not race car.

Okay, prefix matching.

What else?

Second, we're only returning the top five suggestions.

Our K is five.

And third, they have to be sorted by how popular they are historically, their frequency.

Right, not just alphabetical.

And the last one.

For simplicity's sake, just for the core design, we're sticking to lowercase English characters.

No special characters or anything like that for now.

Okay, let's unpack this.

Starting with the scale.

We're not building this for a small blog.

We're talking 10 million daily active users, DAU.

And that number, 10 million, it changes everything.

It immediately sets our most important goal, which is just pure speed.

How fast are we talking?

The target is under 100 milliseconds, and that's not an arbitrary number.

It's tied to human perception,

anything slower than that, and you feel the lag.

The user feels the system stuttering.

And the number of requests is just staggering.

I mean, think about typing the word dinner.

That's six separate requests, right?

D, D, D.

Exactly.

An average typing session might generate, say, 20 requests.

So you take 10 million users, multiply that out.

You quickly get to some wild numbers.

We're looking at around 24 ,000 queries per second on average.

On average.

And at peak times, you can pretty much double that.

So we have to design for something closer to maybe 48 ,000 queries every single second.

Almost 50 ,000 requests a second, and each one has to come back in under 100 milliseconds.

That's our challenge.

That's the challenge.

So high level, the system kind of breaks into two main parts.

You've got the data gathering service, which is the back end thing collecting all the queries.

Right.

And then you have the query service, which is the front facing part that actually gives you the suggestions.

So let's start with the query service.

What's the most basic, simplest idea that we know is doomed to fail?

Huh.

Yeah, the classic first attempt.

You'd probably use a standard relational database.

You just make a table with two columns,

the query string, and account for its frequency.

And the query would be something like select from table where query like prefix percent.

Order by frequency, DC limit five.

Looks simple.

And for a hundred users, it would probably work just fine.

Here's where it gets really interesting.

Why does that simple approach just completely fall apart when the data gets big?

It's that like prefix percent clause.

It's a performance nightmare of scale.

Databases are just not optimized for tens of thousands of prefix based text searches per second on a massive, constantly growing table.

The database has to do way too much work for every single keystroke.

It's scanning indexes, filtering.

You've blown past your hundred milliseconds before you even get started.

Instantly, the latency would be terrible.

So we need a different tool for the job, a specialized tool.

And that tool is the try, also known as a prefix tree.

Exactly.

It's a data structure that's built from the ground up for exactly this kind of problem, super efficient string retrieval.

So can you sort of describe what a try looks like for us?

Think of it like a tree.

The root node is just an empty string.

Each branch from a node represents a single character.

So to find the word deep, you'd go from the root, well, the D branch, then the E branch, then E, then P.

Okay.

But a basic try just tells you if the word exists, we need to sort by popularity.

Right.

So we have to customize it.

At each node that represents the end of a query, we also store its frequency count.

We have to bake that popularity data right into the structure itself.

So if we use that customized try, what's the time complexity look like at first?

Well, step one is finding the prefix node.

That's OP, where P is the length of the prefix.

Yeah.

So it's four steps.

That's fast.

But that's where the good news ends, right?

Yeah.

This is where it gets slow.

Step two is you have to then traverse the entire subtree underneath that prefix to find all the possible words.

We call that OC, where C is the number of children.

And C could be millions if someone just types T.

Exactly.

And then the final, fatal step.

You have to sort all of those millions of children to find the top five.

That's an OC log C operation.

OC log C.

That is a computational disaster.

You're doing a massive sort operation for every single keystroke.

We're talking seconds of delay, not milliseconds.

It completely fails our speed requirement.

So this brings us to that classic system design moment.

We have to trade space for time.

Because that 100 millisecond response is non -negotiable.

So we're willing to use more memory to get more speed.

A lot more memory.

We make two really critical optimizations.

First one's simple.

We cap the prefix length, let's say, at 50 characters.

No one's typing an autocomplete query longer than that.

OK.

So that makes finding the prefix, which was OP,

effectively O1, or constant time, because P is now bounded by a small constant.

Correct.

But the second optimization is the really big one.

Instead of finding and sorting the results on the fly, we pre -calculate the top five suggestions, and we store that list directly at every single node in the try.

Hang on.

At every node.

So the node for D has a stored list of the top five words, starting with D.

And the node for D has its own list.

Yes.

Every single one.

That seems like a colossal amount of redundant data.

I mean, the memory footprint must just explode.

Is there really no way around that?

There isn't.

If you want absolute speed, that's the trade -off.

We are intentionally paying a massive memory penalty, because retrieval has to be instantaneous.

We're storing the answer for every possible prefix ahead of time.

Wow.

So with that change, let's look at the time complexity again.

Finding the prefix node is O1.

And returning the top five is also O1, because the list is just sitting there waiting.

We just grab it.

So the whole operation becomes O1.

Constant time.

We hit our speed target.

We hit the target.

The query service is now blindingly fast.

But that speed is useless if the data is stale.

So now we have to figure out the other half of the system, the data gathering service.

Right.

How do we feed this thing with fresh data without slowing it down?

I mean, updating the try in real time with every search is obviously out of the question.

Completely impractical.

You'd have billions of writes a day competing with your reads.

It would crush the query servers.

Plus, the top suggestions don't really change second by second.

Exactly.

So you need an asynchronous offline process, a batch process.

The standard way to do this is to rebuild the whole try

periodically, maybe once a week.

OK.

So walk us through that data pipeline.

Where does it start?

It starts with analytics logs.

These are just massive, raw, append -only files.

Every time someone searches, it logs the query.

A timestamp, that's it.

Billions of them.

It's just a firehose of raw data.

A firehose.

So that data flows into the next stage.

The aggregators.

Their job is to process those messy logs.

They run a big batch job, maybe weekly, and just sum everything up.

They count the occurrences of every unique query.

So they turn the raw logs into a clean frequency table.

A very clean, aggregated frequency table.

That table then gets passed to the workers.

These are servers whose only job is to take that frequency data and build our big, beautiful, optimized try structure.

And the workers are the ones that do that heavy lifting of precalculating the top five suggestions for every single node.

That's their main job.

And once the try is built, they save a snapshot of it to the trydb.

Our persistent storage.

For the database, we could use a document store.

Like MongoDB, right?

Just serialize the whole try object and store it.

You could.

Or you could use a key value store.

That's a bit more complex because you'd have to break every node into a key value pair.

The prefix is the key.

The node data is the value.

Why would you do that?

It sounds more complicated.

It's more work up front.

But it can make updating or looking up individual nodes faster.

And it can simplify sharding the database later on.

Either way, the DB is just for persistence.

Because the live query service isn't hitting the database directly, it's hitting the try cache.

Right.

A big distributed in -memory cache, like Redis.

The most recent weekly snapshot of the try lives entirely in memory, in that cache, for those super -fast O1 reads.

So user types D.

The request hits a load balancer, goes to an API server.

And that server immediately asks the try cache for the data at the D node.

If there's a cache miss for some reason, then it goes to the DB, gets the data, and puts it back in the cache for the next request.

Okay, that makes sense.

Now, beyond the server stuff, there are also a few client -side tricks we can use to make things even more efficient.

For sure.

I mean, on the web, you're using AJAX requests, so the whole page doesn't have to refresh.

That's standard.

But browser caching seems like a really big one.

It's huge.

You can tell the user's browser to cache the suggestions for, say, an hour.

So if they type deep, leave, and come back 10 minutes later and type deep again, the browser doesn't even bother making a request to our servers.

It already has the answer.

That must save a ton of load for common queries.

A massive amount.

And on the back end, there's data sampling.

Instead of logging every single one of the billions of queries, maybe we only log, say, one out of every hundred.

Why is that?

Okay, don't we lose accuracy?

A little.

But we don't really care about the exact frequency of some rare typo.

We just need a statistically significant sample to know what's generally popular.

It saves an enormous amount of storage and processing power.

Let's talk about maintenance.

You said we avoid real -time updates.

Why is updating even a single query's frequency so painful in this system?

It's because of our big optimization.

Remember how every node stores the top five results of its children?

Yeah, right.

It's a cascading problem.

It's a huge cascading problem.

If the frequency for Zebra suddenly goes up, you can't just update the Zebra node.

You have to go back and check if that changes the top five list for its parent, Zebra.

For Zeb, and Zeb, and Zye, all the way back to the root of the tree.

All the way back.

It's this incredibly slow, expensive cascading update.

It's much, much easier and more reliable to just rebuild the whole thing from scratch once a week.

So what about deleting things?

I mean, you can't wait a week to get rid of a hateful or illegal suggestion.

Absolutely not.

For that, you need an immediate solution.

You put a filter layer right in front of the try cache.

It's basically a block list.

If a query matches the list, the filter stops it cold before it ever returns a result to the user.

So it's like an emergency break.

The suggestion is blocked instantly.

And then the actual physical removal from the database happens during the next weekly rebuild.

Precisely.

Okay, final big challenge.

What happens when our try is just too big to fit on one server?

We have to shard it.

You have to.

And the naive approach here is to shard it alphabetically.

A through M on one server, N through Z on another.

And why is that a terrible idea?

Because language isn't evenly distributed.

Think about how many more words start with S than with X.

You get these huge hotspots.

One server would be handling 80 % of the traffic while the other is just sitting there bored.

So that creates a massive imbalance.

The solution is something called smart sharding.

Right.

Instead of using the alphabet, you use historical data to draw the boundaries.

You might analyze the traffic and find that the letter S is so busy, it needs its own dedicated server, its own shard.

While maybe U through Z can all share another one because they have so much less traffic combined.

Exactly.

The boundaries are based on actual load, not arbitrary letters.

Yeah.

And you have a little service, a shard map manager, that keeps track of which prefix lives on which server.

That makes the distribution much more even and scalable.

Okay, so before we wrap, what about extensions, like multiple languages?

For that, your try nodes just need to support Unicode instead of basic ASCII.

And if different countries have different popular queries, you'd probably build separate country -specific tries and then use a CDN to serve them from a location close to the user for lower latency.

If we connect this to the bigger picture, we've built an amazing system for historical data, but it has one really big blind spot.

It does.

It completely fails when it comes to real -time trending queries.

If some major news event happens right now, our weekly update cycle is way too slow to catch that sudden spike in searches.

The whole system is built on this batch processing model.

By the time our weekly job runs, the trend is already over.

The system as we've designed it is inherently incapable of showing what's trending right now.

To do that, you'd need a totally different approach.

You'd need to shift from batch processing to stream processing.

A completely different back end.

You'd be looking at tools like Apache Spark Streaming or Kafka.

Systems designed to process a continuous unending stream of data in real -time.

But that's a whole other deep dive.

It is.

But for today, I think we've hit our goal.

We've designed a super scalable autocomplete system by making one core decision prioritize speed above all else.

We did with that custom try structure, making that big trade -off of using more memory to get a one retrieval time.

And then we supported it all with a really robust asynchronous data pipeline.

It's an architecture built for extreme performance.

We hope this gives you something to chew on, something to think about the next time you see those suggestions pop up as you type.

Thanks for joining us.

Thank you for diving deep with us into the architecture of Search Autocomplete.

We'll see you next time on the Deep Dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Building a search autocomplete system requires balancing speed, relevance, and scale to deliver suggestions within 100 milliseconds as users type. The system architecture separates concerns into two primary components: a Data Gathering Service that collects and aggregates billions of search queries from user interactions, and a Query Service that processes incoming prefixes and returns ranked results. The central design challenge involves efficiently retrieving the top five most popular suggestions for any given prefix without relying on inefficient relational database queries. A trie, or prefix tree, provides the foundational data structure by organizing strings hierarchically, allowing rapid traversal to any prefix node. However, standard trie traversal proves inadequate for latency requirements, necessitating two critical optimizations: constraining the maximum prefix length to limit tree depth and, more importantly, caching the top K most frequent queries directly at each node. This caching strategy collapses query complexity to constant time after reaching the prefix node, trading additional memory usage for dramatically reduced response latency. Given the massive query volumes, typically 48,000 queries per second at peak load, the Data Gathering Service operates asynchronously through offline batch processing rather than real-time updates. The system rebuilds the entire trie weekly from aggregated query logs using distributed workers, then persists the completed structure in a document or key-value store and loads it into a distributed in-memory cache accessible to API servers. A filter layer removes inappropriate or harmful suggestions before results reach users. Horizontal scaling demands sophisticated sharding based on actual query distribution patterns rather than naive approaches like sharding by first character, requiring a dedicated shard manager to maintain distribution knowledge. Beyond the core system, extensions such as multi-language support through Unicode encoding and real-time trending query detection through stream processing with dynamically weighted ranking models demonstrate how the architecture accommodates evolving requirements while maintaining performance guarantees.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 13: Design a Search Autocomplete System

Related Chapters