Chapter 1: Proximity Service System Design

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today, we're tearing down the blueprint for one of the most critical systems in modern tech,

the proximity service.

Right.

This is the engine behind, you know, everything from Yelp to Google Maps.

The thing that instantly tells you where the newest coffee shop or gas station is.

Exactly.

So our mission today is to walk through the design process as if we were building this system for you right here.

We need something with high availability, low latency that takes a user's latitude and longitude and just, boom, returns businesses nearby.

Up to about, what, 20 kilometers?

Up to 20 kilometers, yeah.

Yeah.

And it has to handle massive scale without, you know, melting down.

OK, so let's set some ground rules.

What are the core deliverables?

And maybe more importantly, what are the constraints we're working with?

On the functional side, there are three key things.

First, the obvious one, return all businesses based on location and radius.

Second, and this is a huge one, business owners can add or update their info.

But it doesn't have to be real time.

Exactly.

The changes do not need to be reflected in real time.

We have a business agreement that new information is effective the next day.

Oh, that next day update is just gold.

Yeah.

I mean, if we had to manage real time consistency for 200 million businesses, the complexity would just go through the roof.

It's a total game changer.

Let's just manage consistency on like a 24 hour cycle instead of milliseconds.

Right.

And the third functional bit is simple.

Customers need to be able to view detailed business info.

Now for the non -functional requirements, this is where it gets tough.

Low latency is everything.

Low latency is absolutely critical.

Users expect instant results.

And of course, we have to comply with data privacy laws and the whole system needs to scale and be highly available for peak traffic.

Before we start drawing boxes and arrows, we have to know the scale of the problem.

Let's do a quick back of the envelope calculation.

We're talking 100 million daily active users.

And a directory of 200 million businesses.

So we need to figure out the search queries per second, the QPS.

Let's say an average user makes about five searches a day.

And to make the math easy, we'll use the approximation of 100 ,000 seconds in a day.

It's close enough.

Right.

It's really 86 ,400, but this is much easier for mental math.

So it's 100 million users times five queries, and you divide that by 100 ,000 seconds.

That brings us right to 5 ,000 queries per second.

5 ,000 QPS.

And that's just our baseline, our sustained throughput.

It doesn't even account for peak spikes or detail view requests.

This system really, really needs optimization.

Okay.

5 ,000 QPS is our target.

So let's get into the design.

How are we structuring this?

What does the API look like?

Structurally, we'll have two main services behind a load balancer.

But first, the API.

The main read endpoint is super straightforward.

Get zv1 search nearby.

Just takes latitude, longitude, and an optional radius.

And the upticks.

All handled by a separate set of business management APIs.

And what about the data itself?

This is so incredibly read heavy.

The writes are pretty infrequent, right?

Very infrequent.

That pattern makes a standard relational database, something like MySQL, a really strong and reliable choice.

It's great at scaling reads through replication.

So what's the schema look like?

It kind of reflects that split.

We have a main business table with all the detailed info address, hours, photos, all indexed by a business ed.

But the real magic, the part that handles that 5 ,000 QPS, is a second specialized table.

The geoindexed table.

The geoindexed table.

Yeah.

It exists purely for those spatial operations.

Okay, so picture the architecture for us.

We've got two stateless services.

Yes.

The first is the location -based service, or LBS.

This is the search engine.

It gets the high QPS requests, finds the right business IDs, and because it's stateless, we can just scale it horizontally forever.

And the second one?

The business service.

It handles the low frequency writes from business owners, but also the high QPS reads when a user clicks to see a business's details.

And both of these are talking to a database cluster.

Standard, primary, secondary setup.

Exactly.

The primary handles the writes.

And multiple read replicas handle the huge volume of read traffic.

And again, because of that next day update rule, we just don't care about a little bit of replication delay.

That brings us to the absolute core of the problem, the nearby search itself.

The naive approach, the first thing you might think of, is a simple two -dimensional SQL query.

Tell us why that's a complete disaster at this scale.

Right.

The intuitive thing is to just do a SQL query with between clauses on latitude and longitude columns.

And you'd index both of those columns.

It seemed reasonable.

But standard database indexes are really only good at querying one dimension at a time.

So when you query two, the database has to go find one massive set of data for the latitude range.

And another massive set for the longitude range.

Yes.

And then it has to do this incredibly slow massive intersection of those two giant sets to find the points that are in both.

Is it basically a table scan?

It just doesn't work.

So the goal then is to somehow map that 2D data, the lat and long, into a single dimension.

Exactly.

So we can use a highly efficient single column index.

We thought about a simple grid, but that just falls apart because of population density.

You'd have one grid cell over Manhattan that's just slammed with data.

While 99 % of the grid cells over the ocean are totally empty, it's not a good fit.

So we need something more dynamic.

Which brings us to GeoHash and Quadtree.

Let's start with GeoHash.

GeoHash is really elegant.

It takes the 2D coordinates and it reduces them down to a single one dimensional base 32 string.

How does it do that?

It recursively divides the world into four quadrants, encoding the position as it goes.

And the length of that final string determines the precision, or the size, of the grid cell.

So a 500 meter search radius might map to a GeoHash of length 6.

And a bigger 20 kilometer radius would use a shorter, coarser GeoHash, like length 4.

Precisely.

This sounds perfect.

You get a single string you can index.

And there's a huge catch, right?

The boundary issues.

This is the critical design trap you have to account for.

GeoHash guarantees that if two points share a prefix, they are close.

But the reverse isn't true.

Proximity does not guarantee a shared prefix.

Give us an example.

Imagine you're standing 10 meters away from a restaurant, but you're on one side of a major GeoHash boundary, like the prime meridian, and it's on the other.

Your GeoHash string and the restaurants could be completely different, like no shared prefix at all.

So if your service just searches for businesses with your exact GeoHash prefix,

you'd miss the restaurant that's literally right across the street.

A total failure for the user.

It would be.

And you'd see it all the time.

The solution is crucial.

The system can't just search the user's current GeoHash grid.

It also has to calculate and search its eight immediate neighbors.

Hold on a second.

If we're searching the current grid plus eight neighbors, aren't we doing nine separate queries?

Isn't that going to crush our performance, especially at 5 ,000 QPS?

That's a great question.

But the key is that we're doing nine lookups on small indexed grids.

Not one massive unindexed query.

We run those nine lookups in parallel, merge the results, and it's still way, way faster than that doomed 2D intersection we talked about.

OK, that makes sense.

A worthwhile trade -off.

And if we don't find enough businesses,

we just drop the last character from the GeoHash strings and search again, effectively widening the search area automatically.

All right.

Let's talk about the alternative, the quadtree.

This is a tree -based approach.

How does it improve on GeoHash's fixed grid sizes?

A quadtree is an in -memory data structure.

It also recursively subdivides space into four quadrants, but it stops dividing when a leaf node meets some criteria, like having a hundred or fewer businesses in it.

So dense areas like downtown London get broken down into tiny, tiny nodes.

While huge empty areas like the Pacific Ocean might be just one single massive node,

it handled that uneven density perfectly.

You said it's in -memory.

How big would this index actually be for our 200 million businesses?

The math is surprisingly manageable.

If you add up the coordinate pairs, the pointers, and a little metadata, the total index size comes out to around 1 .71 gigabytes.

That's nothing.

That fits easily in a modern server's RAM.

Lookups would be incredibly fast.

Lightning fast.

OK, so if quadtree handles density better and it fits in memory,

why did we even talk about GeoHash?

What's the catch?

What's the operational trade -off here?

Well, first, building that quadtree takes time.

A few minutes every time a server starts up.

That makes deployments more complicated.

But the real headache, the real problem is updates.

Adding or deleting a business means traversing the tree, maybe rebalancing it.

And if you have high throughput rights, you need really sophisticated, multi -threaded locking mechanisms to stop the index from getting corrupted.

It's very, very difficult to get right.

So with GeoHash, an update is just a simple row, insert, or delete in a database.

It's atomic.

It's simple.

Concurrency is basically a solved problem.

Exactly.

Quadtree gives you maybe better search characteristics, but at the cost of massive operational complexity.

And given our next day updates, GeoHash is a very compelling choice because of its simplicity.

We should probably mention Google's S2 as well.

We should.

It's sort of the state of the art.

It maps the sphere using something called a Hilbert curve, which preserves proximity really well.

But for this discussion, GeoHash and Quadtree are more practical to dissect.

OK.

We have our indexing algorithm.

Now step three, scaling and optimization.

Let's go back to the database.

We said we'd shard the business table by business in.

How do we structure and scale the geospatial index table?

We use a compound key, GeoHash and business in together.

This is so much better than, say, storing a JSON array of business IDs because it makes and removes atomic without needing complex row locking.

Now for scaling that index, we have 5 ,000 QPS hitting it.

Does that 1 .71 gigabyte index need complex application level sharding?

This is the key insight.

No, it absolutely does not.

We don't need sharding for data size 1 .7 gigabytes fits on a single server.

We only need to scale for read throughput.

What's the better strategy?

The better strategy is to just use standard database replication.

We use multiple read replicas to spread out that massive read load.

It's way, way simpler than building custom application layer sharding logic.

And we sacrifice nothing.

The right tool for the job.

Brilliant.

Now let's talk caching.

Why do we even need a cache if the whole index fits in the database server's memory?

It's still a good idea.

It reduces database load even further and can improve user latency.

But the cache key selection is everything.

You can't use raw user coordinates.

Right.

Because your GPS signal jitters slightly.

You'd have a cache miss on every single request.

The cache would be useless.

Exactly.

The key is to use the geohash string itself as the cache key.

Any small position change that keeps you in the same grid cell will result in a cache hit.

It's the perfect balance.

And what are we caching exactly?

Two types of data.

Two types.

First, the geohash cache.

The key is the geohash string at different precision lengths four, five, and six.

The value is a list of all the business IDs in that grid.

This whole data set is also small enough, maybe five digs, to be deployed globally.

The business info cache.

The key here is the business ad.

And the value is the fully hydrated business object.

A hydrated business object.

Bring that down for us.

It just means we take that bare business ID from the index and we fetch the whole package.

The name, address, photos, rating, everything the user actually sees on their screen.

We cache that whole package so we don't have to hit the database every time someone clicks on a business.

And again, our next day update requirement makes cache invalidation easy.

It's all handled by a simple nightly batch job.

What a win.

It's a huge win for stability.

And for latency, we also have to deploy our LBS in multiple regions around the world.

US, West, Europe, Asia.

Right.

And that does three things.

It reduces latency because you're physically closer to the user.

It distributes traffic.

And it helps with data privacy laws by keeping location data inside its region of origin.

Okay, let's walk through the full end -to -end read flow.

I open the app.

I search for restaurants.

What happens?

Okay, step one.

Your phone sends your location and a radius, say 500 meters, to the load balancer.

It routes you to the nearest regional LBS.

The LBS determines the geohash length it needs.

For 500 meters, that's length six.

Step three.

It calculates your current geohash key and its eight neighbors.

Then what?

Step four.

The LBS calls the local geohash Redis server in parallel for all nine geohashes.

It gets back a list of candidate business IDs.

Parallel fetching is key for speed.

Step five.

With those IDs, it then calls the business info Redis cache to get the fully hydrated business objects.

And finally, step six.

The LBS does the final precise distance calculation on this now much smaller set.

Applies any last minute filters.

Like, show me places open now.

Exactly.

It ranks them and sends the final list back to your phone.

And writes are just beautifully simple.

An API call hits the business service, writes to the primary DB, and the cache gets updated overnight.

We did it.

A highly available, high throughput proximity service.

Good.

We covered scoping, QPS, the LBS and business service split, the trap of 2D searching, and how geohash fixes it while handling the boundary problem.

Then we scaled it with replicas instead of sharding and layered on a super efficient cache using the geohash as the key.

So,

looking at all these engineering tradeoffs, what's the big takeaway?

We saw that geohash updates are incredibly simple.

Just an atomic row operation.

Right.

And I think the final thought for you to reflect on is just how much engineering complexity we avoided by choosing that simplicity.

Think about the massive effort needed to manage an in -memory quadtree.

The startup latency, the complex rebalancing, and especially the multi -threaded locking you'd need for concurrent updates.

It's a nightmare at scale.

It really is.

And it just highlights how valuable the simplicity of geohash is.

Where using a standard database with atomic row operations solves almost all of your concurrency problems right out of the box.

It lets your engineers focus on scaling reads, not on managing these crazy complex memory structures.

Simplicity often scales better than theoretical perfection.

That was a fantastic deep dive.

Thank you for walking us through that entire blueprint.

My pleasure.

Thank you for joining us.

If you've ever wondered what it takes to build the infrastructure that can find anything anywhere instantly, you now have the map.

Keep learning, keep designing, and we'll catch you on the next deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Designing a proximity service requires architecting a system capable of rapidly identifying nearby points of interest while managing enormous read volumes and strict latency requirements. The foundational phase establishes critical performance targets, including sub-second response times and compliance with data protection frameworks such as GDPR and CCPA, alongside capacity planning for thousands of concurrent search operations per second. The resulting architecture partitions responsibilities between the Location-Based Service layer, engineered as a read-optimized, horizontally scalable component, and the Business Service tier, which manages infrequent data updates and delivers comprehensive business details. Because read operations dominate the workload, database scalability leverages primary-secondary replication with eventual consistency guarantees, since business information modifications only require propagation within a daily cycle. The central technical problem involves efficiently storing and querying two-dimensional geographic coordinates. Naive approaches such as linear scanning or uniform grid partitioning fail because real-world location data exhibits highly skewed spatial distribution, creating severe performance bottlenecks in densely populated regions. Advanced geospatial indexing techniques address this by converting two-dimensional coordinates into one-dimensional representations, with Geohash and Quadtree serving as the primary strategies. Geohash provides simplicity and straightforward cache integration using stable string prefixes rather than variable coordinate pairs, though proximity queries near string boundaries require careful handling since nearby physical locations may not share common hash prefixes. Quadtree employs recursive spatial subdivision weighted by population density, enabling efficient retrieval of k-nearest neighbors through its hierarchical structure, though tree rebalancing during index updates introduces operational complexity. For data persistence, the business information table uses business ID-based sharding to distribute write and update operations, while the geospatial index remains unsharded and instead scaled through read replicas, avoiding the architectural complexity of partitioning spatial data. Multi-layer caching enhances throughput by storing both business identifier lists and fully hydrated business objects in memory, keyed by stable geohash values to maximize cache hit rates across repeated queries. Global deployment across multiple geographic regions and availability zones minimizes cross-region latency, balances traffic distribution, and satisfies data residency compliance requirements across different jurisdictions.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥