Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Dive.
Today, we're going to get our hands dirty with a real classic.
We're designing a URL shortener from the ground up.
Think tiny URL, but at a massive scale.
Exactly.
And our mission isn't just about, you know, making a long link shorter.
It's about building a system that can handle trillions of requests without breaking a sweat.
It's a fantastic scaling problem.
And like any good system design interview, the very first thing you have to do is set the scope.
You have to ask the right questions.
Right.
So for this, we've established three core use cases.
First, obviously shortening a long URL to a short alias.
The right path.
Second, redirecting that short alias back to the original URL.
The read path.
And third, the big one, making sure the whole thing is highly available, scalable, and fault tolerant.
And once you have that scope, you can start doing the math,
the back of the envelope estimations.
This is where it gets real.
Let's imagine a pretty demanding scenario.
Say 100 million new URLs are created every day.
Okay, 100 million a day.
That breaks down to roughly 11 ,160 write operations per second, which is a lot, but manageable.
But the reads, that's where the system really gets hammered.
It always is.
If you assume a pretty standard 10 to 1 read to write ratio, you're suddenly looking at 11 ,600 read operations every single second.
So right away, you know the design has to be optimized for reads heavily.
Absolutely.
And then there's longevity.
If this thing runs for 10 years, you're looking at 365 billion records to store and manage.
Exactly.
And if we assume an average long URL is about 100 bytes, you can do the math pretty quickly.
That's what, around 50 .8 terabytes of data over that decade.
So a single database server is just, it's out of the question.
Completely.
These numbers tell us from the very beginning that we're going to need things like database sharding and some serious caching.
Okay, so before we dive into the guts of the system, let's define the front door, the API.
Yeah, let's keep it simple.
Standard rest endpoints.
For the write operation, the shortening a client just sends a PUSD request.
Something like a P of one shorten with the long girl in the body.
And the service just sends back the new short URL.
Simple enough.
And for the read path, the high volume part, it's just a GT request to that short alias.
Right.
And that simple GT request forces our first major architectural decision.
It's about the HTTP redirect code.
301 versus 302.
Okay, let's break that down.
A 301 means the URL has moved permanently.
Correct.
And the big implication there is that the user's browser will cache that redirect.
So after the very first click, the browser remembers the mapping.
The next time the user clicks the short link, it doesn't even talk to our service, it just goes straight to the destination.
Which sounds great, right?
Yeah.
Massively reduces the load on our servers.
It sounds perfect.
So why would you ever use anything else?
Data.
Analytics.
If your business depends on tracking every click, you know, for analytics, for attribution, a 301 makes you glide to all subsequent clicks from that user.
Ah, okay.
So that's where the temporary redirect comes in.
Exactly.
A 302 tells the browser,
this is just a temporary location.
So the browser never catches it permanently.
Every single click comes back through our servers first, which means we can log it, we can analyze it, we can track it.
And then we send the user on their way.
Precisely.
So it's a trade off.
301 for reducing server load, 302 for collecting data.
The choice depends entirely on the business needs.
Got it.
So let's talk about how we store that The initial thought is just a hash table, right?
A key value pair, short URL, long URL.
Yeah, conceptually, it's a hash table.
But with 365 billion records, it can't just be in memory.
We need a real database, probably a relational one with columns for say, id, short URL and long URL.
Okay, that makes sense.
Which brings us to the core of the short URL itself, that little string of characters.
We need to figure out how long it needs to be.
Right.
We decided our character set is zero through nine, lowercase a to z and uppercase a to z.
That's 62 possible characters.
So we need to find the smallest length, let's call it n, where 62 to the power of n is bigger than our 365 billion record requirement.
And if you run the numbers, 62 to the power of six gets you about 56 billion combinations.
Not enough falls short of our 10 year goal.
But if you just add one more character, n equals seven, then 62 to the power of seven jumps you up to about 3 .5 trillion unique possibilities.
Way more than we need.
So a seven character string is perfect.
It gives us plenty of headroom for the future.
Okay, so we've settled on seven characters.
Now for the really interesting part, how do we actually generate that unique seven character string?
This is the central design challenge.
There are really two main ways to go about it.
The first one feels pretty intuitive.
You take the long URL, run it through a standard hash function like MB5 or SHA1 and then just chop off the first seven character.
Right, we can call it the hash plus collision resolution method.
But there's an immediate and very serious problem.
Collisions.
You're taking a huge hash and truncating it down to a tiny seven character string.
It's almost guaranteed that two different long URLs will eventually produce the same seven character hash.
And we absolutely cannot have that.
So what's the resolution part?
Well, it's a bit clunky.
When you generate a hash, you first have to check the database to see if it's already been used.
A database read for every single write.
That sounds slow.
It is.
And if it is a collision, you have to do something like append a random number or timestamp to the original long URL and then rehatch the whole thing.
And check the database again.
And check again.
You keep doing that in loop until you find a unique one.
Yikes.
That could get really slow if you have a chain of collisions.
The performance seems unpredictable.
It is a major drawback.
Now you can optimize it.
You can put something called a bloom filter in front of the database.
Okay, let's define that for a second.
A bloom filter is.
It's like a really fast, lightweight bouncer for your database, right?
That's a great way to put it.
It's a probabilistic data structure.
You ask it, hey, have you seen this hash before?
It can tell you one of two things.
Definitely not or maybe.
So if it says definitely not, you can just skip the expensive database check and write the new record.
Exactly.
You only have to hit the database when the bloom filter says maybe.
It saves a huge number of unnecessary lookups.
That makes the collision approach a lot more feasible.
But what's the second option?
The second approach is, I think, a lot more elegant.
It's called base 62 conversion.
Okay.
So this one doesn't hash the long URL at all.
Not at all.
Instead, it relies on having a unique number for every new URL.
Think of it like a primary key from our database, a unum ID.
And since that ID is guaranteed to be unique.
The short URL you generate from it is also guaranteed to be unique.
No collisions.
Ever.
So how does it work?
You take a big base 10 number, like an ID, and just convert it to base 62.
Precisely.
You take a huge ID like, say, 2009215674938.
You run it through the base conversion math and out pops a short string like Sinonite EGA.
The uniqueness is built in.
That sounds so much cleaner.
No collision resolution loops.
No bloom filters.
It is cleaner.
But let's compare them directly.
The hash method gives you a fixed seven -character length, which is nice and consistent.
But you have that whole messy collision logic.
Whereas base 62 is collision -free and simpler to implement.
But it has its own wrinkles, right?
It does.
First, the length isn't perfectly fixed.
As the unique IDs get bigger and bigger, eventually the short URL will have to grow from seven characters to eight.
Although for our 10 -year projection, seven is enough.
Right.
The bigger issue is a potential security concern.
If your unique IDs are just sequential one, two, three, four, then your short URLs will also be predictable.
Ah, so someone could guess the next short URL in the sequence.
That's not good.
It's a real vulnerability.
But we can mitigate that.
So despite that issue, we're going to choose the base 62 approach.
The guarantee of no collisions is just too valuable.
Okay, so decision made, base 62.
This means our right path, our shortening flow, now has a very specific set of steps.
It does.
So step one, a user sends us a long URL.
Step two, the very first thing we do is check our database to see if we've already shortened that exact same URL before.
That's for deduplication, right?
To save space.
Exactly.
If we find it, step three, we just return the existing short URL.
No need to create a new one.
But if it's a brand new URL.
Then, step four, we have to generate a new globally unique ID.
And this is where the distributed nature of the can we?
That would become a massive bottleneck.
A single point of failure.
You would kill our performance.
Instead, we need a dedicated, distributed, unique ID generator.
Something like Twitter's Snowflake algorithm is a common solution.
So Snowflake generates unique IDs across a whole fleet of servers without them having to talk to each other constantly.
Right.
It combines a timestamp, a machine ID, and a sequence number to create an ID that's guaranteed to be unique across the entire system.
It's the engine that makes this all work at scale.
Got it.
So we get our unique ID from our Snowflake service.
Then step five is the base 62 conversion we just talked about.
Yep.
Turn that big number into our seven -character string.
And finally, step six, we write the new mapping, the ID, the new short URL, and the original long URL into our database.
And that's the full right path.
Now we have to make sure the read path is lightning fast to handle that 10 to 1 ratio.
Which means one thing.
Caching.
Lots and lots of caching.
It's non -negotiable.
So let's walk through the redirect flow.
A user clicks a short link.
The request goes to our load balancer.
Which forwards it to one of our many web servers.
Yeah.
The first thing that web server does is check a cache.
Probably a distributed in -memory store like Redis.
It's looking for that short URL, long URL mapping.
Right.
And if it's a cache hit, which it will be for any popular link.
The web server just grabs the long URL from the cache and sends the redirect back to the user.
Super fast, super low latency.
But if it's a cache miss.
Then the server has to do the heavy lifting.
It goes to the main database to find the long URL.
Right.
It fetches the record.
And importantly, before it sends the redirect back to the user, it writes that mapping into the cache.
So the next person clicks that link gets a cache hit.
Exactly.
It warms up the cache.
You mentioned the database earlier and how 50 terabytes is way too big for one machine.
Let's talk about the solution there.
Sharding.
Sharding is essential.
It just means we're splitting our data horizontally across many different database servers.
So instead of one server holding 50 terabytes, maybe we have 100 servers or shards.
Each holding a manageable 500 gigabytes.
Okay.
But if the data is spread out like that, how does our web server know which of the 100 machines to ask for a particular short URL?
We use a consistent sharding key.
The web server takes the short URL, runs a simple hash on it.
And the result of that hash tells it exactly which shard number, which database server holds the data.
So every lookup is targeted to a single machine.
We don't have to search all 100 servers.
Never.
That would be incredibly inefficient.
This way, the lookup is still very fast.
Even with the data spread across a huge cluster.
That really paints the full picture.
So beyond this core architecture, what else do we need to make this a truly robust enterprise grade service?
A few more key components.
First, a rate limiter is absolutely critical.
You have to protect your right API from abuse.
So you'd block a single IP address from sending millions of requests a minute and overwhelming our ID generator.
Precisely.
You also need to think about scaling the different tiers.
The web server tier is stateless, so that's easy.
You just add more machines.
But the database tier, the stateful part, is harder.
That's where you rely on the sharding and replication strategies we discussed.
And finally, you need an analytics pipeline.
We chose the 302 redirect so we could collect data.
So we need a system to actually log and process all that click data to generate business insights.
Fantastic.
So to recap, we've designed a system that can handle billions of URLs.
We chose the base 62 conversion method because it's collision free.
That's powered by a distributed unique ID generator, and the high volume read path is protected by a massive caching layer.
And the main theme has really been about the trade -offs at every step.
301 versus 302, hashing versus base 62.
Every choice has a direct impact on performance, cost, and features.
So here's a final thought for you to chew on, something that brings it all together.
Think about the CKP theorem, consistency, availability, and partition tolerance.
In a distributed system like this, you can't have all three.
So if our system, during some kind of network failure, decides to prioritize availability over perfect consistency, what might that actually look like to a user?
If someone clicks a link that was just created, might they get an error for a few seconds?
Or worse, could they briefly be redirected to the wrong place?
Something to think about.
Until next time, keep digging deeper.