Chapter 9: S3-like Object Storage System Design

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's take a deep dive into the blueprint for massive scale data storage.

We're designing an object storage service tank, Amazon S3.

We have to start with the scale we're dealing with.

S3 first launched back in 2006, and by 2021, it's stored over 100 trillion objects.

That is a scale that is truly difficult to comprehend.

It really is, and that sheer staggering stale immediately defines our mission here.

When we talk about building a service capable of that kind of growth, we're not just discussing capacity.

We're defining these intense non -functional requirements.

Our sources show we need to design for at least 100 petabytes of storage, and this demands a staggering six nines of data durability.

That's 99 .99999%.

And four nines of service availability, so 99 .99%.

Six nines of durability means statistically you might lose one object out of a million over a year, but the industry standard, like S3's 11 nines, aims to lose only one in every 10 billion.

So we're diving into the architecture, the scaling strategies, and most importantly, the hard trade -offs needed to deliver that kind of promise through a restful API.

Exactly, and before we can build it, we really need to know what kind of storage we're building.

All digital storage systems, they generally fall into three categories.

Okay, let's start with the oldest.

Block storage.

This is the most flexible type, right?

It gives you raw blocks of data, presented as a volume, like a classic HDD or SSD.

Yep, it's used by systems that need direct control, like databases or virtual machines.

It often uses these specialized protocols like Fiber Channel or ISCSI.

And building on top of that, you get file storage.

This is what we're all familiar with.

It provides that hierarchical directory structure, you know,

inside of folders, using protocols like SMB or NFS.

It's the standard general purpose solution.

And then finally, the subject of our deep dive, object storage.

This is the newest model, and it makes a really deliberate, really fascinating trade -off.

It sacrifices that raw performance for just colossal scale, extreme durability, and low cost.

It stores data as these flat unstructured objects, and you access everything through a restful API.

Perfect for unstructured data like images, longs, backups, that sort of thing.

That trade -off is absolutely fundamental.

I mean, if you compare them, block storage gives you high performance, but it doesn't scale as well.

Object storage gives you vast scalability and the lowest cost, but that performance is only low to medium.

And this is where the terminology really starts to matter.

Let's nail that down.

First, the bucket.

It's the logical container for objects, and its name has to be globally unique.

Then the object itself, which is the piece of data, the payload, plus its metadata, which are just simple name value pairs.

And crucially, versioning.

This is a bucket level feature that keeps multiple variants of an object.

It's what protects you from accidentally deleting or overwriting something important.

And access is always through a URI, that unique URL identifier that the API uses.

Tying it all together is the SLA, the service level agreement.

The sheer number of nines we mentioned, 11 nines of durability for some systems, that requires designing for failure isolation from the ground up.

You have to ensure resilience, even if an entire availability zone, an AZ just goes down.

So to meet those kinds of durability and scalability goals, we have to stick to some pretty non -negotiable design principles.

The first principle is immutability.

Objects cannot be incrementally changed.

If you want to change one single byte in a one gigabyte file, you have to delete the old version and replace it entirely with a new one.

Right.

And fundamentally the system operates as a key value store where the object URI is the key and that massive data payload is the value.

That immutability ties directly into the access pattern.

Object storage is overwhelmingly optimized for write once, read many times or worm.

Why is that?

Well, because research like the work done by LinkedIn shows that roughly 95 % of all requests are read operations.

That dictates that our architecture has to prioritize read speed and data retrieval efficiency above all else.

And this brings us to what you said is the most crucial architectural separation, the thing that's key to scaling this entire system.

Yes, we mimic that old UNIX file system philosophy by cleanly separating the system into two totally distinct components.

So first we have the meta store that stores all the object metadata things like the object name, its size, the owner, the version ID.

This data has to be mutable because we're constantly updating things like version markers and access permissions.

And then we have the data store.

This is the massive distributed component that stores the actual object data, the payload.

And because of that immutability principle we just talked about, this data never changes once it's written.

So this separation, it simplifies the whole system dramatically, doesn't it?

It lets us choose say high speed transactional databases for that mutable metadata and then these highly durable, low cost distributed systems for the massive immutable data store.

Exactly, you optimize each part for what it does best.

Okay, so we've got the guiding principles of mutability and separating metadata from data.

Let's zoom out and look at the actual architecture handling these billions of API calls.

Where does a request land first?

Every single request first hits a load balancer.

Its job is just to distribute that incoming restful API traffic.

Those requests are then forwarded to the API service.

This is our stateless orchestrator.

And because it's stateless, holding no unique data, we can scale it horizontally just by adding more servers.

That's right.

But before anything serious happens, you need validation.

The identity and access management or IAM service is that central authority.

It validates both authentication, who you are and authorization, what you're allowed to do.

And then below that API service we have our two separated stores.

The metadata service DB, which is our metastore and the data store service.

So let's walk through a critical path.

Uploading an object, let's say script .txt to a brand new bucket.

The process takes several steps and it really highlights that interaction between the mutable and immutable parts.

Okay.

First, the client sends an HTTP PUT request to create the bucket itself.

The API service checks with IAM for right permission.

And if it's authorized, the metadata store creates the bucket entry.

A success message is returned.

Right.

Then the client sends a separate HTTP PUT for the object payload itself.

Once IAM validates the permissions again, the API service sends that large object payload directly to the data store.

The data store persists the payload, which could be gigabytes of data.

And once it's successfully stored, it generates and returns a unique identifier, a UUID back to the API service.

And here's the final crucial step.

The API service calls the metadata store one last time to create the final human readable metadata entry.

This record is what links the object name script .txt to that system generated UUID.

And that's how the entire system knows that the object name we see belongs to the specific payload stored under that UUID.

When we download, the process is just reversed and we're using a flat structure, even if it looks like a hierarchy with names like photos vacation beach .jpg.

Exactly.

So the client sends the GT request.

The API service verifies read access via IMM.

Then the API service queries the metadata store using the object's name to fetch the corresponding UUID.

That's the lookup.

And then the API just uses that UUID to fetch the object data directly from the data store.

And notice that separation again.

The data store only ever deals in UUIDs.

It never sees the human readable names.

The raw data is returned and the API service streams it back to the client.

Now for the deep dive into the data store itself, which is where all that durability comes from.

Internally, the data store relies on three key components.

The data routing service is stateless and handles all the read -write traffic.

But to do that, it needs to know where the data is so it queries the placement service.

The placement service.

You said this is where it gets really interesting

and critical for reliability.

It is.

This service maintains the entire virtual cluster map.

It tracks the physical topology of which server is in which rack, in which availability zone.

This is what ensures that when we replicate data, the replicas are physically separated across different failure domains.

That's absolutely vital for high durability.

And because the placement service is so mission critical.

I mean, if it goes down, the whole system just stops writing, right?

It stops writing.

So it has to be built as a highly reliable cluster.

It's often using consensus protocols like Paxos or Raft among say five or seven nodes.

This guarantees that as long as a majority of those nodes are healthy, the service stays operational and consistent.

And finally, you have the data node.

That's the worker B.

Storing the actual object data, handling replication and sending heartbeats back to report its status.

Okay.

Let's follow the actual persistence flow.

So the data routing service generates a UUID and then it asks the placement service for the location of the primary data node.

Correct.

The routing service then sends the object data directly to that primary node.

And the primary node saves the data locally and then immediately replicates it to two secondary data nodes.

And that forms our standard three copy replication group.

And right here is where we hit our first major trade -off alert.

Ah, consistency versus latency.

So the primary node has to decide when to respond back to the client.

Exactly.

If it waits for a successful storage confirmation from all three nodes, you get the best consistency.

You're guaranteed the data is safe, but you suffer the highest latency.

On the other hand, you could aim for the lowest latency by just waiting for the primary node to persist the data.

That's super fast, but it's risky.

That's eventual consistency and it's probably the worst kind.

It is.

So the most common middle ground is to wait for the primary and just one secondary to confirm the right.

That balances speed and safety pretty well.

This choice made right here in the persistence flow is foundational to the service's entire performance character.

That brings up a really important question.

If a data node is getting millions of small objects like logs or tiny images, storing each one as an individual file seems incredibly inefficient, right?

You'd waste disk blocks and you'd hit file system metadata limits really fast.

Precisely.

The solution is the write -ahead log concept or wall.

Instead of individual files, the data node merges many of these small objects into one large read -write file.

The objects are just appended one after the other, acting like a log.

And when that read -write file hits a certain size, a few gigabytes, it gets marked as read -only.

And a new read -write file is open to take all the new incoming objects.

Right, and to find a small object that's tucked inside one of those massive files, the data node has to maintain a local index.

We'll call it the object mapping table.

This table just uses the object's UUID as the key, and it maps it to the exact file name, the start offset within that file, and the object size.

And because that index is isolated to that specific data node, you can use simpler, reliable databases like Squilite for this local mapping.

It avoids the whole distributed complexity of a large cluster database.

Exactly.

So we've talked about three -copy replication across failure domains, node, rack, and AZ, which gets us to our target of six nines of durability.

But for maximum efficiency and even higher durability, we need to consider erasure coding, or EC.

Erasure coding.

This dramatically reduces storage overhead, doesn't it?

Three -copy replication is 200 % overhead.

You need three times the storage.

Yep.

But erasure coding, with a setup like eight plus four, breaks the data into eight chunks and then calculates four parity blocks.

It distributes all 12 of those pieces across 12 different domains.

So that eight plus four setup only has 50 % storage overhead, but our durability shoots way up to those 11 nines we mentioned earlier.

So what's the trade -off?

There has to be one.

The trade -off is computational cost and speed.

EC requires a lot more computation during the write phase to calculate all those parity blocks, and reads are slower because the data routing service has to reconstruct the object by reading data from at least eight healthy nodes.

Replication is faster for reads and writes, but EC is just far superior for sheer durability and storage cost efficiency.

Okay, one last thing on durability.

Data integrity.

Corruption can still happen, even with perfect replication.

That's where checksums, like MD5, come in.

A checksum is generated for the object and stored as metadata.

When the object is read, the system recalculates the checksum.

If the stored and the calculated values are different, the data is corrupted, and it has to be retrieved from a healthy replica in another failure domain.

Okay, let's move up a level to scaling the metadata.

With hundreds of billions of objects, the metastore won't fit on a single machine.

We have to shard it.

If we shard by bucket name, that seems simple, but you'd get hotspot issues if one bucket becomes really popular.

Exactly, so the optimal strategy, since most operations are based on the object's URI, is to shard by the hash of a combination of the bucket name, object name.

Hashing that full URI distributes the load much more evenly across all the shards.

It makes finding, inserting, and deleting super efficient.

Speaking of URIs, S3 uses prefixes to simulate a directory hierarchy.

But handling a list objects query in a sharded environment seems incredibly complicated, especially when you add pagination with offset and limit.

It is extremely complicated.

If your metadata is sharded by the object name hash, a listing query which needs to return objects sorted alphabetically across all shards has to query every single shard.

Then it has to aggregate the results, sort them, and then figure out the offset for the page the user requested.

Tracking those offsets across dozens of shards is a distributed nightmare.

So the architectural solution is a trade -off for simplicity then.

We denormalize the listing data into a separate specialized table that's sharded only by bucket ID.

Right, it simplifies the listing query.

It's much faster to find everything in one bucket, even though it means accepting suboptimal performance for that specific operation compared to simple point lookups.

Let's quickly detail two crucial advanced features.

First, object versioning.

When you override a versioned object, we don't update the old metadata record.

We insert a new record with a new UUID and a new object version, which is usually a time and E.

The current version is just the one with the largest time and E.

And if you delete an object, we insert a delete marker, which acts as the current version and just returns a 404 error while keeping all the old data safe.

And for uploading truly math of files, the risk of a network failure is way too high for one continuous stream.

The solution is multi -part upload.

Yes, the large object is sliced into smaller parts, uploaded independently, and each part has its own checksum verification or ETag.

The data store doesn't even try to assemble the object until the client sends a final completion request verifying all the uploaded part numbers.

This lets you resume large uploads and you can verify them chunk by chunk.

And finally, storage can't scale forever without some cleanup.

Garbage collection or GC.

How do we reclaim all that space?

We use compaction.

And this isn't an instant deletion process.

The GC service periodically initiates compaction by copying valid active objects from lots of old fragmented read -only files into a few large new clean read -only files.

Once that copy is complete, it updates the object mapping table with the new file locations, and then finally deletes the old fragmented files.

That's how it reclaims the space.

So what does this all mean for the designer?

We've seen that designing S3 -like storage is really about strategic separation.

The mutable metastore handles all the indexing and lookups, while the immutable datastore stores all the heavy payloads.

And we established that durability is achieved either through robust three -copy replication or for that maximum efficiency and reaching 11 nines through the computational power of erasure coding.

And scaling that object count into the trillions requires very careful sharding strategies, usually based on the hash of the full object URI.

And here's the final provocative thought for you to carry forward.

We spent a lot of time discussing the right trade -off between consistency and latency.

Given that this system is overwhelmingly optimized for read, that worm pattern does prioritizing the absolute fastest read speed always outweigh the higher resource and maintenance cost that comes with providing strong immediate consistency across a vast globally distributed object store.

That is the foundational cost benefit analysis that defines every single cloud storage giant out there.

Thank you for taking this deep dive with us.

We hope this shortcut helps you feel well informed about the complex world of S3 -like object storage design.

β“˜ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Object storage systems represent a fundamental architectural approach that prioritizes durability, cost efficiency, and massive scalability over raw performance, making them ideal for archival and long-term data retention. Unlike block storage or hierarchical file systems that emphasize flexibility and rapid access, object storage services flatten the data model into a simple key-value structure exposed through RESTful APIs. Designing such a system involves navigating critical trade-offs between consistency and latency, particularly through decisions about data replication topology. A production system must satisfy stringent requirements including petabyte-scale capacity, six-nines durability targets, and support for operations like bucket creation, object uploading and downloading, and version control with identity-based access governance. The core architecture decouples concerns into four components: an API service layer handling client requests, an IAM module enforcing authorization, a data store managing raw object contents, and a metadata store indexing organizational information. Data durability is achieved through replication across multiple availability zones, with a three-way replication strategy providing baseline protection at the cost of 200% storage overhead, while erasure coding such as eight-plus-four configurations dramatically improves durability bounds to eleven nines while reducing overhead to merely 50%. Corruption prevention relies on checksums embedded throughout the data path and verified at multiple stages of transfer and storage. The system handles heterogeneous object sizes through strategies like merging small objects into larger consolidated files to minimize wasted inode and block capacity. Metadata scaling is accomplished via hash-based sharding on composite bucket and object identifiers, enabling linear growth. Advanced workflows include multipart uploads for efficient transfer of very large objects by decomposing them into chunks, automated garbage collection and compaction to reclaim space from deletions and abandoned operations, and versioning implemented by assigning unique identifiers to each object state rather than destructive overwriting. These design decisions collectively enable a system capable of serving exabyte-scale workloads with exceptional availability and durability guarantees.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML β™₯