Chapter 7: Log-Structured Storage

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to The Deep Dive, the show where we take a mountain of dense information, slice it up, and find the crucial, glittering nuggets of knowledge you need to be truly well informed.

Today, we are strapping on our virtual helmets and plunging into the bedrock of modern data storage.

We really are.

We're getting into the engine room of high -performance databases.

We're talking about log -structured systems.

Yeah, this is a deep dive focused entirely on storage engine architectures, indexing structures, and I think most importantly, the brutal trade -offs you have to make to cope with the sheer, unrelenting volume of data today.

So if you've ever wondered how systems like Apache Cassandra or Google's Bigtable or even RoxDB achieve their phenomenal write throughput, this is where the answers are.

This is absolutely essential listening.

And we're going to start in a maybe a surprising place, the financial sector.

There's this legendary quote, often credited to storage pioneer Pat Helland.

Oh, I know this one.

Accountants don't use erasers or they end up in jail.

Just think about that for a second.

It's perfect.

If an accountant makes a mistake on a ledger, they don't just scrub out the old number.

No, you can.

You can't.

You append a new entry, a new line item that corrects or supersedes or just zeros out the previous one.

The whole history stays right there, intact.

And that simple, immutable principle is the absolute core of what we're exploring today in database architecture.

It's mutability versus immutability.

That is the central conflict in storage design.

OK, so on one side of that conflict, you have the old guard, the mutable structures.

We're talking about the venerable B -trees.

Right.

They're built for reads.

They're optimized to modify records in place.

You find the record, you update its little spot on the disk page, and you're done.

But that first step, the find the record step that has to happen for every single write, and that is its fatal flaw.

Because that means random I .O., slow, expensive random I .O.

And then on the other side, you have the accountant's ledger philosophy.

These are the immutable structures and specifically what we're diving into today.

Log structured merge trees or LSM trees for short.

Their whole philosophy is just don't overwrite data on disk ever.

To change a record, you just hold that change in memory for a bit, and then you append the new version to the end of a file.

Which guarantees sequential writing.

Which is lightning fast on any modern hardware, whether it's an old spinning disk or a new SSD.

So okay, if B -trees are all about fast reads, but they totally sacrifice write speed,

LSM trees are doing the exact opposite.

They are built to maximize write throughput.

Yeah.

But what's the cost?

The cost is complexity.

Yeah.

And latency.

But on the read side, since you never overwrite the old data, the correct version of any given record might be spread out.

You could have three, four, five different versions of it in different files on disk.

So to get the single source of truth, you have to go look in all those places and figure out which one is the newest.

You have to reconcile them.

Our mission today is to thoroughly unpack this thing, the LSM tree.

We need to get how it's built,

understand its complex life cycle, and really dig into the crucial trade -offs it forces on us.

We're talking about the so -called amplification problems.

Read, write, and space amplification.

Because if your system, your business is defined by ever -growing data ingest rates, understanding how LSM trees manage this tension is, well, it's absolutely essential.

All right.

Let's start with section one and dig deeper into that why.

Why immutability?

It seems counterintuitive for a database, right?

A system designed for constant change is relying on files that never change.

It does seem strange at first, but beyond the sheer speed of sequential writes, you get these huge benefits in integrity and safety.

Okay.

Explain that.

How does never a file make it safer?

Well, once an immutable file segment, a data block is written to disk, it's permanent.

It is never modified, so you have an inherent guarantee of data integrity for that block.

But the real win is with concurrency.

Oh, okay.

In a mutable B tree, if one thread is trying to, say, split a node while another thread is trying to read from it, you need incredibly complex, fine -grained locks or latches all the way down the tree structure to prevent corruption.

Right.

Because the very structure of the building is changing while someone is trying to walk through it.

You have to lock all the doors.

Precisely.

But since LSM disk structures don't change, the right path, the process of creating new falls and the read path, the process of reading existing ones, they just don't conflict at the disk level.

So reads can just access the files without worrying about locks on the disk pages themselves.

Which dramatically improves read parallelism and frankly just makes the code so much simpler to write and reason about.

Let's really hammer home this difference in I -O patterns because it's the root of everything.

B trees are plagued by random I -O.

Yes.

On mechanical hard drives, it's just catastrophic.

You're physically moving a disk head that's measured in milliseconds.

Sequential reads are microseconds.

It's orders of magnitude.

And even on modern SSDs, random I -O is less painful.

But it's not free.

The drive controller still has to figure out the physical location and it might have to rewrite entire flash blocks internally.

We'll get to that later.

Yeah, that's the whole log stacking problem.

But for B trees, every read, every update, every single maintenance operation like a node split or a merge, it all requires finding and modifying these fixed size pages at random spots on the disk.

And the cost is just so high.

Even if I just want to change a single byte in my record, the B tree works on fixed pages.

So I

LSM trees just sidestep that whole problem.

They combine two techniques, buffering the latest changes in memory and then using append only storage on disk.

So they batch up a bunch of small random client writes in memory and then write them all out in one big, beautiful sequential firehose blast to disk.

It completely avoids the need to locate the old record on disk for every update.

It's just so much more efficient at saturating the disk's bandwidth.

Okay, so let's try to define this thing.

An LSM tree is a disk resident sorted structure, a bit like a B tree in that sense, but it behaves totally differently.

Where does the merge part of the name come from?

Right, so the structure itself is optimized for that sequential access.

The nodes are usually 100 % full, meaning we don't waste space reserving room for future updates like B trees do.

The log structured part of the name comes from log structured file systems, which also write everything to an append only log.

And the merge.

The merge is the price you pay.

It's the crucial complexity.

LSM trees write these immutable files and then in the background they are constantly merging them together.

It's like a giant continuous merge sort running all the time.

And that merging does two things.

It reclaims space by finally getting rid of the old outdated records.

And it makes future reads faster because you have fewer files to check to find the correct value for a key.

Okay, let's unpack the mechanics then.

Section two.

You said the secret sauce is this idea of buffering.

The LSM tree actively defers writing to disk.

It holds everything in memory first.

It is a critical design choice.

It's how the system converts that chaos of many small random write operations from clients into one large orderly sequential write to disk.

This conversion is the fundamental source of the high throughput that LSM trees are famous for.

Let's see the parts.

Starting with that in -memory component.

The mem table.

The mem table is the front door to the system.

It's immutable in -memory structure like a balanced tree or skip list that takes all incoming reads and writes.

And the genius part is that any update to the mem table has zero disk IO cost.

It's just a memory operation.

That immediately brings up the durability problem.

If the server crashes, all that data in memory is gone.

How do we tell a your write succeeded if it's not on disk yet?

We lean on a classic database concept, the write ahead log or wall.

Okay.

Before we even touch the mem table, the data record is

obtended sequentially, of course, to the wall file on disk.

Only after that wall write is confirmed as durable, do we update the mem table and then acknowledge the write back to the client.

So the wall becomes the source of truth for recovery.

If we crash, we just replay the wall to rebuild the data.

For that efficient merging, the mem table has to keep its contents sorted in memory.

Then when the mem table hits certain size threshold, it gets flushed.

Correct.

And the result of that flush is a new disk resident table.

These are the immutable, append -only files we've been talking about.

Once written, they're only ever used for reads.

They are never, ever modified.

Now, the source material here lays out two different models for how this LSM tree works.

Yeah, this one is conceptually simpler.

You have your memory component and then just one big single disk component.

That disk component looks a bit like a B -tree, but all its pages are 100 % full and read -only.

And when the mem table is ready to flush.

The system has to merge the contents of that in -memory structure with the corresponding parts of the existing tree on disk.

So it reads the old section from disk, merges it with the new data from memory, and then writes that combined result out to a brand new disk segment.

It's sort of like a copy -on -write operation, but amortized over many writes.

That's a great way to put it.

But this process demands really strict atomicity.

During that flush, three things have to happen perfectly.

Okay, what are they?

First, as soon as the flush starts, any new writes have to go to a brand new empty mem table.

Second, during the merge, both the old disk segment and the flushing mem table have to stay available for reads.

And third, the final step, publishing the new merged segment and atomically discarding the old one, has to happen in a single indivisible step.

Figures 7 -1 and 7 -2 in the text really show this.

You see the old structure, then poof, it's gone and replaced by the new merged one.

But this model has a huge drawback, which is why you almost never see it in practice.

High write amplification.

Way too high.

Every single time you flush the mem table, which could be very frequent with high read, you trigger this big merge and rewrite operation on your one disk component.

It just creates too much maintenance overhead and negates the benefits we were trying to get.

Which brings us to what everyone actually uses.

The multi -component LSM tree.

Yes, this is the architecture you see in RoxDB, Cassandra, all the big ones.

Instead of merging into a single structure, you just flush the entire mem table's contents out as a brand new, completely separate disk resident table.

So you just keep adding new files.

You just keep adding new files.

And the immediate consequence is, your read cost goes up.

Because now to find a key, you have to go look in all these different files.

We've solved the right amplification from the flush, but we've just created a read amplification problem.

Exactly.

And the solution to that is the constant background janitor process, compaction.

This is the merge from the name.

This is the merge.

Compaction is the workhorse.

It periodically picks a bunch of smaller disk tables, reads them all, merges their contents, throws out all the old and deleted data, and then writes the clean combined result out to a new, larger file.

Once that's done, it atomically discards the old files it started with.

Figure 7 .3 shows this whole life cycle really well.

Data starts in the mem table, a flush moves it to a bunch of small disk resident tables, and then compaction combines those into a bigger, cleaner, compacted table.

It's a continuous cycle.

And we should probably detail the specific states a component goes through, like in Figure 7 .4.

It all starts with the current mem table, which is taking all the live client writes.

When that gets full, there's an atomic switch.

Right.

That mem table transitions into the flushing mem table state.

It stops taking new writes, but crucially it has to remain readable.

At the same instant, a new, empty, current mem table takes over for new writes.

And that flushing mem table gets written out to what the text calls the on -disk flush target.

Yep.

And that file is incomplete.

It's not part of the active read set yet.

Only when it's fully written to disk does the flushing mem table get discarded.

And that new file becomes a permanent flush table instantly available for reads.

And the durability guarantee comes from the wall, right?

The write -ahead log truncation.

That's the final synchronization point.

The wall segment that corresponds to the mem table you just flushed can only be deleted after that new file is safely on disk.

If you delete it too early and crash, you lose data.

This sequence ensures your recovery guarantees.

And then finally, these newly flushed tables, along with some older ones, get picked up by the compactor, become compacting tables, and are merged into a clean compacted table.

It's this beautiful dynamic dance of data from memory down to disk.

It really is.

And since the lsm tree is append only, the basic write operation is incredibly simple.

An insert and an update are, from the system's perspective, the exact same thing.

We just call them upserts.

You don't have to do a complicated read -modify, right?

You just write the new version.

You just write.

Which is fantastic for performance, but it makes other things, especially deletes, a lot more complicated.

Okay, let's tackle that.

Section three.

Updates, deletes, and reconciliation.

So if a client sends a delete for key x, why can't we just remove x from the mem table?

Because an older version of key x probably still exists in one of the 10 or 20 disk resident tables that are sitting underneath the mem table.

And if we just delete the latest version from memory, the next time someone reads that key.

The read process will scan all the tables, miss it in the mem table, but then find that old value on disk, think it's the current one, and serve it back to the client.

It effectively resurrects the data.

Data resurrection.

It's a failure mode.

So deletes have to be recorded explicitly.

We don't remove data.

We add data.

You insert a special marker.

A special marker entry, yeah.

It's commonly called a tombstone.

It's just a record that says key x was deleted at this time stamp, and you insert it into the mem table just like any other write.

So now that tombstone has the highest time stamp for that key.

When read comes along and sees it, it knows to ignore any older physical records for that key.

It might find deeper in the disk files.

Exactly.

And this gets even more complex when you talk about bulk deletes.

The book details range tombstones, which is a feature you see in systems like Apache Cassandra.

Right, where you can say something like delete from table, where key is between K2 and K4.

How did that work?

Instead of one tombstone, the system writes two special markers, a start and an end tombstone that define that deleted range.

And then the read process has to apply that deletion predicate across all the different files it's reading from.

Which can get tricky, I imagine, if the key ranges in those files overlap with the tombstone range in weird ways.

Absolutely.

The book gives a good example.

Let's say disk table one has keys K1, K2, K3, and K4.

And a newer disk table two has a range tombstone that covers everything from K2 up to, but not including K4.

So when you read, the reconciliation process has to merge those two.

And it has to be smart enough to apply that range tombstone from table two against the actual records in table one.

The final result you show, the user should only contain K1 and K4.

K2 and K3 are hidden.

They're suppressed by that newer range tombstone.

So whether it's an update that creates a new version or a delete that creates a tombstone, the final decision is always made by this reconciliation process.

That's the key.

And the only way it can work is if every single record insert, update, or tombstone carries metadata with it.

Specifically, a timestamp.

And this all happens on the fly during the read.

Dynamically, yeah.

The read process scans all the different sources for a given key.

It looks at all the versions it finds, compares their timestamps, and the one with the highest timestamp wins.

That's the one that gets returned.

All the older shadowed records are just filtered out on the spot.

This brings us to the biggest challenge, though.

How to read efficiently.

Section four.

If a key could be in the current mem table, or the one that's currently flushing, or any of the, say, 20 disk tables, how do you find it without just randomly checking 21 different places?

You can't.

But you have to remember the golden rule, the invariant of the whole system.

Every single one of those tables, in memory and on disk, is internally sorted by key.

And that's what saves us.

That is the superpower.

It allows us to use a really efficient algorithm called the multi -way merge sort algorithm for all our lookups and

visualizing it is really the key to understanding how LSM reads work.

We have N different sources, so N iterators, one for each.

How do we combine them into a single sorted result without just loading everything into memory?

We use a simple but really clever data structure right in the middle of it all.

A priority queue, which is usually implemented as a min heap.

Okay.

And that queue holds, at most, N items.

Specifically, it holds the head element, the smallest current key from each of our N active iterators.

So if we have, say, five disk tables and two mem tables to check, that's seven iterators.

The priority queue holds just seven keys, one from each source.

Exactly.

And the nature of a priority queue, or a min heap, is that the element at the very top is always guaranteed to be the globally smallest key across all seven of those sources.

So we just pull that smallest key off the top of the queue.

That's the next item in our result set.

We return it to the client, and then immediately we have to replenish the queue.

You go back to the specific iterator that you just took the key from.

And you pull its next key and put that into the priority queue.

The queue then resorts itself, which on average takes O log N time, where N is the number of files we're merging.

It's this continuous cycle of pull, replenish, resort.

And it ensures the final output stream is always perfectly globally sorted.

I love that O log N cost.

Even if you have 100 tables you're merging, the cost of finding the next single item is logarithmic, which is incredibly fast.

And the memory overhead is just O N because you're only ever storing one item per table.

Figure seven to five in the text, give the perfect visual for this.

Imagine you have three separate sorted streams of data, all feeding into the central sorting funnel.

That's a priority queue.

And it spits out one single perfectly sorted stream on the other side.

That's the magic of the merge iterator.

This is also where the conflict resolution happens, right?

If the queue happens to get two records with the exact same key, one from the mem table and one from roll disk table.

That's the signal.

That's when reconciliation is triggered.

As those two same key records come through the queue, the system compares their metadata.

It checks the timestamps, finds the newest one.

Sends that one to the output.

And just discards the older shadowed version on the spot.

So the entire merge iteration process handles both the sorting and the reconciliation at the same time.

It's incredibly elegant.

Okay.

So we've established that the price for high write speed is this potential for high read amplification because we have so many disk tables.

And that's why compaction isn't just a nice to have is the indispensable continuous maintenance that keeps the system from grinding to a halt.

It is the LSM trees version of defragmentation and garbage collection all rolled into one.

So section five, maintenance and compaction strategies.

Let's just review the physical process one more time.

It's that same merge iteration algorithm we just talked about.

But instead of being triggered by a client's read query, it's triggered by the system's own internal logic.

It picks a set of source tables.

It iterates over their entire contents, applies all the merge and reconciliation rules to get rid of duplicates and old data.

And then it sequentially writes that clean, dense result out to a brand new larger table.

And there are some really important operational constraints here.

First, while you're compacting those tables, they still have to be available for reads, right?

They do.

The system can't just go offline for maintenance.

And second, you need to have enough free disk space to temporarily hold both the old source files and the big new merged file before you can finally delete the old ones.

Space amplification in action.

Exactly.

And for performance, you usually want to run multiple compactions at the same time, but they have to be working on totally separate non intersecting sets of tables to avoid any kind of corruption.

Let's talk about a really critical detail here.

Tombstone preservation.

Why can't we just get rid of a tombstone as soon as it does its job during a compaction run?

This goes right back to that fear of data resurrection.

It all depends on where data lives in the storage hierarchy.

Imagine you're compacting a new table from level zero, let's call it L0, with a slightly older table from level one or L1.

L0 has a tombstone for key X.

L1 has the old physical value for key X.

The compaction runs, sees the tombstone as newer, and correctly does not write the old value of X to the new output file.

So far, so good.

But what if a much older version of key X still exists way down at the bottom in L3, which wasn't part of this compaction?

I see.

If we drop the tombstone during that L0 -L1 compaction, a later read could miss the key in the new merged file.

Keep searching, find that ancient copy in L3, and bring it back from the dead, because the tombstone that was supposed to kill it is now gone.

Exactly.

The system has to be 100 % certain that a tombstone has outlived all possible older versions of that key across all files at all levels before it can be safely thrown away.

So how do systems manage that?

Different strategies.

RocksDB, for example, only allows a tombstone to be discarded once it has been compacted all the way down to the bottom -most level of the storage hierarchy.

At that point, you know for sure no older versions can exist.

And Cassandra?

Cassandra uses a time -based approach, a garbage collection grace period.

It's a configurable amount of time, say 10 days.

It assumes that in a distributed system, that's enough time for the tombstone to replicate everywhere and make sure all nodes have seen the deletion before it's safe to actually purge it.

Alright, let's get into the actual strategies for how this compaction is scheduled and managed.

Strategy one, leveled compaction, which was really popularized by RocksDB.

Leveled compaction is very structured.

It organizes all the disk tables into indexed levels, L0, L1, L2, and so on.

The highest index number is the oldest, bottom -most level.

Data starts at L0, which is where the fresh fleshes from the memtable land.

Right.

And a key architectural point is that at L0, the tables can have overlapping key ranges, because they're just raw dumps from memory.

But once you get to L1 and beyond, there's a strict new rule.

A strict invariant.

The key ranges of the tables within a single level must be non -overlapping.

So when data moves from L0 to L1, the compactor has to split it up into these discrete, sorted, non -overlapping files.

Correct.

And then a typical compaction will take a file from, say, L1, and merge it only with the specific files in the next level down, L2, that its key frame actually overlaps with.

Figure 7 -6 in the text shows this migration process really clearly.

And why is that non -overlapping rule so valuable?

It's a huge win for reducing read amplification.

If you're searching for a specific key, once you get past L0, you can look at the key range metadata for L1 and know that at most one table in that entire level could possibly contain your key.

You can skip all the other tables in L1.

You can skip all of them.

It makes point lookups and range scans so much faster than having to every single file.

And these levels usually grow exponentially in size.

L1 might be 100 mb, L2 a gigabyte, L3 10 gigabytes.

It naturally pushes older, colder data down into the bigger, lower levels.

Okay, so that's the structured approach.

What about the simpler one, strategy 2, size -tiered compaction?

Size -tiered is, well, simpler.

It just groups files together based on their size.

It merges a bunch of small files to create a medium -sized file, then merges a bunch of medium files to create a big file.

This can lead to higher write amplification, right?

Because the same piece of data might get rewritten over and over as it moves up through the size tiers.

It can, and it has this one famous vulnerability you mentioned, table starvation.

Explain that.

Starvation happens when the biggest, oldest tiers just stop getting compacted.

Imagine you have a lot of deletes or updates to recent data.

The compactions at the smaller tiers run just fine, but the resulting files are small.

They never get big enough to trigger a compaction with the huge files in the top tier.

So that top tier just sits there, getting older and messier, full of outdated records and tombstones that never get cleaned up.

And your read costs go through the roof because you always have to check those big, messy files.

It often gets so bad that systems have to implement a forced compaction just to go in and clean up those starved levels.

And finally, strategy 3 is a more specialized one, time window compaction.

Yeah, this is perfect for time series data or any data where you set a TTL a time to live.

Instead of worrying about size or levels, it just groups data into files based on time windows like one file for every day's worth of data.

And the optimization here is huge.

It's massive.

If the system knows that an entire file contains data from, say, 31 days ago, and your TTL is 30 days, it doesn't have to read and rewrite a single record from that file.

It just deletes the entire file.

It drops the whole thing.

It completely avoids the compaction and rewrite overhead, which is an enormous win for managing historical data that just needs to expire.

Okay.

We've spent this whole time talking about the problems and tradeoffs of log -structured storage.

Let's formalize this now in section 6 with the amplification tradeoffs.

There are really three core costs that you have to accept if you want that high write throughput.

First is read amplification.

That's the need to check multiple tables and the mem table just to find a single record.

And that directly translates to higher read latency.

Second, write amplification.

The fact that every byte the client writes gets rewritten on disk, you know, 5, 10, 20 times as it gets compacted down through the levels.

And third, space amplification.

The system has to temporarily store all those redundant copies of the same key, all the old versions plus all the tombstones, until compaction finally gets around to cleaning them up.

And this whole balancing act is summed up perfectly by what the text calls the ROM conjecture.

The ROM conjecture is just a great mental model.

It defines the fundamental limits here.

It says you have three factors you can tune.

Read overhead, R.

Update overhead,

U, which is basically write cost.

And memory overhead, M.

And the core idea is there's no free lunch.

You cannot optimize all three at once.

If you push down on two of them.

The third one is guaranteed to pop up.

And if you look at B -trees versus LSM freeze through this ROM lens,

their design goals become incredibly clear.

Totally.

B -trees read optimized.

Their R is low.

But to get that, they have a high update cost because of all the random I -O.

And a high memory and space overhead because they have to reserve all that empty space in their nodes.

So U and M are high.

And LSM trees are the mirror image.

They're write optimized.

Their U is incredibly low.

That's the whole point.

And their initial M can be low too.

But that comes at the direct cost of a high R, a high read overhead, because of the multiple tables you have to check.

The entire game of designing a good LSM tree is about finding clever ways to reduce that R.

It's also really important to distinguish where the write amplification comes from in each design.

It's not the same mechanism.

Not at all.

In a B -tree, write amplification comes from overwriting the same physical page on disk over and over again.

In an LSM tree, it comes from the migration of data.

A record is written once in L0, then it's rewritten when it moves to L1, then rewritten again when it moves to L2.

It's a predictable migration cost, not a random overwriting cost.

The RUM conjecture just forces the engineer to admit that all system performance is a managed imbalance.

LSMs accept that they are trading latency for throughput, and then they dedicate the rest of their architecture to fighting the consequences.

Let's move from the high -level theory to the gritty implementation details.

Section 7.

Those disk resident tables we keep talking about.

In most systems, they have a common name.

Sorted string tables or SS tables.

It's a very descriptive name.

They hold key -value pairs sorted by key, and they're written sequentially in a single pass during a flush or compaction.

Right.

And an SS table is usually made up of two physical files.

You have the data file, which is just the raw key -value pairs, concatenated one after another, and then you have the index file.

And the index file is what gives you fast lookups.

Exactly.

The index is often a little B -tree or hash map that's built just for that SS table.

It holds the keys and a pointer, usually a byte offset, to where the full data record starts in that big data file.

So the index doesn't have the actual value, just the location.

Just the location.

This gives you logarithmic lookup time to find the starting point, but once you're there, you can do incredibly efficient sequential range scans just by reading forward from that offset in the data file.

That seems straightforward enough for the primary key.

But what about secondary indexes?

They're essential for almost any real application.

You can't have a single global secondary index because the underlying data files are immutable.

You'd have to constantly rebuild it.

So instead, systems use what are called SS -stable -appatched secondary indexes, or SASI, like in Cassandra.

So the secondary index is tied directly to the life cycle of its SS -stable.

Precisely.

When a mem table flushes and creates, say, SS -stable number 5, the secondary index for the data in SS -stable 5 is built and written to disk at the exact same time.

It's as immutable as the data it's indexing.

Of course, you also need a separate mutable index in memory for the data that's currently in the active mem table.

You do.

This coupling ensures consistency.

Now, let's circle back to the most critical optimization for fighting that read amplification, the Bloom filter.

The problem is, even with key range metadata, you might still have to open a file just to find out the key you're looking for isn't in there.

Which is a huge waste of I .O.

A Bloom filter is this beautiful, space -efficient, probabilistic data structure that gives you a very fast maybe or definitely not.

And the key property is that it can give you false positives.

It might say a key could be in the file when it isn't.

But it guarantees no false negatives.

That's the magic.

If the Bloom filter says the key is not in the file, it is 100 % guaranteed to be absent, and the system can just skip reading that file entirely, saving a huge amount of work.

Okay, walk us through the mechanics from figure 7 to 7.

How does this work?

It's surprisingly simple.

When you build the SS -stable, you also create a big array of bits, all initially set to 0.

Then you take every key in that SS -stable, and you run it through several different hash functions.

Okay.

Each hash function gives you an index, a position in that bit array.

And for each key, you go to those positions, and you flip the bit from 0 to 1.

And then for a lookup.

You take the key you're searching for and run it through the exact same hash functions.

You look at those positions in the bit array.

If all of the bits at those positions are 1, then you say, okay, the key is probably in this file.

I need to go check.

But if even one of those bits is still a 0, then you know with absolute certainty that key was never added, because if it had been, that bit would be a 1.

And you skip the file.

It's brilliant.

And since the SS -stable is immutable, you know the exact number of keys up front.

So you can tune the size of the bit array and the number of hash functions to get a very low predictable false positive rate.

That's incredible.

Let's move back up to the in -memory component, the mem table.

It needs to keep its keys sorted.

While you could use a B -tree, the book highlights another structure that's very popular here, the skip list.

Yeah, skip lists are often preferred because they're just simpler to implement than a fully balanced binary tree, especially a concurrent one.

They give you the same logarithmic search and insert performance, but use probabilistic balancing instead of all those complex tree rotation algorithms.

How does a structure that's basically a linked list get logarithmic performance?

It's a linked list with multiple levels.

It has these express lanes.

When you insert a new node, you randomly assign it a height.

Taller nodes get linked into the higher level, express lane lists.

Figure 7 -8 shows this really well.

So when you search, you start on the highest express lane.

You start at the top.

You follow the pointers on that highest level until you're about to overshoot the key you're looking for.

Then you drop down one level and continue your search forward from there.

You keep dropping down levels until you find your key on the bottom most level, which connects all the nodes.

It's effectively halving the search space with each level, which is how you get that O log N performance.

Exactly.

And while making them concurrent is tricky, it's very doable with things like compare and swap operations and careful memory management techniques like hazard pointers to make sure one thread doesn't delete a node that another thread is currently looking at.

Okay, last couple of details.

Physical disk access.

LSMs use the page cache just like B -trees, but because the data on disk is immutable, you don't need any extra locks for concurrent threads reading the same cached page.

Which is a nice simplification.

But there's one key difference.

Unlike B -trees, LSM data records are often not page -aligned.

What does that mean?

It means a single record can start on one 4KB page and end on the next one, as figure 7 .9 shows.

The SS table index just uses an absolute byte offset, not a page ID.

This improves data density, but it means that to read one record, you might have to load two cages from disk.

It's a small trade -off.

And what about compression?

It seems easy to apply since you're writing the SS table in one big pass.

It is, but it creates a similar alignment problem.

A compressed block of data is almost never gonna be exactly 4KB.

So you can't just jump to a page index.

So how do you find the data?

You use another layer of indirection.

Figure 7 .10 shows this.

It's called an offset table.

This little table, created when the SS table is written, stores the on -disk offset and size of each compressed block.

It maps the virtual uncompressed page numbers to the actual physical location of the compressed data.

Okay, so far we've been laser -focused on ordered log -structured storage.

The whole system is built around sorted files and merging.

But now in section 8, we're gonna look at a radical alternative.

How?

What if you just didn't sort the data at all?

Right.

Some designs decide to just go for maximum write speed by storing records in whatever order they arrive, like get rid of the mem table, the sorting, all of it.

We call this unordered log -structured storage.

And by doing this, the main storage file itself just becomes the wall.

There's no need for a separate one.

Exactly.

And the classic example of this design is bitcask, which was the storage engine for the React database.

So in bitcask, you just append every new key value pair sequentially to a log file.

But if the data is completely unsorted on disk, how on earth do you find anything quickly?

You keep a massive index in memory.

Bitcask maintains an in -memory hash map called the keydir.

And this keydir stores a pointer, the file ID, and the byte offset to the physical location of the latest version of every single key in the database.

Wow.

So a write is just one append to the log file and then one quick update to a pointer in that in -memory hash map.

That's it.

Pure sequential I .O.

And a read for a single key is incredibly fast.

You do one lookup in the in -memory keydir, which gives you the exact location.

Then you do one disk seek to that spot and you read the value.

No merging, no multiple files to check.

What about cleaning up old versions?

Compaction still has to happen, right?

It does.

The Bitcask compactor reads through the old log files and for every record it finds, it checks the keydir.

If that record's location is the one currently pointed to by the keydir, it's live.

If not, it's stale.

And it only writes the live records to a new clean log file.

And then discards the old ones.

Figure 711 shows this cleanup process.

It's very simple and effective.

So super fast point queries, super fast writes.

What are the trade -offs?

Because there always are.

Two huge ones.

First, that keydir, which has an entry for every single key, must fit entirely in WAM.

That puts a hard limit on the number of keys your database can hold.

Second, and this is a big one, because the data is in random order on disk, Bitcask cannot do efficient range scans.

To scan a range, you'd have to look up every key in the range in the keydir and then do a separate random disk IO for every single one.

It completely defeats the purpose.

And that limitation is what led to the design of Whiskey.

It was an attempt to get the best of both worlds.

Right.

Whiskey's big idea was key -value separation.

It recognized that usually keys are small, but values can be huge.

So Whiskey keeps all the small keys along with a pointer to their value in a standard sorted index lsm tree.

But the large values themselves are stored separately in these big, unordered, append -only files called VLOGs or value logs.

So you get the efficient sorting and range scans of an lsm tree for the small keys.

And you get the super high write throughput of append -only storage for the big, bulky values.

That's the goal.

And compaction is much more efficient because you're only rewriting the tiny key pointer pairs in the lsm tree, not the huge values over and over again.

But, as figure 712 makes clear, the read cost for range scans comes roaring back.

It does.

You can scan the sorted key lsm tree very efficiently.

But for every key in your range, you then have to follow its pointer and do a random IO into the VLOGs to fetch the actual value.

A scan of a thousand keys means a thousand random IOs.

And garbage collecting those VLOGs sounds like a nightmare.

It is the main challenge.

The VLOGs themselves have no idea what data is live and what's stale.

So to clean them up, the system has to scan the entire key lsm tree just to figure out which value pointers are still active.

It's a huge engineering and performance hurdle compared to a standard lsm compaction, which just compares timestamps.

Okay, let's zoom out.

We've talked a lot about the complexity inside one of these systems.

But now, in section 9, we need to talk about concurrency and a much bigger system level problem.

The long stacking issue.

Right.

The concurrency challenges inside the lsm tree are all about managing the view of the data atomically as it moves from mutable memory to a mutable disk.

Let's just quickly review those synchronization points from earlier because they're so critical.

One, the memtable switch.

When a memtable is full, rights have to atomically switch to the new one.

Two, flush finalization.

The new disk file has to atomically replace the old memory view.

And three, wall truncation.

You can only delete the wall segment after the data is fully durable on disk.

If you get the sequence or atomicity of any of those wrong, you can lose data or serve inconsistent results.

Exactly.

So even one log -structured system is complex.

But the problem is, it's not operating in a vacuum.

It's usually stacked on top of other log -structured systems.

The application's lsm tree is running on a file system log, which is running on an SSD,

which has its own flash translation layer, ftl log, inside the drive itself.

It's logs all the way down.

Let's focus on that bottom layer, the ftl and the SSD.

Why does the hardware itself use log -structured storage?

It's because of the physics of flash memory.

First, it buffers small random writes from the OS into bigger, more efficient physical writes.

And second, flash has these weird constraints.

You can only write to a page that's been erased.

And pages are grouped into large blocks that have to be erased all at once.

So if I just want to update a single page, the ftl can't overwrite it.

It has to write the new version of that page to a new empty spot and just mark the old one as invalid.

Figure 713 shows this mapping.

Exactly.

And when the SSD starts running out of empty blocks, it has to do its own garbage collection.

It finds a block that has a mix of valid and invalid pages,

copies all the still -live pages to a new block, and only then can it erase the old block to reuse it.

And as Figure 714 shows, that relocation process is itself a form of write amplification, but at the hardware level.

It is.

And on top of all that, the ftl is also managing wear leveling, trying to spread the writes out evenly across all the memory cells to make the drive last longer.

So we have the application's lsm tree doing compaction and creating write amplification.

It's writing to a file system log, which is writing to the ftl log, which is doing its own garbage collection and creating more write amplification.

And none of these layers are talking to each other.

They're all operating blindly.

And this leads to really specific performance problems, like the misalignment shown in Figure 715.

Totally.

A single large file written by the database might span several blocks at the file system level.

Later, the database deletes that file.

It thinks it did one clean operation,

but the file system and the ftl underneath just see a bunch of fragmented holes appear, which might force them to do a lot of extra work rewriting neighboring data that was perfectly fine.

And even if the database thinks it's doing perfect sequential I .O.

The OS might be interleaving multiple sequential streams at once.

Maybe you're writing the wall at the same time as you're writing a new S table.

As Figure 716 shows, those two parallel streams can get fragmented at the physical hardware level and turn into something that looks a lot more like random I .O.

Which is why the general advice is always align your partitions to the hardware and try to align your application rights to the SSD's native page size.

It helps, but it doesn't solve the fundamental problem.

So the log stacking problem makes it clear that just layering individually optimized systems on top of each other leads to global inefficiency.

The solution, as Section 10 calls it, is mindful stacking.

You need the layers to coordinate.

And a fantastic example of this is the LMS storage subsystem, which is used under a data structure called the BOO tree.

LMS stands for Latch Free Log Structured Access Method Aware.

That last part, access method aware, seems like the key.

It is.

The BOO tree, as we've discussed before, doesn't want to find nodes in place.

It appends little update records called delta nodes.

Right.

It chains them off the base node.

Now, a dumb storage layer would just write those delta nodes to its log sequentially, maybe scattering them all over the disk.

But because LMS is aware of the BOO tree semantics when its own internal garbage collection runs, it knows what it's looking at.

It knows those are all updates for the same logical node.

So it can be smart.

It can take all those separate little delta nodes, apply them to the base node, and create one single, clean, fully updated base node.

And then it writes that one consolidated node sequentially to a new location.

The benefit is huge.

It reduces the number of IOs you need to read that node later.

It saves space by getting rid of the old deltas, and it reduces physical fragmentation.

It's all because the storage layer was given just enough semantic information from the layer above it.

So that's the software coordination solution, but there's a more radical alternative.

Just get rid of the layers entirely with open channel SSDs.

This is the take full control approach.

The application engineer basically bypasses the file system and the FTL, and talks directly to the raw flash.

It sounds incredibly complex.

The application now has to do its own wear leveling and garbage collection.

It is immensely complex.

But the payoff is that you eliminate at least two layers of redundant logging and write amplification.

You are in complete control.

And the goal is perfect alignment with the hardware.

These systems, sometimes called software -defined flash, they expose this asymmetric IO interface, where the size of the write unit that your application uses is designed to match the hardware's erase unit size, the block size.

For a log -structured system, this is the holy grail.

Because your application's logical segments are now physically aligned with the hardware's erase blocks.

Which can dramatically reduce that hardware -level write amplification caused by the FTL having to move live pages around.

The application is in full control of data placement, which is the ultimate form of mindful stacking.

So to wrap this all up, this deep dive has really shown that log -structured storage isn't just one database technique.

It's this pervasive design pattern you find all the way up and down the system stack, from your database engine right down to the silicon in your SSD.

And LSM trees are really the definitive, modern answer to just overwhelming write volumes.

They make a very explicit trade.

They accept higher read amplification and space overhead in exchange for truly superior write performance.

But they need all this sophisticated background machinery compaction, bloom filters, just to keep those costs from spiraling out of control.

They absolutely do.

And the ultimate tension we've explored today is this constant dilemma in system design.

Do you hide all that complexity behind simple, generic APIs, which is what the file system and the FTL try to do?

Or do you expose the underlying details to allow for better coordination and higher performance like we saw with LMA and Open Channel SSDs?

It seems like we often think of software as being totally independent of the hardware it runs on.

But the biggest performance gains often come when the software becomes acutely hardware -aware, when it tailors its logical operations to the physical realities of the

The programmer recycles a flash, the page alignment of disks, all of it.

So if getting the best performance requires this deep integration, here's a final thought for you to mull over.

What other low -level hardware design concepts should software engineers be paying closer attention to in order to maximize performance across the entire system stack?

A fascinating question to explore as you continue your own deep dive into database internals.

We hope this has helped you understand how to build smarter, faster systems.

Thanks for joining us.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Log-structured storage systems fundamentally reimagine data persistence by adopting an append-only architecture where new data is continuously written sequentially to disk rather than modifying existing records in place. This architectural shift, particularly evident in Log-Structured Merge Trees, creates a write-optimized alternative to traditional B-Tree based systems by leveraging the observation that sequential disk writes vastly outperform random access patterns on modern hardware. Incoming writes initially accumulate in a memory-resident structure called a memtable, typically implemented using skip lists to maintain sorted order while enabling efficient retrieval and range scans. Once the memtable reaches capacity, it is flushed to disk as an immutable Sorted String Table, creating a multi-level storage hierarchy where successive layers contain progressively larger files with non-overlapping key ranges. The inevitable accumulation of overlapping files triggers compaction, a background process that merges multiple files to eliminate redundancy, reclaim disk space, and prevent query performance degradation as the number of levels increases. Deletion operations in immutable systems present distinctive challenges, resolved through tombstone markers that logically remove keys without physically erasing data until compaction explicitly purges obsolete records. Read operations must navigate this distributed structure through multiway merge procedures that reconcile records across multiple memtables and disk levels to locate the most recent version of any requested key. The RUM conjecture characterizes the inherent trade-offs between read amplification, update amplification, and memory amplification, describing how optimization in one dimension necessarily increases costs in others. Bloom filters serve as probabilistic data structures that dramatically reduce read amplification by probabilistically determining key absence, eliminating unnecessary disk seeks while maintaining bounded false positive rates. Specialized variants including Bitcask for latency-critical point operations and WiscKey's key-value separation strategy demonstrate how system design adapts to different access patterns and storage hardware characteristics. Understanding log stacking, the interaction between multiple log-structured layers across database engines, filesystems, and SSD flash translation layers, proves essential for avoiding unexpected write amplification cascades and achieving predictable performance in production environments.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 7: Log-Structured Storage

Related Chapters