Chapter 3: Storage & Retrieval Systems

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are moving past the abstract data models, the theoretical side of things, and getting our hands dirty.

We're going right into the trap of database design,

storage, and retrieval.

There's this great old German proverb I love that really captures the core problem here.

It says, if you keep things tidily ordered, you're just too lazy to go searching.

That's perfect.

It is, isn't it?

That's the conflict we're exploring today.

How do we actually structure the files on a disk so we can store new data, but also, and this is critical, get it back quickly later on?

Exactly.

We're shifting our focus completely.

We're moving away from the data model, which is how your application sees the data, and we're looking at the internal mechanics.

We're talking about the storage engine.

And this is something application developers really need to care about, right?

It's not just for the database admins.

Oh, absolutely.

Because your choice of storage engine and how you tune it, that basically sets the performance ceiling for your entire application.

A system that's built for, say, fast online transactions is just fundamentally different from one built to run huge analytical reports.

So our mission today is to compare the two main families of these storage engines.

First, the old standard, the B -tree, which you find everywhere, and second, the newer log -structured systems that are optimized for writes.

And then we're going to tackle that essential divide in the data world, transaction processing versus analytics,

OLTP versus OLEP.

We're giving you the shortcut to understanding the physical architecture that powers, well, pretty much everything you use.

Okay, let's unpack this, starting with the world's simplest database.

All right.

Imagine a key value store.

You can implement it with just two basic commands.

Let's call them deep set and dpject.

Okay.

So when you call deep set with a key and a value, all it does is it appends that pair to the very end of a file.

That's it.

Like a diary, you just write the next entry at the bottom of the page, precisely.

And that operation is incredibly fast.

It's a sequential write, which is the fastest thing you can do on almost any storage, whether it's a spinning disc or an SSD.

This append -only file is what we call a log.

It's brilliant for writing.

But, and this is a huge but, the consequence for reading is, well, it's catastrophic.

It's terrible.

To find the latest value for any given key, your DevoGOT command has to read the entire file from the very beginning to the very end.

Performance just falls off a cliff.

We call that an ON operation.

Wait, okay.

For anyone who hasn't heard that term since a college computer science class, what does ON mean in plain English here?

It just means the time it takes to find your data grows linearly with the size of the database.

So if you double the amount of data, the end -year lookups will take twice as long.

If your file's a terabyte, a single lookup could take forever.

It's just, it's not scalable.

Right.

And that's why we have to introduce some complexity.

We need an index.

The index is that extra piece of metadata.

It's a signpost, really.

It lives off to the side, and it tells you where to find your data on the disk so you don't have to scan everything.

And this brings us right back to that proverb, the trade -off.

It's the core trade -off.

Maintaining an index adds overhead every single time you write, because when you write the data, you also have to update the index.

So a good index makes your reads way faster.

Grammatically faster.

But every single index you add makes your writes a little bit slower.

It's an unavoidable tax.

Okay.

So let's zoom in on that first family of engines, the log -structured ones, which are built to embrace that fast sequential write.

The simplest example is a system like BitCask, which uses an in -memory hash index.

This is a really clever, simple idea.

The system keeps a hash map, just a standard data structure, entirely in your computer's RAM.

The key in your database maps to a byte offset, which is just a number telling you its exact location in that big log file on disk.

So a read becomes super fast.

You check the map in RAM, which is almost instant, get the location, and then do just one single disk seek to grab the value.

Yep.

Performance is great.

But there's a big catch.

All of your keys have to fit in memory.

The values on disk can be huge, that's fine.

But the index itself, all the keys, has to fit in RAM.

Which is perfect for certain workloads, right?

Like if you're counting clicks on a set of a few million URLs, you're just updating the same keys over and over again, very quickly.

Exactly.

Now, if we only ever append to that log file, we're going to run out of disk space pretty fast.

Because you're keeping every old version of every value.

Right.

So the solution is to break the log into segments.

Once a file segment reaches a certain size, you close it, it becomes immutable or frozen, and then a background process kicks in, called compaction.

Compaction is like, what, spring cleaning for your database?

That's a great way to put it.

A background thread just goes through these old frozen segments, throws away any duplicate or old keys, keeps only the most recent value for each one, and then merges them all into new, clean, and much smaller segment files.

And what about deleting data?

You can't just remove something from the middle of an append -only file.

You can't.

So instead, you write a special deletion marker.

It's called a tombstone.

When the compaction process sees that tombstone, it knows to discard that key and all its old values for good.

Okay, this is all very clever, but it feels like we're still missing a piece.

This brings us to the next big evolution.

SS stables or sorted string tables.

Yes, and this is where the design gets really powerful.

The idea is simple, but the consequences are huge.

We now require that within each segment file, all the key -value pairs have to be sorted by key.

And why is that so important?

Well, it unlocks two massive advantages.

First, merging becomes incredibly efficient.

Think about the mergesort algorithm.

You can take several huge sorted files and merge them into one new sorted file without loading them all into memory.

You just read them side by side.

And the second advantage.

This one seems like the real game changer.

It is.

It allows for sparse indexing.

Because the data on disk is sorted, your in -memory index no longer needs to know the location of every single key.

Ah, okay.

So it's like an index at the back of a book.

It doesn't list every word, just the important ones and the page number.

Exactly.

Your index might only store an offset for say one key every few kilobytes.

So if you're looking for the key JFK and your sparse index has pointers for Jackson and Johnson, you just jump to the Jackson offset and scan forward a tiny bit.

That scan is super fast and the memory savings are enormous.

So this whole system where new writes go into an in -memory tree called a mem table, which then gets flushed to disk as a sorted SS table, that whole architecture has a name, right?

It does.

It's known as a log structured merge tree or LSM tree for short.

And of course, to make sure you don't lose the data in that in -memory mem table if the power goes out.

You need another log, a write ahead log.

You always need another log.

Every write goes to that disk log first just to be safe so you can rebuild the mem table after a crash.

Okay.

That brings us to the other side of the coin.

The second and much older family of storage engines, B -trees.

This is the standard, right?

Post -gresco, MySQL, most traditional databases use this.

They do.

And B -trees have a totally different philosophy.

They are the tightly ordered part of your proverb.

They don't use logs and segments.

Instead, they break the database into fixed size pages, usually about four kilobytes.

And their main operation is to update data in place.

They overwrite pages.

The structure is a tree, as the name implies.

You start at a root page, which has keys and pointers to child pages.

You follow the pointers down based on the key range you're looking for until you hit a leaf page, which has the actual data.

Right.

And the number of pointers on a single page, the branching factor, is huge, often hundreds.

This means the tree is very wide and not very deep, usually only three or four levels deep, even for massive databases, which gives you very fast, predictable lookups.

O log.

Oh, but if you're overwriting data in the middle of a file, that sounds risky.

What happens if the power cuts out midway through an update?

You'd get a corrupted file.

That's the key danger.

And that is why B -tree systems absolutely need a write -ahead log or wall.

It's even more central to their design than in an LSM tree.

So every single change to the B -tree before it touches the main data pages has to be written to this separate append -only wall first.

If the system crashes, that log is used to restore the B -tree back to a consistent state.

No exceptions.

Here's where it gets really interesting.

We've laid out the two designs.

So who wins?

Let's talk performance and that hidden killer of modern storage,

write amplification.

Write amplification is a critical concept, especially for SSDs, which can literally wear out.

It's this effect where one single write from your application results in multiple actual writes to the physical disk.

For a B -tree, that would be the write to the wall and then another write to the actual page, right?

And maybe more if a page has to split.

Exactly.

And for an LSM tree, the same piece of data might get rewritten over and over again during compaction and merging.

So based on that, LSM trees seem to have a clear advantage for write -heavy workloads.

They do.

Their whole design is based on fast sequential writes.

They generally have much higher write throughput than B -trees.

They also tend to compress better and create smaller files.

This is why systems like Cassandra or SillaDB use them.

They're built to ingest enormous streams of data as fast as possible.

But there's a downside.

A big one.

The compaction process can be unpredictable.

It runs in the background, but it uses disk .io, and it can interfere with live requests, causing these weird latency spikes.

And if your writes come in too fast, compaction can fall behind, the number of segments balloons, and your read performance just dies.

Whereas B -trees are the opposite.

They're predictable.

Much more predictable.

Each key exists in exactly one place.

Reads are consistently fast.

This is why they're still the standard for transactional databases where you need those strong guarantees and low predictable latency.

This feels like it's leading us to a fundamental split, not just in technology, but in what we're even using the database for.

That's the real pivot point.

We have these two completely different workloads, and trying to serve both with one system is a recipe for mediocrity.

You're talking about OLTP versus OLFP.

Exactly.

OLTP, online transaction processing, is what you think of for a normal web application.

It's user -facing.

Lots of small random reads and writes looking up one user's shopping cart.

The bottleneck is disk -seek time.

You need indexes to find that one little piece of data instantly.

And then there's OL8, online analytic processing.

Which is for business intelligence.

These are huge complex queries that scan millions or billions of rows to calculate aggregate sums,

averages.

What were our total sales by region last quarter?

The bottleneck there is pure disk bandwidth.

You just need to shovel as much data as possible through the system.

And the industry standard solution is just to not mix them.

Don't mix them.

You build a separate data warehouse, which is a database optimized purely for analytics.

You use a process called ETL Extract, transform loadable data from all your OLTP systems into this one central place.

And in that warehouse, the data is often structured in a special way, like a star schema.

Right.

You'll have a giant central fact table that's all your events, like sales.

And it's surrounded by smaller dimension tables that add context.

Who, what, where, when.

Your customer dimension, your product dimension.

But that fact table could have hundreds of columns.

And a typical analytic query might only need, what, four or five of them?

Exactly.

And that is the fundamental flaw of a traditional row -based database for analytics.

It stores all 100 columns for a single row together.

So you get the five columns you need, you have to read and process all 100.

It's a massive waste of IO.

Which brings us to the big innovation for analytics, column -oriented storage.

The idea is just so elegant.

Instead of storing a whole row together, you store all the values from a single column together.

All the product IDs are in one file.

All the quantities are in another.

All the dates in a third.

And if your query only needs those three columns, it only reads those three files.

The IO savings must be gigantic.

They're huge.

And it gets better.

Because the data in a single column is often very repetitive, it becomes incredibly compressible.

You can use clever techniques like bitmap encoding.

How does that work?

Well, say a column has 10 different possible values.

You create 10 separate bitmaps, one for each value.

Each bitmap is just a long string of ones and zeros showing which rows have that value.

And then complex queries become simple math.

They become bitwise logic.

A query for product equals A and store equals B is just a lightning fast bitwise A and D operation between two of those bitmaps.

It is orders of magnitude faster than a traditional scan.

And this column -based layout is also a huge win for the CPU, right?

Not just the disk.

It enables something called vectorized processing.

The query engine can operate on these tight compressed chunks of column data, all in the CPU's fastest cache, using special instructions.

It turns the CPU into this highly efficient data processing pipeline.

We should mention, though, that if you sort the data, say, by date, that order has to apply to the entire row across all of those separate column files.

That's a key point.

And some advanced systems, like Vertica, will even store the same data sorted in several different ways to optimize different kinds of queries.

And how do these column stores handle writes?

Because updating a bunch of compressed column files sounds like a nightmare.

It is a nightmare.

So what do they do?

They go right back to the LSM Tree strategy.

They batch up new writes in memory, and then flush them to the on -disk column format in large, sorted batches.

The log -structured approach makes a sneaky comeback.

And that brings us to the end of this deep dive.

We've really covered the four cornerstones here.

The OLTP versus OLEP split,

the B -trees update -in -place model, the log -structured merge -trees append -and -compact model, and finally, the bandwidth efficiency of column -oriented storage.

And understanding these structures is just.

It's essential for any developer or architect.

It's what helps you choose the right tool for the job.

Are you building a transactional system?

A B -tree is probably your best bet, ingesting massive amounts of event data.

You should be looking at an LSM Tree.

So what does this all mean?

Let me leave you with a provocative thought.

The logic of both B -trees and LSM Trees, all this focus on pages and segments and compaction, it's almost entirely designed around the limitations of slow, spinning -platter hard disks.

Right.

It's all about minimizing slow, random I -O.

So what happens to all of this architecture when technologies like huge in -memory databases and, more importantly, non -volatile memory NVM become standard?

That's the billion -dollar question.

If you have memory that's persistent, that doesn't lose data when the power goes off, and it's as fast as this very concept of a disk I -O bottleneck, does it disappear?

We might be on the verge of having to rethink database architecture from the ground up.

Interesting future to consider.

Thank you for diving deep with us today.

We hope this helps you build your next system just a little bit better.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Efficient storage and retrieval mechanisms form the foundation of modern database systems, with distinct architectural patterns emerging based on workload characteristics. OLTP systems prioritize fast, individual transaction processing through indexing structures designed for rapid point lookups and updates. Log-structured storage leverages append-only sequential writes paired with in-memory hash indexes to achieve high write throughput, though this approach requires periodic compaction to reclaim space and consolidate overwritten entries. The LSM-Tree advances this model by organizing data into sorted segments called SSTables, enabling sparse indexing and efficient background merging of data segments. This design excels at write-heavy workloads but incurs write amplification, where data passes through multiple compaction cycles before final storage. B-trees represent the dominant indexing approach in production systems, employing a page-oriented structure that organizes information into fixed-size blocks arranged in a balanced tree hierarchy. This design guarantees predictable lookup performance through a bounded number of disk seeks, though maintaining correctness during concurrent updates requires write-ahead logging mechanisms and explicit concurrency controls. OLAP systems address fundamentally different requirements, demanding optimized retrieval of large aggregated datasets across many records for analytical queries rather than transactional updates. Data warehousing separates analytic workloads from transactional systems through extract-transform-load pipelines that populate dedicated analytics databases. These systems typically employ star schemas, where a central fact table references multiple dimension tables, organizing high-dimensional data for efficient exploration. Column-oriented storage proves particularly advantageous for analytic queries by storing all values from a single column contiguously, dramatically reducing the I/O bandwidth required when accessing only relevant fields. This columnar layout enables superior compression techniques such as bitmap encoding and run-length encoding, transforming storage density while facilitating vectorized processing on modern processors. Despite write performance challenges in purely columnar architectures, hybrid approaches integrate LSM-tree mechanisms for handling recent changes. Query performance accelerates further through materialized aggregates including precomputed views and data cubes, which precalculate dimensional rollups to minimize computation during common reporting operations.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 3: Storage & Retrieval Systems

Related Chapters