Chapter 3: File Formats

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today we're doing something a little different.

We are, uh, intentionally avoiding the glamour.

Yeah, no distributed systems today.

No high level SQL query optimization,

no massive sharding architectures.

We are going underground.

We're going deep into the foundational layers of what makes a database durable, what makes it fast in the first place.

We're talking about the physical file format itself.

Exactly.

That's right.

I mean, if you want a database that is truly reliable, one that can actually survive a sudden crash and is fast enough to handle high transaction volumes, you really can't treat the disk as just, you know, an endless bucket.

You can't just throw data at it.

No, you have to become what we're calling the on disk architect.

You have to design the internal organization of the bytes down to the, I mean, the very last detail.

And our mission for this dive is really focused on that core engineering challenge, the one that all persistent storage systems face, especially indexing structures like B -trees.

How do engineers successfully bridge that, that massive complicated gap between volatile main memory, which is fast and flexible,

and then the persistent organized storage on a disk?

It's a huge gap.

We're going to unpack the fundamental principles you need to build efficient, mutable on disk data structures.

And we're drawing specifically from our source material on database internals.

And the key distinction here is, well, it's profound and it really defines the entire problem space.

When you, as a typical application developer, access data in main memory, those accesses are largely transparent.

The OS handles it for you.

Exactly.

Thanks to the operating system's virtual memory management, you rarely have to manually worry about pointer offsets or physical memory locations, but accessing the disk.

Totally different story.

It requires explicit system calls, specifying precise byte offsets inside a file, and then on top of that, manually interpreting that raw binary data.

If you want reliability and speed, you have to design the data layout specifically for the disk.

And that means taking over the resource management the OS typically does for you.

Okay.

So let's start there.

Let's establish the motivation for why we even need these incredibly detailed, highly specific, and let's be honest, often complex file formats.

I think the source material draws an excellent analogy here.

It's like working in unmanaged memory, like low level C programming.

Oh, that's a perfect comparison.

It is.

Designing a file format is fundamentally very similar to using a low level language without any of the standard library conveniences.

You're handed a fixed block, your page or your file, and you have to slice and dice it exactly how you want.

You're on your own.

Completely.

You're dealing with fixed size structures,

explicit allocation and deallocation within that block, and you have to do manual pointer tracking to make sure everything stays consistent.

There's no built -in malloc or free that's going to magically tidy up after you.

And I think this leads to the most crucial difference, the one that really dictates all the design choices that follow.

On disk, unlike in managed memory, database systems have to handle fragmentation and garbage collection completely by themselves.

In main memory, we can usually just allocate more memory pretty quickly, and we don't worry as much about contiguous segments or internal fragmentation because the operating system is constantly moving things around for us.

Correct.

When you free memory in a high level language, the system just handles the cleanup.

It typically merges adjacent free blocks.

But on disk, if a data record gets deleted, the space it occupied is simply marked as free.

It just leaves a gap.

A hole in your page.

A hole.

And that gap remains useless until you, the database system architect,

explicitly design a mechanism, a tracker, an availability list, some kind of garbage collector to track and reclaim that space.

So this is why the data layout on disk is so paramount.

Absolutely.

If you want an efficient disk resident data structure, you have to lay out the data in ways that allow immediate quick access.

This means designing compact binary formats and ensuring that serialization and deserialization, that translation between structured data and raw bytes, is basically instantaneous.

OK, so let's put this back into the context of B -trees, which are the fundamental mutable data structure for most relational databases.

The core complexity really boils down to page management, doesn't it?

100%.

The high -level theory of a B -tree, the balanced searching, the basic splits and merges, that's all necessary, but it's not sufficient for a practical implementation.

The real engineering work is on the page.

It's all in the page.

How you compose data structures within that fixed -size page, how you navigate between pages, and how you place and update all the internal pointers.

Since B -trees are designed to be highly mutable—I mean, records are constantly being inserted, updated and deleted—the file format has to be exquisitely designed to manage these physical changes without sacrificing search speed or, you know, wasting space to fragmentation.

The physical layout is the performance engine.

OK, we've set up the challenge.

It's manual resource management on -desk.

But before you can even worry about the high -level architecture of pages, we have to zoom way in down to the absolute foundation.

Binary encoding.

How does a simple number or a string actually look when it's physically written to disk?

These are the primitives, right?

The absolute atoms of your data.

Every key, every value, whether it's an integer, a timestamp or a big text field, it has to be serialized, first converted into an interpretable sequence of bytes before it can be written to the file.

And then deserialized on the way back.

And then deserialized translated back into the original data when it's read.

And the efficiency of those two operations, that really sets the upper bound on your database's performance.

Hashtag, tag, tag, tag, A, primitive types and endianness.

OK, let's start with fixed size values.

Most numeric types, a byte, a short, an int, a long, they're all represented as fixed size values.

You know, 8, 16, 32, 64 bits.

But even here, with something that seems so simple, we immediately hit a major architectural headache.

The endianness problem.

Yeah, endianness is all about the sequence of bytes when you're dealing with any multi -byte value.

And consistency here is, it's non -negotiable.

If the machine that writes the data uses one byte order, and the machine that reads it uses a different one, the number you get back is completely garbled.

Flipped and forant.

Totally flipped.

OK, so we've got the two major types, big endian and little endian.

Can you walk us through the difference with the concrete example?

And remember, you can't see a diagram here.

Sure, certainly.

Let's take a 32 -bit integer.

In hexadecimal, let's say it's 0xAABTCCDD.

So here, AA is the most significant byte, the MSB.

It carries the highest numerical weight.

And DD is the least significant byte, the LSB.

Got it.

OK, so big endian stores the MSD first at the lowest memory address.

If the starting address is A, then address A holds AA, A plus 1 holds BB, and so on.

So it's like reading a number left to right.

The most important digit comes first.

Exactly.

It's very intuitive.

Network protocols often default to big endian for that reason.

And then little endian just reverses that order.

It reverses it.

Little endian starts with the LSB, the least significant byte at the lowest address.

So using the same value, address A would hold DD, A plus 1 would hold CC, and the most significant byte, AA, is stored last.

Most modern Intel and AMD processors are little endian.

And the implication for database files is, well, it's immediate and it's severe.

If a database file is written in big endian and you try to read it on a standard little endian machine without any translation,

the machine will start by reading DDD first.

So instead of seeing 0xAABBCCD, it reads 0xDDDCCBBAA.

Which is a completely different number.

It's catastrophic data corruption.

And this is why database systems cannot ignore platform specifics.

The source material mentions Rocks SDB as a practical example.

They and many others have to include platform checks, something like little endian.

If the platform's native byte order doesn't match the required byte order for the file format, they have to use a transformation function.

A byte swapper.

A byte swapping function, often called endian transform, it explicitly reads the sequence of bytes in reverse order to correctly reconstruct the value.

This routine is a fundamental, mandatory part of the file reading pipeline.

You have to have it for correctness and portability.

Hashtag tag tags tag B.

Floating point numbers, IEE 754.

Okay, so moving on to float and double types.

They're also fixed size, you know, 32 or 64 bits, but their internal structure is a lot more complex than a simple integer.

Right.

They adhere to the very rigorous IEE 754 standard, and this standard defines how they're stored using three dedicated parts.

The sign, the exponent, and the fraction, which is also called the mantissa.

So for a single precision 32 -bit float, how do those bits break down?

You get one bit for the sign, so positive or negative.

You get eight bits for the exponent, which determines the magnitude, you know, how big or small the number is, and then the remaining 23 bits are for the fraction, which determines the precision.

So unlike a straightforward integer where the bits are just a direct power of twosome, the floating point structure partitions those 32 bits very carefully.

It's all to maximize the range and precision of non -integer numbers.

And while standard libraries usually handle the encoding and decoding for you, the crucial conceptual trade -off for the database architect is understanding the nature of the value itself.

What do you mean by that?

A floating point number is fundamentally an approximation of the real number.

It's not an exact value.

This is a vital contrast with integers, especially if you're dealing with financial data or geometric calculations or any comparison where exact equality is required.

Hashtag, tag, tag, tag, see.

Variable -sized data, strings and arrays.

Okay, now we get to the real challenge for file architecture.

Variable -sized data.

Numeric types are neat and fixed, but strings, JSON documents, arrays of bytes, they change size all the time.

How do you serialize that efficiently?

Well, if you don't know the size of an item, you don't know where the next item starts.

And that makes sequential reading slow and random access is almost impossible.

The standard approach, the one preferred in high -performance binary formats, is the Pascal string format.

Okay, walk us through the structure of a Pascal string.

Why is it so much better for on -disk persistence?

A Pascal string is serialized in two key adjacent parts.

First, you get a fixed -size integer that specifies the total length of the data that's about to follow.

And then immediately after that, you have the raw byte data of the string itself.

So like size plus data.

Exactly.

For example, if we use a 16 -bit integer for the length, the format is size.

Ute 16 plus data, byte size.

And since 16 bits can store up to 16 ,535, you instantly know the maximum possible length of that string.

And the performance advantage there is just massive.

Oh, it's huge.

It delivers over 100 of all our performance for length lookups.

To find out how long the string is, you only have to read the first two bytes.

It doesn't matter if the string itself is 10 bytes long or 60 ,000 bytes long.

You avoid iterating through the entire content.

Which is a huge win.

And if you're working in a modern programming language, you can read the size, immediately jump to the calculated end of the string, and then pass that precise memory slice directly to the string constructor.

That high -speed slicing capability is indispensable.

And this is in sharp contrast to the classic C approach, the null -terminated string.

Right.

Null -terminated strings rely on the reader consuming data byte by byte until it hits a special end of string symbol, usually a null byte, zero.

Which is simple to implement.

Incredibly simple in C.

But checking the length requires sequential iteration.

It makes random access and quick length checks impossible.

And in database structures where you frequently need to compare the length of keys before you even decide on a full byte -by -byte comparison, null -termination is generally a severe performance bottleneck.

You just avoid it and optimize storage engines.

Hashtag, tag, tag, D, bitpack data.

Efficiency via compression.

OK.

Efficiency is everything.

Wasting space means slower reads and slower writes.

We mentioned how using a whole byte for a boolean, which only needs a one or a zero, is just incredibly wasteful.

And the technique to combat this is bitpacking.

Bitpacking.

It's crucial for compaction.

Instead of using eight bits for one boolean, developers will batch up to eight boolean values into a single byte, dedicating one bit to each flag.

This drastically cuts down on overhead, especially in structures where you might have hundreds of records per page.

And this same philosophy extends to things like enums and flags.

Exactly.

Enums are for low cardinality values.

Values that only have a small number of possible states.

Like you know, a beechy node type can only be a root, an internal, or a leaf node.

So you just represent those as small integers.

Right.

Zero x arrow, zero zero by zero, one zero by zero, two.

They stay compact and are really easy to read.

And flags let us represent complex, non -mutually exclusive state, all within a single numeric field.

So you can pack a bunch of different settings into one integer.

Yes.

We can combine multiple binary settings, like whether a value is stored in line, if the overflow data into one integer variable, it acts like a register holding many separate pieces of information at the same time.

Okay.

There's a crucial implementation detail here.

The rule that all flag masks must be based on powers of two, so one, two, four, eight, and so on.

Why is that specific rule mandatory?

Because a power of two is the only way to guarantee that the mask has exactly one set bit in its binary representation.

For example, four is zero one hundred in binary.

If you used a non -power of two mask, like three, which is zero zero one hundred in binary, you risk overlapping the positions of two different flags.

So powers of two ensure that when we combine these flags using a bitwise OR, each flag occupies its own unique non -overlapping position, and that separation is key to correctly isolating them later.

Okay.

So let's get into the low -level algorithms for managing these packed bits.

If I want to set a flag, how does the system do that?

So you set a bit for a specific flag.

You use the bitwise OR operator, the symbol, combined with the flag's power of two mask.

The OR operation ensures that if the bit is currently zero, it becomes one, and if it's already one, it just stays one.

And to unset it, to clear the bit, it's a little more complex.

It involves negation.

That's right.

To unset a bit, you use the bitwise AND operator in CoPro, combined with the bitwise negation of the mask.

The bitwise negation flips every bit in the mask.

So if the mask is zero zero zero zero zero zero one hundred, the negation is one one one one one one zero eleven.

Exactly.

And when you AND that negated mask with your flag's variable, every bit except the one you targeted remains untouched.

But the targeted bit is guaranteed to become zero.

It effectively clears the flag.

And finally, testing if a flag is set.

To check if a flag is set, you just perform a bitwise AND between the flag's variable and the mask, and then you check if the result is non -zero.

If it's non -zero, it means the mask overlapped with the set bit in the variable, which tells you the flag is active.

This surgical manipulation of individual bits is really the foundation of compact binary storage.

OK, we've established how to encode the bytes.

Now let's zoom out again to the full file.

We have to take those primitives and organize them into meaningful structures that can live permanently on disk.

And the first critical design decision you have to make is your addressing strategy.

For mutable, index -based storage structures, the files are always split into same -sized pages.

This is the cornerstone of predictability.

Why is that uniform page size so crucial for optimizing disk access time?

Because disk access is inherently slow.

We have to minimize the number of explicit system calls we make.

If every page is, say, 8 kilobytes, the database buffer manager knows that to read page N, it just needs to calculate a simple fixed offset, N times 8 PB.

A constant time calculation.

Right.

It bypasses the need for the database to track complex internal file fragmentation at the high level.

And this fixed block size aligns perfectly with how operating systems handle disk I .O., optimizing for batch reads and writes.

So the overall file structure follows a pretty standard pattern, then.

It does.

A file typically begins with a fixed -size header and sometimes ends with a fixed -size trailer.

These sections hold metadata that you need to access instantly, like the file version, the magic number, maybe pointers to the root nodes at the index, and then the vast majority of the file, the body, is just composed of these uniform pages.

Hashtag tag fixed schema benefits.

Okay, and if the database is relational and it enforces a fixed schema, a consistent number, order, and type of fields for all records, that gives storage engineers a massive advantage.

Oh, a huge advantage.

It allows for an extreme reduction in the stored data overhead.

If the schema is fixed, you don't need to repeatedly store field names or data types alongside every single record.

You just use positional identifiers?

Exactly.

The system knows the first four bytes are the employed, the next three bytes are the date, and so on.

This positional dependency significantly cuts down on the storage required per record, making the file highly compact.

Hashtag tag hashtag handling mixed fixed variable data.

But even with a fixed schema, you have to account for records that have a mix of fixed -size fields, like integers, and variable -size fields, like strings.

The layout has to allow quick access to all of them.

And this internal record design is critical.

The professional approach is to strictly group all the fixed -size fields together at the head of the structure.

Like in the company directory example.

Fields like employee, tax number, date, gender, they all get fixed, pre -computed space at the very start of the record layout.

And then immediately following that fixed section, you put the variable -size fields, like the first and last name bytes.

And the key optimization is how we find those variable fields quickly.

Yes.

We don't want to just rely on calculating their position based on all the preceding data.

That would involve a bunch of dependent calculations.

Instead, we use the fixed -size area to store metadata about the variable fields.

So you store the length of the variable fields inside the fixed part.

Exactly.

You store first name length and last name length within that fixed header section.

So if you want to retrieve the last name, you don't have to sequentially read the first name.

You can jump to the fixed header, read the length of the first name, use that to calculate the exact offset, and then just slice the bytes for the last name.

It's one fast operation.

It's all about decoupling the metadata from the data itself.

Precisely.

That decoupling guarantees fast random access within the record, even if the records themselves are complex.

And if you store the offset itself in the fixed header, that's even faster.

It's a direct address jump.

Hashtag tag tag building hierarchies and navigation.

So file formats, as we're seeing,

involve composing these structures hierarchically.

Yeah.

Bytes becoming fields, fields becoming cells, cells becoming pages, and pages making up the file.

Right.

And this entire architecture relies on some kind of high -level navigation.

Since a typical database file has many distinct parts, index pages, data pages, different storage regions, the header itself, you need a reliable way to quickly find the start offsets of those parts without scanning the entire multi -gigabyte file.

Which is where a high -level lookup table comes in.

Correct.

Whether this lookup table lives in the main file's header or maybe in a separate file, its whole purpose is to aid immediate navigation.

It points directly to the start offsets of the major file segments, allowing the database to jump to the root node or any specific data region instantly when it opens the file.

Without that map, you'd be stuck with linear access time to find your starting point, which is just intolerable for performance.

We now arrive at the core architectural challenge within the B -tree.

The page structure itself.

Pages are these fixed -size units, usually 4 to 16 kilobytes, that are used for all operations.

And it's important to remember, in this context, the B -tree node and the physical page are often used interchangeably.

An internal non -leaf node holds keys and pointers to child pages, while a leaf node holds the actual key and data pairs, hashtags, tag, tag, a, basic page structure, fixed -size records.

So the earliest and simplest page organization was just simple concatenation, especially for fixed -size records.

Yeah, this approach was really simple to conceptualize.

It was just a linear sequence of elements packed together.

If it's an internal node, it might be a sequence of pointer key, pointer key, and so on.

And if the keys and pointers were all fixed size, this worked reasonably well for sequential reading.

But in a highly mutable database, the downsides become obvious almost immediately.

They're devastating to performance.

If you need to insert a new key or update an existing one anywhere except at the very end of the page, you are forced to physically shift or relocate all the subsequent data to make space.

Linear operation.

A linear operation proportional to the amount of data remaining, which quickly becomes unacceptable under high rate load.

And second, and maybe more critically for modern data, this simple concatenation approach just cannot efficiently handle variable -size data.

It requires a much more flexible architecture, hashtag, hashtag, B, the problem of variable -size data and fragmentation.

OK, let's go a bit deeper on fragmentation.

If we delete a 500 -byte record from the middle of a page, that space is just left behind.

It is.

And if the next record you insert is only 400 bytes, a 100 -byte gap remains unused.

Over time, as records are inserted and deleted with varying sizes, the page becomes peppered with these tiny useless gaps.

That's internal fragmentation.

And even splitting the page into fixed segments doesn't really solve it.

No, that just wastes space unless your record size is perfectly divisible by the segment size.

This cumulative waste reduces the effective capacity of the page and forces page splits to happen much earlier than they need to.

And the really big constraint that drives this solution here is pointer stability.

If we just rewrite the page and move records around to get rid of the gaps, we break any external references pointing to them.

Exactly.

Pointers referencing these records from outside the page, say an index pointing to a data record on a leaf page, those have to remain stable.

If we move the physical location of the record, we'd have to update all those external pointers, which is computationally prohibitive and a nightmare for concurrency control.

So the ultimate constraint is this.

We need dynamic, flexible space management without changing the reference used by external pointers.

Hashtag tag tag C, introducing the slotted page structure.

And the answer to this trifecta of challenges, variable size, fragmentation, and stable pointers is the slotted page structure.

This is also known as a slot directory or a slot array.

This is really where the engineering brilliance lies.

The slotted page achieves a crucial decoupling.

Imagine the page is physically divided into three main logical regions.

First, you've got a fixed size header for metadata.

Second, a cells region, which holds the raw physical variable size data records.

And third, the pointers or slot directory region, which is a fixed size array holding and offsets that point into that cells region.

OK, so let's describe this clever spatial architecture for the listeners since they can't see the diagram.

Imagine the page as a horizontal block.

The fixed header is anchored on the far left side.

Immediately after the header, the slot directory begins and it grows inward toward the center of the page.

From left to right.

But crucially, the actual data cells start at the far right edge of the page and they grow inward from right to left to meet the pointers.

So the free space is that volatile middle ground, that gap between the two growing regions.

Correct.

When a new record is inserted, its physical data cell is simply appended on the right, eating up space from that free region.

A new offset pointer is added to the slot directory on the left, also eating up free space.

The real genius, though, is what happens when you delete a record.

Right.

Deleting a record only requires you to mark the pointer in the slot directory as deleted.

You might set the offset to zero or just remove the entry entirely.

The actual physical data cell can be left in place at least initially.

And this is what satisfies the constraint of stable references.

It does.

External references use the stable slot ID, the index number in that pointer array, which now acts as an indirection layer.

It abstracts away the cell's unstable physical location.

So the slotted page gives us minimal overhead, easy space reclamation through defragmentation, and a dynamic layout where external references are stable, even if the data inside the page moves around.

It lets us reference records without relying on their physical addresses.

If the page gets heavily fragmented, we can just start a defragmentation process, physically move the cells around freely, consolidate the free space, and we only have to update the internal offsets within the slot directory.

The external index pointer, which only cares about, say, slot ID number five, never needs to know the difference.

And that stability is absolutely critical for simplifying complex things like transaction rollback and crash recovery.

OK.

Let's refocus on the structure of the cell itself, the actual encoded data record inside the slotted page.

We need to maintain an assumption of uniformity within a page for efficiency, don't we?

We have to.

The assumption is that all cells within a given page are homogeneous.

They're all either fixed size or variable size, and they all contain the same basic structure.

For example, they're all key only, or they're all key value pairs.

And why is that uniformity so important?

It allows us to store the structural metadata, like the common format, or whether values are variable size just once in the page header, instead of duplicating that overhead in every single cell.

It's a huge space saver.

Hashtag tag tag key cell layout for internal nodes.

So for an internal B -tree node, the cell only needs a key and a pointer to a child page.

How is that layout optimized?

Well, since the key can be variable size, the cell needs to clearly delineate its parts.

So we structure it by grouping the fixed size fields first, an integer for the key size, and an integer for the child page ID.

And these are followed by the variable key bytes themselves.

And again, it's that strategy of grouping fixed fields first that allows for predictable access.

Precisely.

We can access the key size and page ID instantly using fixed, pre -computed offsets relative to the start of the cell.

Only the variable key data requires a calculation based on that key size field we just read.

This design is fundamental to keeping key lookup operations fast and deterministic.

Hashtag tag tag key value cell layout for leaf nodes.

Now leaf nodes are more complex because they hold the actual data record, the value, instead of just a pointer.

So what extra fields do we need to manage this extra data?

The key value cell has to track the size of both the key and the value.

The structure typically starts with a byte for flags, followed by fixed size integers for both key size and value size.

And these are immediately followed by the variable bytes for the key and then the variable bytes for the data record itself.

So by putting both length indicators in the fixed header, we know exactly where the boundaries of both variable parts lie before we even start reading them.

Exactly.

Hashtag tag tag page ID versus cell offset.

Okay, we've used the terms page ID and cell offset a lot.

It's really crucial to clarify their distinct roles in the overall file architecture.

It is.

The page ID is the file -wide identifier.

It's used to locate child nodes and to navigate across the entire multi -gigabyte index file.

But critically, the page ID is not the raw physical file offset.

It's an abstraction.

It gets translated to the actual physical offset on disk using a high -level mapping table that's managed by the buffer manager or the page cache.

And you need this level of indirection because file segments might be moved or relocated by the underlying operating system.

And in contrast to that, the cell offset is highly localized.

Yes, the cell offset is a page local pointer.

It's always relative only to that page's start offset.

And since a page is a fixed small size, you know, 4KB or 8KB, we can use a much smaller cardinality integer, maybe a 16 -bit short integer, to represent the offset.

Which keeps the representation compact.

It keeps it compact, saving precious bytes on disk, which adds up massively when you have billions of cells.

Hashtag tag tag locating variable data within a cell.

So if we have a complex cell storing both a variable key and a variable value, how do we efficiently extract those components?

It's all about calculated skipping.

You rely entirely on the information in the fixed header.

Once you access the cell using its slot offset, you read the fixed header.

To locate the key, you skip the fixed header bytes and you read key size bytes.

Simple enough.

And to locate the value, you skip the fixed header plus the key size bytes and then you read value size bytes.

The file format is engineered so that we always have the precise size information required to surgically slice the variable subparts and reconstruct the original data record in constant time.

Let's circle back to the dynamic heart of the B -tree.

The slotted page.

We establish the growth direction.

Cells growing inward from the right, offsets growing inward from the left.

Right.

The diagram in the source, figure 3 -3 -6, confirms this architecture.

Cells are appended to the right side, consuming the right edge of the free space.

Cell offsets are kept on the left, consuming the left edge of the free space.

And that free space is just dynamically shrinking in the middle.

Hashtag, tag, tag, maintaining key order, logical versus physical.

This architecture leads to what I think is the most important insight of the slotted page design.

The physical layout of the data cells and the logical order of the keys are completely decoupled.

This is the genius stroke.

It really is.

Physical storage is prioritized for speed.

Cells are just appended in insertion order, which requires minimal effort.

But the logical sorted key order, which is necessary for efficient searching with a binary search, is maintained strictly by sorting the list of cell offset pointers in the slot directory.

Okay, let's use the insertion example from the source material to trace the movement.

We insert Tom, then Leslie.

Okay.

Physically, the Tom cell is stored first, at a lower offset.

Leslie's cell is appended after Tom, at a higher offset.

However, if we demand alphabetical logical order, so Leslie before Tom, the offset array will contain the pointer to Leslie's cell first, followed by the pointer to Tom's cell.

So the physical arrangement is irrelevant for searching.

Totally irrelevant.

The sorted offset list dictates the logical path.

Okay, now we insert Ron.

Ron's data cell is just appended to the physical end of the cell's region, but the logical order has to be preserved.

Leslie, Ron, Tom.

Exactly.

The physical cell for Ron is tacked onto the free space boundary.

In the slot directory, however, we have to find Ron's correct alphabetical position.

Since Ron goes between Leslie and Tom, the existing pointers after that insertion point, so Tom's pointer, are physically shifted one position to the right to create a gap for Ron's new pointer.

And Ron's pointer is inserted into that gap.

Right.

The array now reads, Leslie's pointer, Ron's pointer, Tom's pointer.

This ensures that a binary search across the offset array will always lead us to the correct logically sorted key, even though Ron's physical data cell might be placed far away from Leslie's and Tom's, hashtag, hashtag, managing fragmentation and reclaiming space.

So this decoupled design allows for easy deletion without physical movement, which addresses our initial fragmentation problem.

When an item is deleted, the cell is just marked as deleted, leaving a gap.

But we can't just ignore that space forever.

No, we have to track it.

And this is done using an availability list, sometimes called a free blocks list.

This structure, which can live in memory or be stored in the page header, keeps meticulous track of the start offsets and sizes of all the freed memory segments within the page.

So if the page has occupied cells and a bunch of small, fragmented free spaces,

the availability list is the map that points into those gaps.

Right.

And when a new cell arrives, the database has to consult this list and find the optimal place to store the new data.

And this is where fixed strategies come into play.

The choice here really impacts long term efficiency.

OK, let's spend a minute on the tradeoffs between first fit and best fit.

Why would an engineer choose one over the other?

First fit is the fastest strategy computationally.

It just takes the first suitable segment it finds in the availability list that's big enough to hold the new cell.

Quick and dirty.

Very.

But if the segment it finds is much larger than needed, the remainder is left behind.

And that remainder might be too small to ever be useful for future inserts.

So first fit prioritizes insertion speed, but it leads to higher long term internal fragmentation.

So first fit tends to leave a lot of small wasted chunks of space.

Best fit tries to optimize that.

Best fit attempts to find the segment in the availability list that leaves the smallest remainder possible.

It actually searches the entire list for the best match.

This optimizes overall page utilization, ensuring that free space is consolidated as much as possible.

But the tradeoff is time.

The tradeoff is time.

Checking every segment in the list adds a slight computational cost during insertion.

But ultimately, storage systems often favor best fit because maximizing page capacity postpones the necessity of those really expensive page splitting operations.

And what happens when fragmentation gets so bad that no single contiguous free block is large enough for a new cell, even though the total available space is sufficient?

That's the trigger for defragmentation.

If the database can't find a consecutive segment that's big enough, it initiates a vacuuming process.

It reads all the live, non -deleted cells,

rewrites them contiguously to one side of the page, eliminating all the internal gaps and consolidating all the free space into one large usable block.

And only then can it try the insert again.

Only after defragmentation is complete, and if, even after all that, there's still not creating an overflow page to hold the oversized data.

Defragmentation is expensive, which is why those efficient fit strategies are so important.

And to link this back to our Why It Matters discussion, the stability that the slotted page design gives you simplifies crash recovery immensely.

I mean, think about it.

If the database crashes while a page is being defragmented, a process that involves moving the physical cells around, you don't lose the entire index structure.

The external index entries are still pointing to the stable slot IDs.

When the system restarts, it can use the metadata in the page header to identify that the page is mid -operation and just discard the incomplete physical changes, because the logical pointers, the slot IDs, were never compromised.

Okay, so we've engineered the file perfectly.

But systems exist in a dynamic world.

Hardware fails, software evolves.

This last section has to address how database file formats ensure they remain readable over time and safe from corruption.

Right.

This is all about long -term persistence and survivability.

As database systems evolve, file formats inevitably change.

New features are added, optimization strategies shift.

So a robust storage engine has to support backward compatibility.

It has to be able to reliably read its current format and all the previous legacy formats.

Hashtag, hashtag, a versioning for backward compatibility.

So to handle that evolution, you have to be able to quickly identify which version of the file you're dealing with before you even try to interpret the data.

What are the common strategies for that?

There's three primary methods.

First is using filename prefixes.

Apache Cassandra, for instance, embeds the version directly in the filename.

They use prefixes like now1 or ma -ma.

This is the fastest way to check, because the system doesn't even have to open the file to know the format.

Okay.

Second is storing the version information externally.

Yes.

The version can be stored in a separate version file.

PostgreSQL does this.

They use a file called PGVersion to hold this info, separating that small critical metadata from the potentially massive index files.

And the third method, and probably the most common for index files, is storing the version directly in the index file header itself.

This is standard practice across many systems, and if you use the file header, you must ensure that the section containing the VASION number or even the entire header block is encoded in a format that never changes.

It has to be readable by every version of the software.

Every single one, old or new.

And this version field is almost always combined with a magic number,

a unique fixed numerical value at the file start that identifies the file type.

You know, this is a B -tree index, or this is a wall file.

If that magic number is wrong, the file is immediately rejected.

Hashtag, tag, tag, B, checksumming for data integrity.

And finally, integrity.

Files on disk are susceptible to physical corruption from hardware wear or software glitches.

We need a way to preemptively identify this corruption and stop bad data from propagating through the system.

And this is the realm of checksums and CRCs.

And the terminology here is important because hash, checksum, and CRC get confused all the time.

They all reduce a large chunk of data to a small number, a sort of digital fingerprint, but their guarantees are vastly different.

Simple checksums offer the weakest guarantee.

They do.

They're computed with simple summation or XOR operations.

They're fast, but they're easily defeated.

They often can't reliably detect corruption where multiple bits are flipped, especially if the changes happen to result in the same arithmetic sum.

So then we step up to cyclic redundancy checks, or CRCs.

CRCs are significantly more robust.

They're mathematically designed to detect what are called burst errors.

A burst error is when a sequence of consecutive bits get corrupted, which is a very common failure mode in physical storage devices.

You know, localized scratch, magnetic interference, read head vibration.

CRCs use polynomial division and lookup tables to achieve this robustness, which makes them the standard choice for storage integrity checks.

And we really have to trust these methods are only for accidental change detection.

Absolutely.

Checksums and CRCs are only for detecting accidental, unintended changes.

They are zero protection against malicious or intentional tampering.

If you need data authenticity against a deliberate attack, you must use a strong cryptographic cache function designed for security.

The CRC's goal is purely hardware noise verification.

And in practice, how are these checksums implemented?

They're computed on the page data before writing, and the resulting checksum value is stored in the page header.

Then when you read the page back into the buffer cache, the database recomputes the checksum for the data and compares it against the stored value.

If there's a mismatch, the page is considered corrupt and it's discarded.

And the final trade -off here, why are checksums almost always computed per page instead of over the whole file?

Well, computing a checksum over the whole file is just impractical because you're rarely reading the entire file at once.

Page -level checks are far more efficient.

They're performed on smaller data subsets, which reduces the computational overhead.

And crucially, if corruption occurs, you only discard that single 4KB or 8KB page rather than losing the integrity of the entire multi -gigabyte index structure.

That containment of failure is essential for system resilience.

Hashtag tag outro.

So we've navigated a pretty intricate journey today.

We've traced the process from a raw data type all the way to a durable, persistent page structure on disk.

And the fundamental insight, I think, is that on -disk design is a really intensive, high -stakes exercise in manual resource management.

The engineer is the architect.

They are.

They're personally responsible for handling platform -specific endianness, creating stable binary layouts, and solving the persistent headaches of fragmentation using these advanced techniques.

And the crowning achievement in this architecture really is the slotted page structure.

It so cleverly provides the necessary stability by totally separating the physical data location, the cells, which are free to move around during defragmentation, from the logical access path.

Right.

The sorted slot offsets, which provide a fixed, stable reference point for searching and indexing, That decoupling is what allows highly mutable B -trees to handle high -volume insertions and deletions efficiently without crippling internal overhead.

So ultimately, the performance, the durability, the reliability of any storage engine, whether it's an old -school B -tree or a modern key -value store, it all rests entirely on how intelligently these underlying file formats manage space, organize bytes, and guarantee integrity with versioning and checksums.

It's the difference between a lightning -fast data lookup and a silent corruption failure that brings your entire system to a halt.

So what does this all mean for you, the person learning these structures?

Well, since the slotted page relies on cell offsets, which are only valid within that single page, their page local,

the database needs a separate, higher -level system to ensure that the pointers between pages, those file -wide page IDs we talked about, remain valid.

Something has to manage that global map.

Exactly.

The component responsible for this is typically the buffer manager or the page cache.

Now consider this.

What happens to your entire system if that lookup table, the one that maps the global page IDs to their constantly shifting physical disk offsets, itself becomes corrupt?

That's something to mull over on your next deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Persistent storage in database systems demands fundamentally different design approaches than in-memory data structures, requiring explicit navigation through offsets and system calls rather than transparent pointer access. Binary encoding forms the foundation for reliable disk storage, with endianness conventions ensuring that numeric values maintain consistent interpretation across heterogeneous hardware platforms. Data representation strategies vary significantly based on whether values are fixed-length primitives, variable-length strings using Pascal or null-terminated formats, or bit-packed boolean flags that maximize storage density. The core organizational unit of database files is the fixed-size page, a structured block containing both payload data and metadata headers that describe its contents and internal organization. Slotted page architecture emerges as an essential design pattern for managing variable-length records within these pages, employing a separation strategy where actual data occupies one region while offset pointers to individual records occupy another. This dual-region approach enables efficient insertion and deletion operations by decoupling record boundaries from their physical byte positions, allowing internal defragmentation without breaking external references. Space allocation algorithms must balance simplicity with efficiency, choosing between first-fit and best-fit strategies while maintaining accounting structures like availability lists and free-block registries to prevent fragmentation pathologies. Record deletion creates fragmentation challenges that defragmentation processes address through selective data reorganization and pointer remapping. Long-term system evolution requires versioning mechanisms that allow file format specifications to change while remaining backward compatible with older data. Data integrity protection relies on checksums and cyclic redundancy checks that detect corruption from hardware failures or software bugs, enabling early identification of problems before they propagate through the system. Together, these layered organizational patterns transform the raw byte stream of disk storage into a reliable, performant foundation for database operations, balancing competing demands for access speed, storage efficiency, and data safety.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 3: File Formats

Related Chapters