Chapter 14: File-System Implementation: Allocation Methods and Free-Space Management

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Have you ever stopped to think, like really think, of what's happening inside your computer when you just click save, or, you know, save a photo on your phone.

We use files all the time, obviously.

But the actual mechanics, how they're stored, organized, found again, it's this incredibly complex kind of hidden world.

And that's what we're doing today, a deep dive into that hidden world.

We're looking at file system implementation using the classic operating system concepts book by Silberschatz, Galvin and Gagne as our guide.

We want to demystify how your operating system, whether it's on your phone, your laptop, even those massive cloud servers, handles the basic essential job of storing your data.

That's right.

File systems, they're really the unsung heroes of computing, aren't they?

They act as this invulnerable translator between how we see files like documents, pictures,

and the, well, the raw physics of storing data bits on a physical disk.

So yeah, we'll explore the clever structures involved,

how data is laid out, how things are sped up, and even how systems recover when inevitably something crashes.

It's all about getting your data stored and back to you quickly and reliably.

Okay, so let's start right at the foundation, the file system structure.

Where do these things actually permanently live?

They live on what we call secondary storage.

For a long time, that meant hard disk drives, HDDs, spinning platters, read -write heads moving back and forth.

They're rewritable.

You can jump straight to a specific spot, pretty standard stuff.

But now there's this huge shift towards NVM non -volatile memory.

Think flash storage, SSDs in your laptop.

They work very differently.

You can't just rewrite a single spot easily, and their performance is, well, different.

But yeah, that's where most of our digital stuff lives now.

And whether it's an old HDD or a new SSD, the basic unit of transfer is the block, right?

Like a chunk of data.

Exactly.

Usually maybe 512 bytes, or more commonly now, 4 kilobytes, maybe more.

It's the chunk size for reading and writing, super important concept.

So when designing a file system, what are the big hurdles, the main challenges?

Well, there are really two core things to figure out.

First, how does it look to you, the user, you know, files, folders, properties like names and dates, operations like create, delete, the whole directory tree structure.

And second, this is the tricky part, how do you map that nice logical user view onto the actual physical blocks on the storage device?

That takes some clever algorithm and data structures.

And operating systems handle this with layers, you said, like stacking building blocks.

Yeah, that's a great analogy.

It's a layered approach.

Think of it like this.

Way down at the bottom, touching the actual hardware, is the IO control layer.

This has the device drivers.

It translates high level commands, like get me block 123, into the very specific electrical signals the disk controller understands.

It's the hardware whisperer.

One step up, you've got the basic file system.

This layer thinks in terms of logical blocks, not physical ones.

It issues generic commands, read block X, write block Y.

It also handles scheduling those IO requests, trying to be efficient and manages memory buffers for caching blocks.

Makes sense.

And above that.

The file organization module.

Now we're getting closer to the concept of a file.

This layer knows that a file is made up of logical blocks.

And crucially, it talks to the free space manager, the component that keeps track of all the unused blocks on the disk, like a real estate agent for disk space.

Gotcha.

And the top layer.

That's the logical file system.

This is what you mostly interact with, indirectly.

It manages all the metadata, everything about a file, except the data itself.

So directory structure, mapping file names to internal IDs, and managing the file control blocks, or FCBs.

In UNIX, these are called inodes.

The FCB holds the critical info, ownership permissions, dates, size, and pointers to the actual data blocks.

Protection happens here too.

Wow.

So this layering is why an OS can support, say, both an old hard drive and a brand new SSD, without a total rewrite.

Exactly.

You can swap out the lower layers that device drivers without messing up the logical file system layer the user applications see.

There might be a tiny performance cost with more layers, a little overhead, but the flexibility you gain is usually massive.

It's a worthwhile trade -off.

And this leads to all sorts of different file systems we see out there.

Oh, absolutely.

You've got specialized ones like ISO 9660 for CDs, the classic UFS for UNIX, the Windows family, FAT, FAT32, NTFS, and, of course, XC3, XC4, common on Linux, plus things like distributed file systems for accessing files over networks.

Is there still innovation happening here?

It feels like a pretty solved problem.

Not at all.

It's a super active research area.

Google, for example, built its own file system for its unique, massive scale needs.

And there's this really cool thing called ESE file system in user space.

It lets developers write file systems that run like regular programs, not deep inside the OS kernel.

That makes it much easier to experiment with new ideas or create very specialized file systems without risking the stability of the whole system.

It's pretty neat.

OK, so that's the structure.

Let's dig into operations.

What actually happens, technically, when I double -click a file icon to open it, like the open system call?

Right.

So when you open a file, a lot happens involving structures both on the disk itself and cast in your computer's RAM.

On the disk, when a file system volume is mounted, made available, the OS reads in some critical info.

There's usually a boot control block needed if you want to boot the OS from that disk.

Then there's a volume control block, sometimes called the super block in UNIX.

It holds key stats for the whole volume.

Total blocks, block size, how many blocks are free, pointers to find those free blocks, et cetera.

OK, volume -wide info.

Yep.

Then there's the directory structure itself, which maps the file names you see to internal identifiers.

And finally, for each file, there's that per -file control block, the FCB or INNO, that's the file's passport.

It has the permissions, owner, dates, file size, and crucially, the pointers or information needed to locate the actual data blocks on the disk.

It's the heart of the file's metadata.

And you mentioned caching.

How does that fit in to speed up opening files?

Hugely important.

VOS keeps several structures in memory, in RAM, to avoid slow disk access.

There's usually an in -memory mount table listing the mounted volumes.

A directory structure cached to speed up looking for files by name.

And a really important one, a system -wide open file table.

This holds copies of the FCBs for all files currently open by any program on the system.

It also tracks how many programs have each file open.

System -wide.

Yeah.

So multiple programs can share.

Exactly.

And then your specific program gets an entry in its own per -process open file table.

This doesn't duplicate the whole FCB again.

It just points to the entry in the system -wide table.

It also stores info specific to your program's use of the file, like where you are currently reading or writing the file pointer, and what mode you opened it in, read, write, etc.

So when I call open.

The OS first checks that system -wide table.

Is this file already open by someone else?

If yes, great.

It just increments the open count and gives your process a pointer to the existing entry.

Super fast.

If not, it has to go find the file in the directory structure, hopefully hitting the directory cache.

Then it reads the file's FCB from disk into the system -wide table.

Finally, it creates the entry in your processes table pointing to the system -wide entry and gives you back a file descriptor or handle.

All subsequent reads -writes use that handle.

So it just reverses that.

Pretty much.

It removes the entry from your processes table.

It decrements the counter in the system -wide table.

Only when that counter hits zero, meaning nobody has the file open anymore, does the OS potentially write any updated metadata, like file size or modification time, back to the disk to make it permanent.

So the caching is immense.

That 85 % hit rate you mentioned for BSDU and IX,

that avoids a ton of slow disk I .O.

Absolutely.

It's fundamental to modern OS performance.

Without aggressive caching, using files would feel incredibly sluggish.

Okay, we know how to find the FCB.

But how do the directories themselves work?

How does the OS efficiently find mydocument .txt inside a folder with thousands of items?

Good question.

The implementation of the directory itself matters a lot for performance.

The simplest way is just a linear list.

Literally a list, file name, pointer to FCB, next file name, pointer to FCB, and so on.

Easy to code, I bet.

Yep.

But searching.

If you have thousands of files, you might have to read and compare every single entry slow.

Ouch.

Caching helps, of course.

You could also keep the list sorted alphabetically.

Then you could use a faster binary search.

But then adding or deleting files gets tricky because you have to maintain the sort order.

You're shifting everything around.

Exactly.

So a more common, more sophisticated method is a hash table.

You take the file name, run it through a mathematical hash function, and it gives you an index, basically telling you where to look in a table.

Ideally, it points you right to the file's entry or maybe to a short list if multiple file names happen to hash to the same spot that's called a collision.

Much faster searches, then.

Way faster, usually.

The main challenges are handling those collisions efficiently and figuring out what to do if the hash table itself gets too full.

You might need to rebuild it, which can be a temporary performance hit.

Okay, so we can find the file to info.

Now the big one.

Allocation methods.

How does the system actually decide where on the physical disk to put the data blocks for a file?

Right.

This is crucial.

How do we carve up that disk space?

There are three main strategies.

First, contiguous allocation.

Simplest idea.

Store the entire file in one continuous chunk of blocks.

The directory entry just needs the starting block address and the total length.

Like one long line on the disk.

Exactly.

And the big win is speed, especially for old spinning disks.

Reading the file sequentially or jumping to a specific spot, direct access, is super fast because the disk head barely has to move.

Sounds good.

What's the catch?

Fragmentation.

Specifically, external fragmentation.

As you create, delete, and resize files, the free space gets broken up into lots of small non -contiguous holes.

You might have enough total free space for a new large file, but no single continuous chunk big enough.

It's like having lots of little gaps on your bookshelf, but no room for a big encyclopedia.

Ah, I see.

So wasted space, essentially.

Yeah.

Or you need to run a slow compaction process to shuffle everything around.

Plus, you usually need to know the file's final size when you create it, which is often impossible.

If you guess wrong, you either waste space or can't easily make the file bigger later.

Okay, so contiguous has issues.

What's next?

Linked allocation.

This tackles fragmentation head on.

The blocks of a file can be scattered anywhere on the disk.

Each block simply contains a pointer to the next block in the file.

The directory entry just points to the very first block and maybe the last for convenience.

So like a treasure hunt, following clues from block to block.

Pretty much.

The big advantages.

No external fragmentation.

Files can grow easily.

You don't need to know the size up front.

But I'm sensing a downside for finding things.

You got it.

Direct access is terrible.

Did to block 100, you literally have to read block one, find the pointer to block two, read block two, find the pointer, 99 times.

Lots of slow disk seeks.

Yikes.

There's also a tiny bit of space wasted in each block to store that pointer.

And if one pointer gets corrupted, you could lose the rest of the file.

A famous variation tried to fix the direct access problem.

The file allocation table, or FAT, used in MS -DOS and early Windows.

FAT32 and stuff.

Right.

FAT pulls all those next block pointers out of the data blocks and puts them into a single big table at the beginning of the disk volume.

The directory points to the first block and then you look up that block number in the FAT to find the next block number and so on.

So finding block 100 is faster.

Much faster.

Especially if the FAT itself is cached in memory, you just do table lookups instead of disk reads for each step.

Still not perfect, but way better than plain linked allocation.

Okay, what's the third way?

Indexed allocation.

This tries to combine the best of both.

It gets rid of external fragmentation and supports fast direct access.

How?

It puts all the pointers for a single file together into a special block called an index block.

The directory entry just points to this one index block.

The first pointer in the index block points to the first data block, the second pointer points to the second data block, and so on.

So to find block 100, you just read the index block and look up the hundredth pointer.

Exactly.

Fast direct access.

No external fragmentation.

Sounds like the winner.

Any drawbacks?

Well, you waste a whole block just for pointers.

Even for tiny files.

That's one thing.

And what if the file is huge?

So big that all its block pointers don't fit into a single index block.

Good point.

What then?

You need strategies.

You could link multiple index blocks together.

Or more commonly, you use a multi -level index.

Think of it like a tree.

The first index block doesn't point to data blocks, but to other index blocks.

Those second level blocks might point to data or even to third level index blocks for truly enormous files.

Wow.

Layers upon layers.

Yeah.

And this is where pointer size becomes critical.

With 32 -bit pointers, you hit a limit around 4 gigabytes per file.

That's why modern systems use 64 -bit, or even like a ZFS, 128 -bit pointers to handle the massive file sizes we see today.

Unix systems often use a really clever combined scheme in their inodes.

They have, say, 12 direct pointers right in the inode for small files super fast.

Then they have a pointer to a single indirect block, which points to data blocks.

A pointer to a double indirect block points to single indirect blocks.

And maybe a triple indirect block.

It scales really elegantly from tiny files to huge ones.

Okay, quick summary.

Contiguous is fastest with fragments.

Linked avoids fragments, but is slow for direct access.

Index is good for direct access.

No external fragments, but has pointer overhead and needs tricks for big files.

Hybrids often win.

That's a great summary.

And remember, the gap between how fast CPUs are and how slow disks are, especially spinning ones, is huge and growing.

So OS designers will implement these complex schemes using thousands of CPU instructions just to save a few milliseconds of disk head movement.

It's that critical.

Though with NVM SSDs, the game changes a bit.

No seek time, so optimizations focus more on reducing write amplification and managing erase cycles.

Right.

So, okay, we've allocated space.

What happens when we delete a file?

How does the system reclaim that space?

Free space management.

Crucial.

You can't just leave deleted file blocks unusable.

The system needs a way to crack every single free block.

A very common method is the bit vector, or bitmap.

It's simple.

One bit for every block in the disk.

One means free, zero means allocated, or vice versa.

Like a giant checklist.

Exactly.

Finding a free block, or even a sequence of contiguous free blocks, can be very fast using bit manipulation instructions on the CPU.

The main drawback.

For huge disks, that bitmap can get really, really big.

Terabytes of storage mean megabytes just for the bitmap.

Keeping it all in RAM can be tough.

Okay, what else?

You could use a linked list of free blocks.

The system just stores a pointer to the first free block.

That block contains a pointer to the next free block, and so on, chaining them all together.

Simpler, maybe, but slow to find,

say, a hundred free blocks.

Yeah, traversing the list is inefficient,

but often you only need one free block, so you just grab the first one off the list.

It's simple to implement.

FAT systems actually integrate this with the main FAT structure.

Variations exist, like grouping, where the first free block stores pointers to, say, the next 50 free blocks, making it faster to grab chunks.

Or counting, where instead of listing every free block, you store the start address and the count of a contiguous run of free blocks.

This makes the list much shorter if you have large areas of free space.

And you mentioned ZFS before.

Does it do something fancy?

Oh yeah.

ZFS deals with potentially enormous storage pools, so it uses space maps.

It divides the space into chunks called meta -slabs.

Each meta -slab has its own space map, which is essentially a log of allocations and frees using that counting method.

These maps are loaded into memory efficiently using tree structures.

It scales incredibly well.

And for SSDs, you mentioned trim.

Right, trimming.

This is specific to NVM, like SSDs.

Flash memory needs to be erased before it can be rewritten, and erasing is slow and done in large chunks.

When you just delete a file normally, the OS marks the blocks free internally, but the SSD controller doesn't know those pages are invalid anymore.

So when you later try to write new data, the SSD might have to do a slow erase cycle first, right when you need speed.

This causes performance to degrade over time the right cliff.

The trim command lets the OS explicitly tell the SSD, hey, these blocks are no longer needed.

The SSD can then do garbage collection and erase those blocks proactively in the background maintaining good performance.

It's vital for SSD health and speed.

It's amazing how much thought goes into just managing free space.

It all ties back to efficiency and performance.

Storage is off in the bottleneck, isn't it?

Absolutely.

Efficiency is about smart allocation, like how UNINIX spreads inodes out to keep them near their data, or using clustering, allocating several blocks at once to reduce overhead.

There are trade -offs, too, like should the OS update the last access time on a file every single time it's read?

It's useful, but it means an extra disk write just for reading, which slows things down.

Many systems make this optional now, and we saw how pointer sizes evolved from 12 -bit FAT limiting disks to 32 -milliby up to today's 64 -bit or 128 -bit systems handling astronomical amounts of data.

Designing for the future is hard.

And for performance, caching is key, as we discussed.

Caching, caching, caching, both on the disk controller itself and in the OS's main memory.

We talked about avoiding that double caching issue by using unified virtual memory, where both regular file I .O.

read -write and memory -mount files use the same page cache managed by the virtual memory system.

Much more efficient.

And how does the cache decide what to keep?

Usually something like LRU least recently used is a good general strategy.

But if the OS detects you're reading a huge file sequentially, it might use free -behind discarding blocks right after use and read -ahead proactively fetching upcoming blocks to optimize throughput.

Also, most writes are asynchronous.

The write call returns immediately after the data is put in the cache, and the OS writes it to disk later.

This feels fast.

But sometimes, like for database transactions or critical metadata updates, you need synchronous writes that don't return until the data is safely on the physical disk.

Slower, but safer.

Okay, safety leads to the next big topic.

Recovery.

What happens if the power goes out or the system crashes while all this complex stuff is happening?

How do we ensure data isn't corrupted?

Yeah, that's the nightmare scenario.

Caching makes things fast, but it means data might be in memory but not yet on disk when a crash occurs.

The traditional approach was consistency checking.

After a crash, you'd run a utility like foodflock, file system check, and UNIX.

It sends all the file system metadata directories, inodes, free lists looking for inconsistencies like blocks allocated to multiple files, or free blocks also listed in a file.

It tries to fix them.

But you said that's slow.

Painfully slow on large disks, hours sometimes.

And it can't always fix everything perfectly.

Sometimes it has to guess or even discard data.

A much better approach, now very common, is journaling, or log -structured file systems.

Think NTFS, XT3, XT4, ZFS.

It borrows ideas from databases.

The core idea.

Before you make any changes to the actual file system structures, like allocating a block or updating a directory, you first write a description of what you're about to do to a sequential log file that journal.

Like writing down your intentions first.

Exactly.

Once that intention is safely written to the log on disk, it's committed.

Then the OS can apply the change to the main file system structures at its leisure, if the system crashes.

On reboot, it just looks at the log.

Any committed transactions that weren't fully completed, it just replays them from the log.

Done.

Consistency restored very quickly, without scanning the whole disk.

That sounds much faster and safer.

It is.

And it has a side benefit.

Writing to the sequential log is usually much faster on spinning disks than doing lots of small, random updates all over the place for metadata, so it often improves performance, too.

Are there other recovery strategies?

Yep.

Another powerful one is Copy -on -Write, or C -O -W.

Systems like Waffle will get to that, and ZFS use this.

The golden rule.

Never overwrite data in place.

When you modify a block, you write the new version to a completely new, unused block on disk.

Then you update the planers higher up, like in the index block or directory, to point to this new version.

So the old version is still there for a moment.

Exactly.

Until the pointers are updated atomically.

This makes crashes much less dangerous, and it enables an amazing feature.

Snapshots.

To take a snapshot, you essentially just preserve the old pointers, like the root pointer of the file system, from that moment in time.

Since nothing gets overwritten, that old pointer still points to the consistent state of the file system as it was.

It's incredibly fast and space efficient.

Only changed blocks take up new space after the snapshot.

Wow.

Instant backups, almost.

Pretty much.

ZFS goes even further, checksumming all data and metadata blocks.

If it detects corruption, maybe comparing with RAID parity, it can often self -heal.

It rarely even needs a traditional consistency check.

And of course, none of this replaces good old backup and restore.

You still need copies on separate media, tape, cloud, another disk, in case the whole device fails or ransomware hits.

Regular backups stored securely, ideally off -site, are non -negotiable for valuable data.

Absolutely.

You mentioned WAFL.

Let's use that as a concrete example.

What makes it special?

WAFL Write Anywhere File Layout from NetApp is really interesting.

It was designed specifically for network file servers, NFS, CIFS, which often face tons of small random write operations.

It uses copy -on -write heavily.

Like ZFS, it never overwrites data.

All its metadata in nodes, free block maps, are actually stored in files themselves, which makes the file system very flexible.

Its snapshot mechanism is central.

When you take a snapshot, it just duplicates the root inode.

New writes go to new blocks, modifying the current view.

The snapshot inode still points to the old, untouched blocks.

So snapshots are super cheap and fast.

Extremely.

You can take tons of them.

This is great for backups, testing software updates on a safe copy, versioning, lots of uses.

They also have clones, which are basically writable snapshots, efficient for creating development or test environments that start as a copy of production.

And replication builds on this, sending snapshots efficiently to another system for disaster recovery, only transmitting the blocks that changed since the last snapshot.

Very cool.

And you mentioned Apple's APFS briefly.

Yeah, APFS, Apple File System, is another modern example.

Designed for everything from an Apple Watch to a Mac Pro.

It's got all the modern bells and whistles, 64 -bit everything, copy -on -write, snapshots, clones.

It has neat features like state sharing, where multiple volumes can flexibly share the underlying free space in a container.

Also, things like fast directory sizing,

atomic safe saves for operations like renames, so they either are complete fully or not at all, even if you crash midway, and IO optimizations for flash storage.

It really shows where things are heading.

Wow.

Okay, we've covered a ton of ground, from basic blocks and layers all the way to journaling, copy -on -write, and these advanced file systems.

It really is staggering how much complex, clever engineering sits beneath that simple save icon we click every day.

It truly is.

Decades of innovation to make storing and retrieving billions of bits feel effortless and reliable.

And it makes you wonder, what's next?

As storage tech keeps changing persistent memory, maybe DNA storage someday, who knows?

How will file systems adapt?

What new trade -offs between speed, cost, and maybe even energy use will emerge?

Definitely something to ponder.

Maybe next time you save a file or stream a movie, you'll spare a thought for those hidden layers doing the work.

Thank you for joining us on this deep dive into the fascinating world of file system implementation.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Operating systems manage files through layered abstractions that translate user-level operations into efficient on-disk data structures and algorithms. File-system implementation requires careful coordination of several architectural components, starting with control blocks that store essential metadata. The boot control block contains instructions for system startup, while volume control blocks maintain information about entire disk partitions, and file control blocks preserve attributes and locations for individual files. In-memory structures supplement these on-disk mechanisms by caching frequently accessed metadata to reduce I/O overhead. Allocation methods determine how files are physically arranged on storage devices, with contiguous allocation providing fast sequential access but suffering from external fragmentation, linked allocation eliminating fragmentation through pointer chains but complicating random access, and indexed allocation using pointer blocks to combine benefits of both approaches. Free-space management equally influences system performance, employing bitmaps for space-efficient tracking, linked lists for minimal memory overhead, or grouping schemes that balance efficiency with scalability. Directory implementations range from linear lists suitable for small directories to hash tables enabling rapid file lookups in larger systems. Virtual file systems provide an abstraction layer allowing multiple file-system types to coexist transparently, simplifying system architecture and portability. File-system reliability depends on recovery mechanisms that restore consistency after unexpected failures, with techniques including consistency checking that validates structural integrity, journaling that logs operations before execution, and log-structured approaches that treat the entire file system as a sequence of updates. Real-world implementations like UNIX with its inode-based structure and Windows NTFS demonstrate how these theoretical concepts translate into production systems with specific performance characteristics, reliability guarantees, and operational constraints. Understanding these implementation details enables appreciation of how modern operating systems achieve the simultaneous goals of performance, reliability, and efficient resource utilization at the storage layer.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 14: File-System Implementation: Allocation Methods and Free-Space Management

Related Chapters