Chapter 11: Mass-Storage Structure: HDD, NVM, RAID, and Swap-Space Management

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

You tap an app, upload a photo, save a document.

It feels like digital magic, right?

It really does.

But have you ever stopped to think about where all that data actually lives?

I mean, it's not just on your computer or in the cloud in some amorphous digital ether.

No, there's a whole universe of hardware and software working tirelessly behind the scenes.

It's quite amazing, actually.

Exactly.

And here on The Deep Dive, our mission is to take complex concepts like this and really unpack them.

We want to give you that shortcut to being truly well-informed.

Hopefully with some surprising facts.

And just enough humor to keep you hooked, yeah.

And today, we're doing just that.

We're diving deep into the fascinating world of mass storage structure.

Sounds good.

We're drawing insights from the foundational text, Operating System Concepts, 10th edition.

Ah, the classic.

It is.

We'll explore everything from the intricate mechanical wonders humming inside your laptop to the vast invisible storage powering the entire internet.

So we're going to demystify how operating systems manage all of your precious data.

Right.

And the clever tricks they use to make slow hardware seem blazing fast.

How they protect your files from disaster.

That's a big one.

Absolutely.

And even what happens when storage devices inevitably start to fail.

Because well, they do.

They do.

Get ready for some serious aha moments.

Because once you see what's truly happening, you'll never look at a save button the same way again.

Ah, probably not.

Okay.

Let's set the stage, then.

When we talk about the primary types of non-volatile storage in modern computers, the kind that keeps your data even when the power's off.

The important stuff.

Exactly.

We're really looking at two main contenders, hard disk drives or HDDs.

The old workhorses.

And non-volatile memory devices, NVMs.

The speedy newcomers, relatively speaking.

Right.

Of course, there's also slower, larger tertiary storage like old magnetic tapes, maybe optical disks or the vastness of cloud storage.

Yeah, but for today's focus.

Our deep dive today focuses on those first two foundational workhorses, the HDDs and NVMs.

That's right.

And for decades, the mechanical workhorse, the hard disk drive, was absolutely king.

Totally dominated.

To picture it, imagine a stack of shiny circular platters.

A bit like old CDs, right, ranging from under two inches to over three inches in diameter.

Both surfaces of each platter are covered with a magnetic material.

Information is stored by magnetizing tiny patterns on these surfaces.

Okay.

Now here's the cool part.

A read-write head literally flies just micrometers, that's millionths of a meter, above each surface.

Wow, that's close.

Incredibly close.

Barely touching the air cushion created by the spinning platters.

And all these heads are attached to a single arm that moves them as one unit.

So it's kind of like a super precise, super fast record player with the arm moving across the record to find the right song.

I love that analogy.

Exactly.

It's a great analogy.

And just like a record, each platter surface is divided into concentric circular tracks.

These tracks are then subdivided into sectors, the smallest unit of data transfer.

Traditionally, that was 512 bytes, tiny.

Now many drives use 4KB sectors.

And a cylinder is a key concept here.

It's the set of all tracks that are at the exact same arm position across all platters.

Like a vertical slice through the stack.

Precisely, thousands of cylinders, hundreds of sectors per track.

It's a miniature high speed metropolis of data.

Now when it comes to performance, speed is everything.

The platters spin incredibly fast, anywhere from 5,400 to 15,000 revolutions per minute.

Which is what per second?

That's like 90 to 250 times per second.

This affects the transfer rate, how fast data flows to your computer.

But the real bottleneck is positioning time, or random access time.

Right, getting the head to the right place.

Exactly.

This has two parts.

Seek time, the time it takes the arm to move to the right cylinder.

The physical movement.

Yes.

And rotational latency, the time you wait for the desired sector to spin around to the read-write head.

Ah, okay.

So you seek, then you wait.

You got it.
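
For anyone following along in text, the seek-then-wait arithmetic can be sketched in a few lines. This is a rough model, not a drive specification: it assumes the desired sector is, on average, half a revolution away, and the 4 ms seek time in the usage line is a made-up illustrative value.

```python
def revolutions_per_second(rpm: float) -> float:
    """Convert a spindle speed in revolutions per minute to revolutions per second."""
    return rpm / 60.0


def average_rotational_latency_ms(rpm: float) -> float:
    """Average wait for a sector, assuming it is half a revolution away on average."""
    ms_per_revolution = 60.0 / rpm * 1000.0
    return 0.5 * ms_per_revolution


def positioning_time_ms(seek_ms: float, rpm: float) -> float:
    """Positioning (random access) time = seek time + rotational latency."""
    return seek_ms + average_rotational_latency_ms(rpm)


for rpm in (5400, 7200, 15000):
    print(rpm, "RPM:",
          revolutions_per_second(rpm), "rev/s,",
          round(average_rotational_latency_ms(rpm), 2), "ms average rotational latency")

# Hypothetical 4 ms seek on a 15,000 RPM drive.
print(positioning_time_ms(4.0, 15000), "ms positioning time")
```

Even on the fastest spindle the wait is measured in milliseconds, which is why positioning time dominates small random accesses.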

Think about it.

Waiting for something to physically move is always going to be slower than electrical signals.

Makes sense.

Modern drives try to mitigate this with DRAM buffers in their controllers.

Essentially a small, super fast staging area for data.

A bit of cache.

And what about common pitfalls?

I've heard the term head crash before, and it sounds pretty terrifying for your data.

It is terrifying.

A head crash is precisely what it sounds like.

The delicate read-write head, which normally floats on that cushion of air, suddenly makes contact with the platter surface.

Oh.

Big ouch.

This can scratch and damage the magnetic layer, leading to irreversible data loss, and often rendering the entire drive useless.

So backups are crucial, then.

It's a stark reminder of why backups aren't just a good idea.

They're absolutely essential.

And while manufacturers publish impressive transfer rates.

Yeah, the numbers on the box.

Exactly.

Real-world effective rates are often lower, due to all that underlying work and overhead.

Positioning time, mainly.

Right.

Interestingly, though, many modern HDDs are designed to be removable or hot-swappable.

Meaning you can pull them out while the system is running?

Yep.

You can add or remove them without powering down your system, which is a neat trick for expanding storage on the fly, especially in servers.

Okay, neat.

Now let's pivot to the electrical sprinters.

Yeah.

Non-volatile memory devices, or NVMs.

The SSDs and things.

Right.

The core difference.

No moving parts at all.

They're entirely electrical, primarily made up of a controller and flash NAND semiconductor chips.

Okay.

You encounter them everywhere.

The SSDs that look like traditional hard drives, tiny USB drives, and even chips surface-mounted directly onto motherboards and devices like your smartphone.

And the advantages here seem immediate, right?

More reliable, because there's nothing to physically break.

No moving parts to crash.

Incredibly faster, because there's no mechanical seek time or rotational latency.

Essentially zero positioning time.

And they sip power.

What was the real breakthrough that made them so revolutionary?

You nailed the key advantages.

The real breakthrough was simply the ability to store non-volatile data electrically in a cost-effective, durable way.

Right.

The cost used to be prohibitive.

Exactly.

Historically, NVMs were far more expensive per megabyte and had lower capacity, but that's rapidly changing.

Capacity has skyrocketed.

Prices have dropped.

Yeah.

They're pretty standard now.

They're the standard in most laptops and mobile devices for their speed, small size, and energy efficiency.

They also connect directly to super-fast interfaces like NVMe.

Which uses… PCIe.

Right.

It uses the computer's PCIe bus for maximum data throughput.

Much faster than SATA.

Yeah.

You'll even find NVMs used as cache tiers in larger systems.

Like a fast buffer.

Precisely.

Acting as a super-fast pit stop for frequently accessed data to optimize overall system performance.

Okay.

Let's unpack this.

While NVMs are amazing, I know NAND flash memory has its own unique quirks.

It's not just a simple flip of a switch, is it?

Absolutely not.

NAND flash has some distinct characteristics that make its management… well, fascinating.

You can read and write data in small page increments.

Typically 4KB or 8KB.

Okay.

Pages.

But here's the kicker.

You cannot overwrite data directly.

Wait.

What?

How do you change anything, then?

To change anything, the larger block containing that page (a block is made up of many pages) must first be erased.

Erased first.

Got it.

And erasing is significantly slower than reading or writing.

What's more, each NAND cell has a finite lifespan.

It deteriorates with every erase cycle.

It wears out, like, physically.

Kind of, yeah.

After about a hundred thousand program-erase cycles, give or take, it can no longer reliably hold data.

So for SSDs, lifespan isn't measured in years, but in drive writes per day.

Or DWPD.

DWPD, drive writes per day.

Right.

How many times you can write the drive's entire capacity each day over its warranty period.
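
A DWPD rating converts to a total write budget with simple arithmetic. The sketch below is only an illustration: the 1 TB capacity, 5 DWPD rating, and 5-year warranty are assumed example values, and real endurance ratings come with vendor-specific conditions.

```python
def daily_write_budget_tb(capacity_tb: float, dwpd: float) -> float:
    """Terabytes that may be written per day at the rated DWPD."""
    return capacity_tb * dwpd


def warranty_write_budget_tb(capacity_tb: float, dwpd: float, warranty_years: float) -> float:
    """Total terabytes written over the warranty, using 365-day years."""
    return daily_write_budget_tb(capacity_tb, dwpd) * 365 * warranty_years


# Assumed example: 1 TB drive rated at 5 DWPD with a 5-year warranty.
print(daily_write_budget_tb(1, 5), "TB per day")
print(warranty_write_budget_tb(1, 5, 5), "TB over the warranty period")
```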

Wow.

So there's this hidden degradation happening.

How does the drive manage that so we don't just hit a wall one day and lose everything?

That's where the NVM device controller becomes incredibly clever.

It manages these limitations mostly transparently to your operating system.

It hides the complexity.

Exactly.

It uses algorithms like the flash translation layer, FTL, which acts like a smart postal worker.

When you tell it to write to a logical address, it maps that to an available physical page, keeping track of valid and invalid pages.

Okay, like an address book.

A very dynamic one.

Then there's garbage collection.

This identifies blocks that contain pages marked as invalid.

Data that's been deleted or moved?

Right.

And it copies any remaining valid data from those blocks to a new empty block.

Then it erases the old block to make it reusable.

So it's constantly tidying up.

Constantly.
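
To make the address-book idea concrete, here is a toy flash translation layer with garbage collection. Everything in it (the `TinyFTL` class, its method names, the tiny block sizes) is hypothetical and far simpler than a real controller; it only shows the out-of-place write, the invalid-page bookkeeping, and the copy-then-erase step described above.

```python
class TinyFTL:
    """Toy flash translation layer (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int = 4, pages_per_block: int = 4):
        self.pages_per_block = pages_per_block
        # Each physical page holds (logical_address, data), or None when erased.
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.valid = [[False] * pages_per_block for _ in range(num_blocks)]
        self.mapping = {}                    # logical address -> (block, page)
        self.erase_counts = [0] * num_blocks

    def _find_free_page(self, skip_block=None):
        for b, block in enumerate(self.blocks):
            if b == skip_block:
                continue
            for p, slot in enumerate(block):
                if slot is None:
                    return b, p
        raise RuntimeError("no erased pages left; garbage collection is needed")

    def write(self, logical, data, skip_block=None):
        # Flash cannot overwrite in place: mark the old copy invalid,
        # then program a fresh (erased) page and update the map.
        if logical in self.mapping:
            old_b, old_p = self.mapping[logical]
            self.valid[old_b][old_p] = False
        b, p = self._find_free_page(skip_block)
        self.blocks[b][p] = (logical, data)
        self.valid[b][p] = True
        self.mapping[logical] = (b, p)

    def read(self, logical):
        b, p = self.mapping[logical]
        return self.blocks[b][p][1]

    def garbage_collect(self, victim: int):
        # Copy the still-valid pages to another block, then erase the victim.
        for p in range(self.pages_per_block):
            if self.valid[victim][p]:
                logical, data = self.blocks[victim][p]
                self.write(logical, data, skip_block=victim)
        self.blocks[victim] = [None] * self.pages_per_block
        self.valid[victim] = [False] * self.pages_per_block
        self.erase_counts[victim] += 1


ftl = TinyFTL(num_blocks=2, pages_per_block=4)
ftl.write(0, "a")
ftl.write(1, "b")
ftl.write(0, "a2")      # the old copy of logical page 0 becomes invalid
ftl.write(1, "b2")      # block 0 is now full: two invalid pages, two valid
ftl.garbage_collect(0)  # valid pages move to block 1, block 0 is erased
print(ftl.read(0), ftl.read(1), ftl.mapping, ftl.erase_counts)
```

The per-block erase counter is the kind of information a wear-leveling policy would consult when choosing where to write next.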

To ensure there's always fresh space for these operations, devices use over-provisioning.

Meaning they have extra hidden storage.

Precisely.

Setting aside a percentage of pages, often 20% or so, as always-available internal space for writes and garbage collection.

Clever.

And to distribute wear evenly and maximize lifespan, wear-leveling algorithms ensure erasures are spread across the entire device, preventing any single area from wearing out prematurely.

So it spreads the load.

It does.

The true magic of your SSD isn't just speed.

It's this invisible, constant dance its controller performs to prolong its life.

That makes perfect sense.

But all this behind-the-scenes copying and erasing sounds like it could lead to a concept called write amplification.

Can you explain what that means for a user's experience?

You're right to spot that.

Write amplification occurs because a single write operation from your computer can trigger multiple internal I/O operations within the NVM device due to garbage collection.

Ah, so one write becomes many internal writes.

Exactly.

For example, if you change a single byte in a page, the drive might have to read the entire original block, copy valid data to a new block, then write the updated data, and finally erase the old block.

So one small user write can become several large internal writes.
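
One common way to put a number on this is a write amplification factor: physical pages written divided by pages the host asked to write. The sketch below uses that ratio with invented page counts; the function name and the example figures are assumptions for illustration.

```python
def write_amplification(host_pages: int, relocated_pages: int) -> float:
    """Physical page writes divided by host page writes.

    relocated_pages counts valid pages that garbage collection had to copy
    elsewhere in order to free a block for the host's data.
    """
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return (host_pages + relocated_pages) / host_pages


# Hypothetical case: the host updates 1 page, and freeing space for it
# forces the controller to relocate 3 still-valid pages.
print(write_amplification(host_pages=1, relocated_pages=3))   # 4.0

# With no relocation needed (plenty of erased space), the factor is 1.
print(write_amplification(host_pages=10, relocated_pages=0))  # 1.0
```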

And that affects performance.

It directly impacts the device's overall write performance, especially as the drive fills up.

It's one reason why SSDs can sometimes feel slower when they're nearly full.

Right, got it.

And just like HDDs, NVMs also employ error-correcting codes, or ECC, for data protection.

We'll touch on ECC later.

We will.

And they can be part of RAID configurations for catastrophic failure protection too.

Okay.

Now, a slightly quirky but important type of mass storage is the RAM drive.

Using RAM.

But that's volatile, right?

Data disappears when the power goes off.

It is, and it does.

But a device driver can carve out a section of your system's regular DRAM and present it to the operating system as if it were a physical storage device.

So why on earth would you use volatile memory for mass storage if the data vanishes on shutdown?

What's the practical takeaway here?

The takeaway is raw speed, blinding speed.

While NVMs are fast, DRAM is orders of magnitude faster.

Okay, so it's for temporary stuff.

Exactly.

RAM drives are perfect for temporary files, caches, or data sharing between programs that need incredibly quick access and where persistence isn't required.

Think of something like a video editor scratch disk.

Yeah, temporary render files.

Or compiling large code bases where temporary files are generated and deleted constantly.

You see this in Linux with /dev/ram, on macOS, and often in temporary file systems like Linux's tmpfs or initrd.

What's initrd?

That's a temporary root file system used during boot up before the real drives are mounted.

It's a clever way to leverage extreme speed for non -critical ephemeral data.

Interesting niche.

Shifting gears.

Let's talk about how all this storage actually communicates with your computer.

The connections.

Right.

Storage devices attach via either the internal system bus or an I/O bus.

Common examples you might recognize are SATA for many drives.

The standard connector for a while.

Yup.

NVMe for cutting-edge SSDs, which connects directly to the PCIe bus for maximum speed.

Much faster.

USB for external drives.

And Fibre Channel for high-end server storage.

These connections are managed by electronic processors called controllers.

You have a host controller on the computer side and a device controller built into the storage device itself, often with its own cache.

So two controllers talking to each other.

Basically yes.

Your computer sends commands to its host controller, which then talks to the device controller to execute the input-output, or I/O, operation.

I'm curious, with all this underlying complexity, cylinders, platters, pages, blocks, how does the operating system make sense of it all for us?

How do they simplify addressing?

That's a great question, and it's where abstraction is absolutely key.

Operating systems simplify things dramatically.

Thank goodness.

They address storage devices as a large one-dimensional array of logical blocks, where a logical block is simply the smallest unit of transfer.

Just a numbered list of blocks.

Exactly.

The OS just uses a logical block address, or LBA.

So instead of "go to cylinder 5, track 10, sector 3," the OS just says "go to LBA 5342."

Much simpler.
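
For the curious, the classic textbook-style conversion between a cylinder/head/sector triple and a logical block address looks like the sketch below. The geometry (16 heads per cylinder, 63 sectors per track) is a hypothetical example, and modern drives do not expose or necessarily follow such a simple layout, so the result will not match the LBA number quoted in the conversation.

```python
def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    """Classic CHS -> LBA formula.

    Cylinders and heads count from 0; sectors traditionally count from 1.
    """
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba: int, heads_per_cylinder: int, sectors_per_track: int):
    """Inverse of chs_to_lba for the same (assumed) fixed geometry."""
    cylinder, remainder = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1


HEADS, SECTORS = 16, 63   # hypothetical geometry, for illustration only

print(chs_to_lba(0, 0, 1, HEADS, SECTORS))    # first sector of the drive -> 0
lba = chs_to_lba(5, 10, 3, HEADS, SECTORS)
print(lba, lba_to_chs(lba, HEADS, SECTORS))
```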

For HDDs, LBA 0 might map to the first sector on the outermost cylinder, and it proceeds sequentially from there.

For NVMs, it maps from that complex chip block page structure to this simple logical block array.

Okay.

And the fascinating part is that modern drives handle the complex LBA-to-physical mapping internally.

The drive does the translation.

The drive's own controller does.

If a sector goes bad, the drive's controller automatically substitutes a spare sector for it.

So the OS still thinks it's accessing the same logical block, even though it's physically elsewhere.

Wow.

Completely hidden.

This abstraction layer is vital for both reliability and performance, as it lets the drive optimize its own internal operations without the OS needing to know every physical detail.

Okay.

That makes sense.

Abstraction is powerful.

Now let's dive into orchestrating performance, specifically through storage scheduling.

Making things faster.

Right.

For mechanical hard disk drives, the fundamental goals are always to minimize access time.

The seek time and rotational latency.

Exactly.

And maximize how much data we can transfer, the bandwidth.

And I imagine scheduling is incredibly critical for HDDs because of that mechanical movement, the seek time you mentioned earlier.

That's the biggest bottleneck, right?

Precisely.

Unlike NVMs, where electrical signals are almost instantaneous, HDDs have to physically move that arm across the platters.

Seeking across cylinders is by far the slowest part of any I/O operation.

Takes milliseconds, which is an eternity for a computer.

It really is.

When your computer needs to read or write data, it issues a request.

If the drive is busy, that request gets put into a queue.

A waiting list.

And this queue is where the operating system's device drivers can apply clever scheduling algorithms to optimize performance.

Okay.

Even though drives hide physical details, these algorithms still assume that requests for logical block addresses that are close together are likely physically close.

Because of how LBAs usually map.

Right.

So they group them to reduce that slow head movement.

Makes sense.

Minimize the arm swinging back and forth.

Exactly.

Let's walk through some of these algorithms with an example.

Imagine our disk head is currently at cylinder 53, and we have a queue of requests for data on cylinders 98, 183, 37, 122, 14, 124, 65, and 67.

Okay, got the numbers.

Head at 53.

First the simplest.

Yeah.

FCFS.

Or first come, first served.

Just handle them in order.

Seems fair.

It's fair, yes, but often incredibly inefficient.

Starting at 53, the head would go to 98.

Okay, move out.

And all the way out to 183.

Big jump.

Then jump all the way back to 37.

Oof.

And out to 122, back to 14, out to 124, back to 65, and finally 67.

That sounds awful.

Lots of movement.

You can visualize that wild back and forth movement.

It's a huge amount of wasted time.

Total head movement is massive.
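
How massive? Adding up the distances for the queue in this example gives a concrete figure. A minimal sketch, where the function name is just an illustrative choice:

```python
def fcfs_head_movement(start: int, requests: list[int]) -> int:
    """Total cylinders traveled when requests are served strictly in arrival order."""
    total = 0
    position = start
    for cylinder in requests:
        total += abs(cylinder - position)
        position = cylinder
    return total


queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs_head_movement(53, queue))  # 640 cylinders
```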

Okay, so FCFS is simple, but potentially terrible for performance.

Next, SCAN, often called the elevator algorithm.

Like an elevator in a building.

Exactly.

The disk arm moves in one direction, servicing requests it passes until it hits the end of the disk, then reverses.

Okay.

So if our head is at 53 and moving towards cylinder zero, going inwards, it would pick up 37, then 14.

Once it reaches zero, it reverses and moves towards the other end, say, cylinder 199.

Head out now.

And on that outward sweep, it services 65, 67, 98, 122, 124, and 183.

Much smoother.

Less back and forth.

Significantly reduces total head movement and offers better fairness.

But a request just behind the head might still have to wait for a full sweep across the disk and back.

Right, if you just missed the elevator.

Pretty much.
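
The same queue under the elevator approach can be sketched as follows. This version assumes the head starts at 53 moving toward cylinder 0 and always travels to that edge before reversing; the function name and the convention of stopping the count at the last serviced request are illustrative choices.

```python
def scan_schedule(start: int, requests: list[int], lowest: int = 0):
    """SCAN with the head initially moving toward `lowest`.

    Returns (service_order, total_cylinders_traveled). The head is assumed
    to go all the way to the edge before reversing direction.
    """
    downward = sorted((c for c in requests if c <= start), reverse=True)
    upward = sorted(c for c in requests if c > start)

    # The edge is a turning point for the head, not a serviced request.
    stops = downward + [lowest] + upward
    total = 0
    position = start
    for cylinder in stops:
        total += abs(cylinder - position)
        position = cylinder
    return downward + upward, total


queue = [98, 183, 37, 122, 14, 124, 65, 67]
order, moved = scan_schedule(53, queue)
print(order)  # [37, 14, 65, 67, 98, 122, 124, 183]
print(moved)  # 236 cylinders
```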

A variation of SCAN is C-SCAN, or circular SCAN.

This provides even more uniform wait times.

Sounds different.

Like SCAN, it moves in one direction, servicing requests.

But when it reaches the end of the disk, instead of reversing and servicing requests on the way back, it immediately jumps back to the very beginning of the disk without servicing anything on that return trip.

A quick return to the start.

Right.

And then it starts a new sweep outwards again.

So from 53, moving towards 199, it would service 65, 67, 98, 122, 124, 183.

At 199, it instantly jumps back to zero, then on its next sweep outwards, it would service 14 and 37.

Ah, so it only serves in one direction.

Treats the disk like a circle.

Exactly.

Treats the cylinders as a circular list, leading to more predictable wait times for everyone.
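
A sketch of the circular variant on the same queue, assuming cylinders 0 through 199 and a head that starts at 53 moving toward 199. Whether the return jump counts as head movement is a matter of convention; this version counts it, and the function name is illustrative.

```python
def cscan_schedule(start: int, requests: list[int],
                   lowest: int = 0, highest: int = 199):
    """C-SCAN with the head moving toward `highest`.

    Returns (service_order, total_cylinders_traveled). The head is assumed
    to run to the far edge, jump back to `lowest` without servicing
    anything, and then sweep upward again. The return jump is counted.
    """
    first_sweep = sorted(c for c in requests if c >= start)
    second_sweep = sorted(c for c in requests if c < start)

    # Edges are turning points for the head, not serviced requests.
    stops = first_sweep + [highest, lowest] + second_sweep
    total = 0
    position = start
    for cylinder in stops:
        total += abs(cylinder - position)
        position = cylinder
    return first_sweep + second_sweep, total


queue = [98, 183, 37, 122, 14, 124, 65, 67]
order, moved = cscan_schedule(53, queue)
print(order)  # [65, 67, 98, 122, 124, 183, 14, 37]
print(moved)  # 382 cylinders, including the 199-cylinder return jump
```

The total is higher than the elevator sweep for this queue, but every request waits at most about one full circuit, which is where the more uniform wait times come from.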

So how do operating system designers choose which of these complex algorithms to implement?

What's the practical takeaway for them?

It's a balancing act.

An optimal algorithm is almost impossible to compute in real time.

Too much overhead to figure it out.

SCAN and C-SCAN are generally preferred under heavy load because they are less likely to cause starvation, where some requests get stuck waiting indefinitely while the head keeps servicing others.

Makes sense.

Modern operating systems like Linux use more sophisticated schedulers that combine these ideas.

For instance, Linux's Deadline Scheduler prioritizes reads over writes.

Because reads often block the user.

Exactly.

And it uses LBA order, but also ensures no request waits too long; each one has a deadline.

For SATA drives, the Completely Fair Queuing, or CFQ, scheduler tries to anticipate I/O requests from different processes to minimize seeks and provide a more balanced experience across applications.

So it's pretty advanced now.

Very much so.

But when it comes to NVM scheduling, the story changes dramatically.

Because no moving parts.

Since there are no moving parts, the traditional disk scheduling algorithms we just covered, which are all about minimizing head movement, become far less relevant.

Seek time is negligible.

So what do they do?

For NVMs, the most common policy is basically first come, first served.

Often with small optimizations, like merging adjacent requests into a single operation.

Keep it simple.

Pretty much.

For instance, Linux's NOOP scheduler, standing for no operation, essentially just does FCFS with merging.

While read times on NVMs are generally uniform, write times can vary more because of those flash memory characteristics we discussed.

Like garbage collection and wear leveling kicking in.

Exactly.

That can introduce some variability.

And this leads to a massive performance contrast, right?

HDDs versus SSDs, it's almost like comparing a bicycle to a rocket ship for random access.

Absolutely.

For random I/O operations per second, or IOPS, HDDs might deliver a few hundred.

Maybe.

Okay.

SSDs, in stark contrast, can deliver hundreds of thousands of IOPS for random access.

Hundreds of thousands?

Wow.

The gap is enormous. For raw sequential throughput, like reading a big file...

Yeah.

The difference is still significant, but less dramatic, as HDDs can stream data efficiently once the head is in position.

Right.

Once it gets going.

But remember that NVM write performance can degrade as the device fills up or ages due to all that internal garbage collection and wear leveling.

That write amplification thing.

That's the one.

A direct result of write amplification.

So performance isn't always perfectly consistent.

Got it.

Okay.

Let's shift to keeping data safe.

Error management and structure.

Crucial stuff.

This brings us to error detection and correction, or ECC.

Data corruption is a real threat, whether it's a bit flipping in memory due to a cosmic ray, or a tiny flaw on a disk surface.

It happens.

It does.

Error detection is simply identifying that a problem has occurred.

Error correction goes a crucial step further and actually fixes it.

Okay.

Detect versus fix.

Simple methods include parity bits, where an extra bit helps detect a single bit flip in a byte.

Checksums, like CRCs used in networking, are more robust for detecting multiple bit errors.

I've heard of checksums.
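
A small sketch of both ideas: an even-parity bit that catches a single flipped bit but misses a double flip, and a CRC that catches the double flip as well. It uses `zlib.crc32` from the Python standard library; the byte value and helper names are arbitrary illustrations.

```python
import zlib


def even_parity_bit(byte: int) -> int:
    """Bit that makes the total count of 1s (data plus parity) even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_matches(byte: int, stored_parity: int) -> bool:
    """True when the recomputed parity agrees with the stored parity bit."""
    return even_parity_bit(byte) == stored_parity


original = 0b10110010                      # four 1 bits, so the parity bit is 0
stored_parity = even_parity_bit(original)
stored_crc = zlib.crc32(bytes([original]))

one_flip = original ^ 0b00000100           # a single bit flipped
two_flips = original ^ 0b00000110          # two bits flipped

print(parity_matches(one_flip, stored_parity))       # False: error detected
print(parity_matches(two_flips, stored_parity))      # True: error slips through
print(zlib.crc32(bytes([two_flips])) == stored_crc)  # False: the CRC catches it
```

Neither check can repair anything on its own; they only detect. Correction needs the extra redundancy that an error-correcting code carries.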

But ECC, used per sector in HDDs and per page in Flash, is truly powerful.

When data is written, an ECC value is calculated based on the data and stored alongside it.

Okay.

On read, that ECC is recalculated from the data read. If there's a mismatch...

It knows something's wrong.

It knows.

And for minor corruption, the ECC algorithm can actually pinpoint and correct the flipped bits.

We call that a soft error.

It heals itself.

In a way, yes.

If too many bits are changed, though, it's an uncorrectable hard error and data is likely lost.

This ability to self-heal minor errors is a key differentiator between, say, consumer-grade and enterprise-grade storage products.

Enterprise stuff needs more robust ECC.

Definitely.

The operating system is also responsible for fundamental storage device management.

Before any new device can store your files, it undergoes a low-level formatting process.

Usually at the factory.

Usually at the factory, yeah.

This divides the raw media into sectors or pages, adds necessary internal headers and trailers, and sets up that initial logical-to-physical mapping we talked about.

Okay.

Preparing the surface.

Exactly.

After that, the OS performs three main steps when you set up a drive.

First, it partitions the device.

Like C drive, D drive.

Kind of.

It divides the device into logical groups of blocks or pages.

Think of it like dividing a giant storage locker into several smaller, separate compartments.

One for your operating system, one for temporary swap files, another for your personal documents.

Each partition then acts like a separate logical drive.

Got it.

Second is volume creation.

This is where these partitions, or sometimes even multiple physical drives, can be combined into larger logical units, maybe like a RAID set.

Okay.

Grouping partitions or drives.

And third is logical formatting, or creating the file system on these partitions or volumes.

This is where the OS sets up its internal maps of free and allocated space, like a table of contents, and creates the initial empty directory structure.

So it can actually store files.

Right.

To improve efficiency, these blocks are often grouped into larger units called clusters or allocation units by the file system.

Okay.

Let's unpack this.

With all these layers, formatting, partitions, volumes, file systems, how does a computer actually start up from storage?

It seems like a complex dance to get to that login screen.

It absolutely is.

It's a neat process.

That's where the boot block comes in.

The very beginning.

Right.

When you hit the power button, a tiny initial program stored in your motherboard's firmware, usually NVM flash memory on the board itself, kicks in.

The BIOS or UEFI.

Exactly.

This bootstrap loader is just smart enough to find and load a more complex full bootstrap program from specific boot blocks on your main storage device.

Where are these boot blocks?

On traditional PC systems, this is often the master boot record, or MBR, located on the very first logical block of the drive.

The MBR contains both a small piece of executable boot code and the partition table.

And the map of partitions.

Right.

Which points to the actual partition where your operating system lives, the boot partition.

This MBR code then loads another boot program from the boot sector of that specific boot partition, and that program finally loads the full operating system kernel into memory.

Wow.

It's a chain reaction.

Firmware loads MBR, MBR loads partition boot sector, partition boot sector loads OS.

You got it.

A carefully orchestrated sequence to bring the whole system to life from storage.
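
To see what "boot code plus partition table" looks like in bytes, here is a sketch that builds a fake 512-byte MBR in memory and parses it. The classic layout is 446 bytes of boot code, four 16-byte partition entries, and a two-byte 0x55 0xAA signature. The sample partition values and function names are invented for illustration; nothing here reads a real disk.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446           # boot code occupies bytes 0..445
ENTRY_FORMAT = "<B3sB3sII"   # status, CHS start, type, CHS end, first LBA, sector count
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)   # 16 bytes


def build_sample_mbr() -> bytes:
    """Build a fake MBR holding one bootable partition (hypothetical values)."""
    mbr = bytearray(SECTOR_SIZE)
    entry = struct.pack(ENTRY_FORMAT, 0x80, b"\x00" * 3, 0x83, b"\x00" * 3,
                        2048, 1_000_000)
    mbr[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = entry
    mbr[510:512] = b"\x55\xAA"               # boot signature
    return bytes(mbr)


def parse_mbr(sector: bytes) -> list[dict]:
    """Return the non-empty partition entries found in an MBR sector."""
    if len(sector) != SECTOR_SIZE or sector[510:512] != b"\x55\xAA":
        raise ValueError("not a valid MBR sector")
    partitions = []
    for index in range(4):
        offset = TABLE_OFFSET + index * ENTRY_SIZE
        status, _chs_start, ptype, _chs_end, first_lba, count = struct.unpack(
            ENTRY_FORMAT, sector[offset:offset + ENTRY_SIZE])
        if ptype == 0:       # type 0 marks an unused table slot
            continue
        partitions.append({
            "index": index,
            "bootable": status == 0x80,
            "type": ptype,
            "first_lba": first_lba,
            "sector_count": count,
        })
    return partitions


for partition in parse_mbr(build_sample_mbr()):
    print(partition)
```

The entry flagged as bootable is the one whose boot sector the next stage of the chain would load.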

And what about when parts of the drive just stop working?

That's got to be a common problem, especially with spinning disks, bad blocks.

It's an inevitable reality, especially with HDDs.

Given the precision mechanics and tiny tolerances, individual sectors will occasionally become defective.

We call them bad blocks.

So what happens then?

Data lost?

On older disks?

Yeah.

You might have had to manually run a tool like badblocks in Linux to find and flag them, and any data on them was simply lost.

Not ideal.

Not at all.

But modern drives are much smarter.

Their built-in controllers maintain a bad block list internally.

If a logical block goes bad during operation, the controller automatically substitutes a spare sector for it, using a technique called sector sparing, or sometimes sector forwarding.

So it maps the bad block's address to a good spare one.

Exactly.

So if logical block 87 fails, the controller secretly reroutes requests for block 87 to a healthy spare sector, often located within the same cylinder, to minimize performance impact.

The OS never even knows.

Completely transparent.

Very cool.
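
A toy model of sector sparing, assuming logical block N normally lives at physical sector N and spares sit just past the end of the normal area. The class and its methods are hypothetical; real controllers pick spares by physical proximity and keep this table in firmware.

```python
class SparingController:
    """Toy model of sector sparing (hypothetical, for illustration only)."""

    def __init__(self, num_sectors: int, num_spares: int):
        # Assumption: logical block N normally maps to physical sector N,
        # and the spare sectors sit right after the normal ones.
        self.spares = list(range(num_sectors, num_sectors + num_spares))
        self.remap = {}                     # bad logical block -> spare sector

    def mark_bad(self, logical_block: int) -> int:
        """Forward a failed block to the next unused spare sector."""
        if logical_block in self.remap:
            return self.remap[logical_block]
        if not self.spares:
            raise RuntimeError("out of spare sectors")
        self.remap[logical_block] = self.spares.pop(0)
        return self.remap[logical_block]

    def physical_sector(self, logical_block: int) -> int:
        """Where a request for this logical block actually lands."""
        return self.remap.get(logical_block, logical_block)


controller = SparingController(num_sectors=100, num_spares=4)
controller.mark_bad(87)
print(controller.physical_sector(87))  # 100: rerouted to the first spare
print(controller.physical_sector(86))  # 86: untouched blocks map as before
```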

Another technique is sector slipping, where it might shift all subsequent logical blocks down by one to bypass the bad sector.

For NVMs, it's even simpler.

Because they have that over-provisioning.

Right.

They just use spare pages from their over-provisioned space to manage bad pages internally, with virtually no performance penalty.

The crucial distinction is between those soft errors, which are recoverable thanks to ECC or sparing.

The self-healing ones?

And hard errors, which mean the data in that block is genuinely lost, and you'll need that backup we talked about.

Right.

Backups again?

Always backups.

Okay.

Moving beyond the single computer, let's talk about how storage connects at a larger, more networked scale.

Where does it live?

Good question.

We can categorize this by how the storage is attached.

First, host-attached storage is what we've largely been discussing.

Local, connected directly to your computer via ports like SATA, USB, or Thunderbolt.

This is your personal hard drive or SSD.

Simple enough.

Then there's network-attached storage, or NAS.

This provides access to storage across a network, typically your home or office LAN.

Like a shared drive on the network.

Exactly.

You access it using file-sharing protocols like NFS for Linux or CIFS/SMB for Windows, which present the storage as a shared file system.

A more advanced option is iSCSI.

iSCSI.

Yep, iSCSI.

It wraps the SCSI storage protocol, the commands disks understand, inside IP network packets.

This makes network storage appear to your computer as a directly attached device made of logical blocks.

So it looks like a local drive, but it's over the network.

Pretty much.

NAS is incredibly convenient for sharing, but generally less efficient than direct-attached storage due to network overhead and protocol translation.

Makes sense.

Next, cloud storage.

This accesses data over the wider internet or a wide area network, often on a subscription basis.

Dropbox, Google Drive, S3.

Exactly.

Unlike NAS, cloud storage is typically API-based.

Applications use specific programming interfaces to access services like Amazon S3, Dropbox, or iCloud.

This design is built to handle the higher latency and more frequent failures of internet connections, so an application can pause access until the connection returns rather than just fail outright.

Designed for the wild internet.

Finally, we have storage area networks, or SANs.

Okay, sounds serious.

These are generally found in data centers.

They are private, high-speed networks, often using specialized protocols like Fibre Channel or iSCSI, that connect multiple servers to dedicated storage units called storage arrays.

So multiple computers sharing a big pool of storage over a fast, private network.

You got it.

The power of SANs lies in their flexibility.

Multiple servers can share access to centralized storage, and storage can be dynamically allocated as needed.

A storage array itself is a purpose-built device.

A big box full of drives.

A very smart box full of drives, yeah.

With its own controllers, drives, and sophisticated software that implements advanced features like RAID, snapshots, data replication, and more.

They can be all flash for maximum performance, or a mix of SSDs and HDDs for different tiers of speed and cost.

Very enterprise-focused.

Definitely.

Here's where it gets really interesting for reliability.

RAID, or Redundant Arrays of Independent Disks.

Why does it even exist, and what's its purpose?

Because you mentioned earlier, more drives in a system actually mean more potential failure points, not fewer, right?

You've hit on the core paradox, and that's exactly why RAID was invented.

Originally the I stood for inexpensive, the idea being to use many small, cheap disks as an alternative to one large, expensive, and maybe more reliable mainframe disk.

Ah, cost saving.

Initially, yes.

But today the I stands for independent, because the main goals are improving data transfer rates through parallelism, and critically improving reliability through redundancy.

Think about that failure rate.

If a single disk has a mean time between failures, or MTBF, of 100,000 hours, it sounds great, over 11 years.

Yeah.

But in an array of 100 such disks, the expected time until some disk fails is only 1,000 hours.

Wait, 100,000 divided by 100 is 1,000 hours. That's only like 41 days.

Exactly.

Unacceptable for critical data.

Something is almost guaranteed to fail fairly quickly in a large array.
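
That back-of-the-envelope division is easy to reproduce. A minimal sketch, assuming the drives fail independently and at the same constant rate; the function name is illustrative.

```python
def hours_until_some_disk_fails(disk_mtbf_hours: float, num_disks: int) -> float:
    """Expected time until the first failure among independent, identical disks."""
    return disk_mtbf_hours / num_disks


single_disk_years = 100_000 / (24 * 365)
array_hours = hours_until_some_disk_fails(100_000, 100)

print(round(single_disk_years, 1), "years for one disk")             # 11.4
print(array_hours, "hours for a 100-disk array")                     # 1000.0
print(round(array_hours / 24, 2), "days for a 100-disk array")       # 41.67
```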

Okay, so you need redundancy.

You absolutely need redundancy.

The solution is to store extra information, redundant data that isn't normally needed but can be used to rebuild lost data if a drive fails.

How do they do that?

The simplest and often most expensive approach is mirroring, or RAID level 1.

Just make two copies of everything.

Exactly.

Every drive is duplicated onto a second drive.

If one fails, data is simply read from the other, the mirror.

The mean time to data loss for a mirrored system can be astronomically high.

Because both would have to fail.

Right.

Assuming independent failures, the chance of both failing simultaneously is tiny.
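
To put a rough number on "astronomically high", a common approximation is that a mirrored pair loses data only if the second drive fails while the first is still being replaced. The sketch below assumes independent failures and a 10-hour repair time; that repair time is an assumption for illustration and is not stated in this discussion.

```python
def mirrored_mttdl_hours(disk_mtbf_hours: float, repair_hours: float) -> float:
    """Approximate mean time to data loss for a two-way mirror.

    Data is lost only if the second drive fails during the repair
    window of the first, giving MTBF^2 / (2 * MTTR).
    """
    return disk_mtbf_hours ** 2 / (2 * repair_hours)


mttdl = mirrored_mttdl_hours(100_000, repair_hours=10)  # 10 h is assumed
print(f"{mttdl:.0f} hours, about {mttdl / (24 * 365):,.0f} years")
```

Under those assumptions the result is 500 million hours, tens of thousands of years, compared with about 11 years for a single drive.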

The key insight here: RAID 1 is like having an identical twin for every piece of data. Ultimate safety, but you're paying double for storage capacity.

Twice the cost for the same usable space.

Pretty much.

Now, beyond reliability, RAID also offers performance benefits through parallelism, using what's known as data striping.

Spreading data across drives?

Yes.

In block-level striping, blocks of a file are distributed across multiple drives.

The goals are twofold.

Increase throughput for many small accesses by load balancing them across drives.

So multiple requests can happen at once.

Right.

And reduce response time for large accesses, like reading a huge video file, by allowing parallel reads or writes from multiple drives simultaneously.

Like a wider pipe for data?

That's a good way to put it.

Imagine a giant file being split into chunks and read from several drives at once much faster.
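
A minimal sketch of block-level striping, assuming the simplest round-robin layout: logical block i goes to drive i mod N. The function name is hypothetical, and real arrays usually stripe in larger chunks than a single block.

```python
def locate_block(logical_block: int, n_drives: int) -> tuple[int, int]:
    """Map a logical block number to (drive index, block offset on that drive)."""
    drive = logical_block % n_drives
    offset = logical_block // n_drives
    return drive, offset


# Eight consecutive blocks of a file spread over four drives.
for block in range(8):
    drive, offset = locate_block(block, n_drives=4)
    print(f"logical block {block} -> drive {drive}, offset {offset}")
```

Blocks 0 through 3 land on four different drives, so a large sequential read can pull from all of them at once.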

So when we talk about the different RAID levels, 0, 1, 5, 6, 10, what are the most important distinctions for someone who wants to understand their trade-offs without getting lost in the technical weeds?

Good question.

Let's focus on the key principles and the most common levels.

The speed demon.

RAID level 0 striping.

This stripes data across multiple drives purely for performance.

There's no redundancy.

None at all.

Zero.

If any single drive fails, all data across the entire array is lost.

Ouch.

High risk.

Very high risk.

The key insight here.

RAID 0 is the ultimate high-speed gamble: all performance, zero safety net.

Use it only for non-critical, temporary data, like video-editing scratch space, maybe.

The safety net.

RAID level 1 mirroring.

As we discussed, this duplicates every drive.

High reliability, fast reads, since you can read from either mirror, but expensive, as it doubles the storage required.

It's for crucial data where immediate recovery and simplicity are paramount.

Okay.

Speed versus safety.

What about the others, like RAID 5?

The efficient protector.

RAID level 5 distributed parity.

This is super common.

Instead of full duplication, it uses parity.

Parity?

What's that?

Think of it like a calculated checksum, but one that allows reconstruction.

For every stripe of data blocks across the drives, say on N drives, a single parity block is calculated, like an XOR sum, and written to one of the drives.

If any one drive fails, you can use the data from the remaining drives plus the parity block to mathematically reconstruct the data from the failed drive.
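
Here is a small sketch of that XOR trick in Python. It is a toy with three tiny data blocks and made-up contents, not how any particular RAID controller is implemented, but the reconstruction step is the same idea.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data drives
parity = xor_blocks(data)            # stored on a fourth drive

# Drive 1 fails: XOR the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)                       # the contents of the lost block
```

It works because XOR is its own inverse: XORing the parity with every surviving block cancels them out and leaves only the missing one.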

Wow.

Okay.

So you only need one extra drive's worth of space for protection.

Essentially, yes, for single drive failure protection.

RAID 5 cleverly distributes this parity block across all the drives in the array so no single drive becomes a bottleneck for writes.
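
One way to picture the rotation is below. The rule used here, parity for stripe s on drive s mod N, is an assumed layout chosen for simplicity; real implementations use several different rotation schemes.

```python
def parity_drive(stripe: int, n_drives: int) -> int:
    """Drive holding the parity block for a stripe (assumed simple rotation)."""
    return stripe % n_drives


def stripe_layout(stripe: int, n_drives: int) -> list[str]:
    """Label each drive's block in a stripe: 'P' for parity, 'D0', 'D1', ... for data."""
    p = parity_drive(stripe, n_drives)
    layout, data_index = [], 0
    for drive in range(n_drives):
        if drive == p:
            layout.append("P")
        else:
            layout.append(f"D{data_index}")
            data_index += 1
    return layout


for stripe in range(4):
    print(stripe, stripe_layout(stripe, n_drives=4))
```

Over four stripes, each of the four drives holds the parity block exactly once, so parity writes are spread evenly.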

The key insight.

RAID 5 offers a good balance of cost efficiency, uses less space than mirroring, and strong protection against single drive failures.

Great for large storage needs.

And RAID 6.

More protection.

The double protector.

RAID level 6 dual parity.

Exactly.

RAID 6 takes it a step further.

It calculates two independent parity blocks for each stripe using more complex math, like Reed-Solomon codes.

So it can lose two drives.

Correct.

RAID 6 can tolerate the failure of any two drives in the array simultaneously and still rebuild the data.

This is increasingly important as drives get larger and rebuild times get longer.

You don't want a second drive failing during a long rebuild.

The insight.

Even better protection than RAID 5, at the cost of slightly more storage overhead, two drives' worth for parity, and potentially slower writes due to more complex calculations.

What about combinations like RAID 10?

The best of both worlds.

Right.

1 plus 0 or RAID 10.

This combines mirroring and striping.

You first create mirrored pairs, RAID 1, for safety.

And then you stripe the data across these pairs, RAID 0, for speed.

So mirror first, then stripe.

Right.

It offers the high reliability of RAID 1 and the high performance, especially write performance, of RAID 0.

It's often used for databases.

The downside.

It's expensive, requiring double the drives like RAID 1.

There's also RAID 0 plus 1, stripe, then mirror, which is slightly different in how it handles failures.

RAID 10 is generally preferred.
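
The difference in failure handling can be sketched by brute force for a four-drive array. This uses a deliberately simplified model, which is an assumption: in RAID 0+1, any failed drive takes its whole stripe set offline, while in RAID 10 only the affected mirrored pair is degraded.

```python
from itertools import combinations

# Drives 0-3. The same grouping serves as the mirrored pairs of RAID 10
# and as the two stripe sets of RAID 0+1.
GROUPS = [(0, 1), (2, 3)]


def raid10_survives(failed: set) -> bool:
    """RAID 10: every mirrored pair must keep at least one working drive."""
    return all(any(d not in failed for d in pair) for pair in GROUPS)


def raid01_survives(failed: set) -> bool:
    """RAID 0+1 (simplified): at least one stripe set must be fully intact."""
    return any(all(d not in failed for d in stripe_set) for stripe_set in GROUPS)


two_drive_failures = [set(f) for f in combinations(range(4), 2)]
print("RAID 10 survives:", sum(raid10_survives(f) for f in two_drive_failures), "of 6")
print("RAID 0+1 survives:", sum(raid01_survives(f) for f in two_drive_failures), "of 6")
```

In this model RAID 10 survives four of the six possible two-drive failures and RAID 0+1 survives only two, which is one reason RAID 10 tends to be preferred.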

Okay.

That gives a good overview.

And RAID can be implemented in software within the operating system, on a dedicated hardware card, a host bus adapter or HBA, or directly within that specialized storage array we mentioned.

Many systems also include hot spares.

Ready-to-go replacement drives.

Exactly.

Idle drives that automatically kick in and start rebuilding the array if a primary drive fails, reducing the window of vulnerability.

How do system designers choose the right RAID level?

It seems like a complex decision with lots of trade-offs.

It absolutely is.

They weigh several factors.

The cost per gigabyte. The required performance, especially the read versus write patterns. And, critically, the acceptable level of reliability and how quickly they need to be able to rebuild after a failure.

Rebuild time is a factor.

Huge factor.

Rebuilding a large RAID 5 or 6 array can take hours, even days, and the array is often slower and vulnerable during that time.

RAID 1 is the fastest to rebuild.

Just copy the mirror.

RAID 5 is a good balance for moderate data volumes.

RAID 6 is increasingly common in large storage arrays for its superior protection against multiple failures.
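
The cost side of that trade-off is easy to tabulate. The sketch below assumes equal-size drives, two-way mirrors, and an even drive count; the 8-drive, 4 TB configuration is just an example, not a figure from the discussion.

```python
def usable_capacity_tb(level: str, n_drives: int, drive_tb: float) -> float:
    """Usable capacity for common RAID levels (equal drives, two-way mirrors)."""
    data_drives = {
        "RAID 0": n_drives,        # no redundancy
        "RAID 1": n_drives // 2,   # every drive mirrored
        "RAID 5": n_drives - 1,    # one drive's worth of parity
        "RAID 6": n_drives - 2,    # two drives' worth of parity
        "RAID 10": n_drives // 2,  # striped mirrored pairs
    }[level]
    return data_drives * drive_tb


for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 6", "RAID 10"):
    print(f"{level}: {usable_capacity_tb(level, n_drives=8, drive_tb=4)} TB usable of 32 TB raw")
```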

So no single best answer.

Definitely not.

It's always a trade -off.

Speed, cost, or protection, you usually get to pick two out of three.

Right.

Now, while RAID protects against physical drive failures, it's crucial to understand its limitations.

It's not a magic bullet.

Far from it.

RAID does not protect against corrupted pointers within your file system; torn writes, where a write is only partially completed due to power loss; controller failures; or software bugs in the RAID implementation itself.

So the RAID controller could mess up and lose data.

It could.

A bug in the RAID software or a faulty controller could still lead to total data loss, despite the redundancy of the physical drives.

RAID protects the hardware, not necessarily the logical integrity of the data on it.

What's fascinating here is how cutting-edge file systems like ZFS and Btrfs address these problems that traditional RAID alone can't handle.

It feels like they're taking reliability to a whole new level.

They absolutely are.

ZFS and Btrfs take a fundamentally different approach to data integrity.

They use internal checksums for all blocks, both your actual data and the file system's own metadata, the pointers, directories, et cetera.

Checksums for everything.

Everything.

And crucially, these checksums are stored with the pointer to the block, not within the block itself.

Why is that important?

Because if you read a block and its contents don't match the checksum stored in the pointer, the file system knows immediately that the data is corrupt before it even gives it to you.

It detects corruption at the source.

Exactly.

And if that block has a mirror copy (ZFS and Btrfs handle their own mirroring and parity, similar to RAID), it can automatically retrieve the good copy, fix the bad one, and report the event, silently correcting errors that traditional file systems wouldn't even notice.
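
A toy model of that idea is sketched below. It is not ZFS or Btrfs code; the class names are invented, SHA-256 stands in for whatever checksum a real file system uses, and the two dictionaries stand in for two mirrored drives. The point is only that the checksum lives in the pointer, so a bad block can be detected and healed from its mirror.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class BlockPointer:
    address: int
    checksum: str   # kept with the pointer, not inside the block it describes


class MirroredStore:
    def __init__(self):
        self.copies = [{}, {}]   # two mirrored "drives"
        self.repairs = 0

    def write(self, address: int, data: bytes) -> BlockPointer:
        for copy in self.copies:
            copy[address] = data
        return BlockPointer(address, checksum(data))

    def read(self, ptr: BlockPointer) -> bytes:
        for copy in self.copies:
            data = copy[ptr.address]
            if checksum(data) == ptr.checksum:
                # Found a good copy: overwrite any mirror that disagrees.
                for other in self.copies:
                    if other[ptr.address] != data:
                        other[ptr.address] = data
                        self.repairs += 1
                return data
        raise IOError("all copies are corrupt")


store = MirroredStore()
ptr = store.write(7, b"hello")
store.copies[0][7] = b"hellx"    # simulate silent corruption on one drive
print(store.read(ptr), "repairs:", store.repairs)
```

The read returns the good data from the second copy and quietly rewrites the damaged one, which is the "self-healing" behavior described here.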

Wow.

That's like the self -healing file system.

Pretty much.

Think of it like a smart librarian who always checks the book's integrity using an external record before giving it to you, and if it's damaged, immediately grabs a perfect copy from the back.

This provides a much higher level of end-to-end data consistency.

That sounds way better than just relying on RAID.

It addresses different kinds of errors.

Another common problem with traditional storage setups is inflexibility.

RAID sets often create fixed-size volumes.

Like a partition you can't easily resize.

Right.

Which can lead to wasted space if you over-allocate, or prevent easy growth if you under-allocate.

Resizing them often requires complex, disruptive operations.

Those are real headaches.

ZFS and Btrfs solve this by combining the file system and volume management into a unified pool of storage.

A pool?

Yeah, you just give ZFS or Btrfs a bunch of disks, and it creates a storage pool.

All free space in the pool is shared and dynamically allocated across all file systems you create within it, similar to how memory is managed using malloc and free in programming.

So no fixed partitions within the pool?

Nope.

This eliminates artificial limits, allowing file systems to grow and shrink on demand without complex repartitioning or moving data around.
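
As a rough illustration of the pooling idea, the sketch below tracks one shared free-space figure for every file system in a pool. The class and method names are hypothetical and this is only bookkeeping, not how ZFS or Btrfs allocate blocks internally.

```python
class StoragePool:
    """Toy pool: all file systems draw from one shared free-space figure."""

    def __init__(self, disk_sizes_gb):
        self.capacity = sum(disk_sizes_gb)
        self.used = {}                      # file system name -> GB in use

    @property
    def free(self) -> int:
        return self.capacity - sum(self.used.values())

    def create_fs(self, name: str) -> None:
        self.used[name] = 0                 # no size is fixed up front

    def write(self, name: str, gb: int) -> None:
        if gb > self.free:
            raise RuntimeError("pool is full")
        self.used[name] += gb

    def delete(self, name: str, gb: int) -> None:
        self.used[name] -= min(gb, self.used[name])


pool = StoragePool([100, 100])              # two 100 GB disks
pool.create_fs("home")
pool.create_fs("logs")
pool.write("home", 150)                     # larger than any single disk
pool.write("logs", 50)
pool.delete("home", 100)                    # freed space returns to the pool
print(pool.free, "GB free for any file system")
```

Neither file system was given a size in advance; each simply grows into, and gives back to, the shared pool.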

Much more flexible.

That sounds incredibly useful.

It really is.

Finally, let's look at object storage.

The cloud's foundation.

Okay, something different again.

A very different paradigm from traditional file systems, which organize data hierarchically in folders and files.

In object storage, data isn't put in a named file in a specific folder.

Instead, data is simply placed into a massive flat storage pool and assigned a unique identifier, an object ID.

Just a giant bucket of data with ID tags.

Essentially, yes.

You access the data using this unique ID, not a file name or path.

It's more computer-oriented than user-oriented.

What are the basic operations?

Super simple.

Create an object, you put data in, you get a unique ID back.

Access an object using its ID.

Delete an object using its ID.

That's pretty much it at the core.
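
Those three operations fit in a few lines. This in-memory sketch uses a random UUID as the object ID, which is an assumption for illustration; it does not mirror the API of any real object storage service.

```python
import uuid


class ObjectStore:
    """Toy flat object store: no folders, no file names, just IDs."""

    def __init__(self):
        self._objects = {}

    def create(self, data: bytes) -> str:
        object_id = uuid.uuid4().hex        # unique ID handed back to the caller
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def delete(self, object_id: str) -> None:
        del self._objects[object_id]


store = ObjectStore()
oid = store.create(b"holiday photo bytes")
print(oid, store.get(oid))
store.delete(oid)
```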

How is it managed and protected then, if not with file systems and RAID?

Management software, like the Hadoop Distributed File System (HDFS) or Ceph, handles where these objects are physically stored, across potentially thousands of commodity servers and disks.

Protection is usually done via replication.

Making multiple copies.

Exactly.

Instead of complex RAID calculations, it just stores, say, three copies of each object on three different computers, or even in different racks or data centers.

If one copy is lost, it just uses another one and makes a new copy in the background.
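
A minimal sketch of three-way replication with background repair follows. The placement rule, a CRC of the object ID picking three consecutive nodes, is an assumption made to keep the example short; systems like HDFS and Ceph use far more sophisticated, rack-aware placement.

```python
import zlib


class ReplicatedStore:
    def __init__(self, n_nodes: int, replicas: int = 3):
        self.nodes = [{} for _ in range(n_nodes)]   # each dict is one server
        self.replicas = replicas

    def put(self, object_id: str, data: bytes) -> None:
        start = zlib.crc32(object_id.encode()) % len(self.nodes)
        for i in range(self.replicas):
            self.nodes[(start + i) % len(self.nodes)][object_id] = data

    def holders(self, object_id: str) -> list:
        return [i for i, node in enumerate(self.nodes) if object_id in node]

    def get(self, object_id: str) -> bytes:
        for node in self.nodes:                      # any surviving copy will do
            if object_id in node:
                return node[object_id]
        raise KeyError(object_id)

    def repair(self, object_id: str) -> None:
        """Re-create missing copies until the replica count is restored."""
        data = self.get(object_id)
        for node in self.nodes:
            if len(self.holders(object_id)) >= self.replicas:
                break
            node.setdefault(object_id, data)


store = ReplicatedStore(n_nodes=5)
store.put("photo-1", b"pixels")
lost = store.holders("photo-1")[0]
store.nodes[lost].clear()                            # one server dies
print(store.get("photo-1"), store.holders("photo-1"))
store.repair("photo-1")
print(store.holders("photo-1"))
```

After the simulated failure the object is still readable from two nodes, and the repair step brings it back up to three copies.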

Simpler, maybe, but needs more raw space.

Potentially, yes.

But it makes the system incredibly cost-effective and horizontally scalable.

You just keep adding more standard computers and disks to the pool to expand capacity to petabytes or even exabytes.

Just add more nodes.

Right.

A key feature is that objects are often self-describing and content-addressable.

They can carry their own metadata, and you can sometimes find them based on their content, not just their ID.

They excel at storing massive amounts of unstructured data: photos, videos, backups, logs, virtual machine images.

So what does this all mean for you, the everyday user?

You might not directly interact with object storage APIs, but it sounds like it underpins a massive amount of the digital world we use.

It absolutely does.

Think about it.

Google's search index, your Dropbox files, Spotify's entire music catalog, Facebook photos, Netflix videos, Amazon S3, which powers countless websites and apps.

All that stuff is likely on object storage.

Almost certainly.

It's the invisible, incredibly scalable backbone of modern cloud services.

It enables them to handle colossal amounts of data efficiently and reliably, without you ever having to think about where the cloud actually is, or dealing with partitions or RAID levels.

Wow.

That puts it in perspective.

What an incredible journey we've taken today.

From the delicate, almost microscopic dance of a spinning hard drive's head.

Yeah, flying just nanometers above the surface.

To the intricate algorithms managing flash wear and tear, the clever scheduling tricks, the layers of protection with ECC and RAID.

And then on to these fundamentally different approaches like ZFS, Btrfs, and the massive scale of object storage underpinning the cloud.

It really gives you a much deeper appreciation, doesn't it?

For the complex systems working tirelessly behind the scenes just to store, retrieve, and safeguard your digital life.

Absolutely.

These concepts, storage structure, scheduling, error correction, redundancy, they are foundational to virtually everything you do on a computer, phone, or even in a smart appliance these days.

It all needs to store data somewhere, somehow.

And it makes you wonder, doesn't it, what's next for mass storage?

We've gone from mechanical spinning rust to electrical flash to network drives to these distributed object systems.

The pace of change is incredible.

Will we see truly intelligent storage that anticipates your needs before you even ask?

Or perhaps even entirely new paradigms like quantum storage or maybe even biological storage using DNA.

Who knows?

The evolution definitely never stops.

It's an exciting field.

It really is.

Well, thank you for joining us on this deep dive into mass storage structure.

We hope you feel more informed and maybe a little more intrigued about the hidden complexities of your digital world.

Hope it was useful.

Until next time, keep digging.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Mass-storage systems form a critical layer in operating system architecture, enabling persistent data retention and efficient access to large volumes of information. Understanding the physical and logical structures of storage devices is fundamental to grasping how operating systems manage data at scale. Hard disk drives operate through mechanical principles involving spinning platters, read-write heads positioned above concentric tracks, and sectors that hold data blocks. The time required to access data involves three components: seek time represents the movement of the head across tracks, rotational delay accounts for waiting for the desired sector to rotate beneath the head, and transfer time is the duration needed to move data between the device and memory. Operating systems employ various disk scheduling algorithms to optimize access patterns and minimize overall latency. First-come-first-served scheduling processes requests in arrival order but often results in excessive head movement. Shortest-seek-time-first reduces travel distance by servicing the request nearest the current head position, though it can starve distant requests. The SCAN algorithm moves the head systematically across the disk surface, servicing all requests in one direction before reversing, while C-SCAN services requests in one direction only, returning to the start of the disk without servicing requests on the way back. LOOK and C-LOOK variants optimize these approaches by moving only to the furthest request in each direction rather than traversing the entire disk. Solid-state drives eliminate mechanical components, delivering significantly lower latency and higher throughput than traditional disks, but face wear-leveling challenges since flash cells degrade with repeated write cycles. RAID technology improves reliability and performance by distributing data across multiple independent disks using techniques such as mirroring and parity calculations.
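
The difference between the first two scheduling policies in the paragraph above can be checked with a short sketch. The request queue and starting head position below are illustrative values chosen for this example, not figures from the summary.

```python
def fcfs_head_movement(head: int, requests: list) -> int:
    """Total cylinders traveled when servicing requests in arrival order."""
    total = 0
    for cylinder in requests:
        total += abs(cylinder - head)
        head = cylinder
    return total


def sstf_head_movement(head: int, requests: list) -> int:
    """Total cylinders traveled when always servicing the nearest pending request."""
    pending, total = list(requests), 0
    while pending:
        nearest = min(pending, key=lambda c: abs(c - head))
        total += abs(nearest - head)
        head = nearest
        pending.remove(nearest)
    return total


queue = [98, 183, 37, 122, 14, 124, 65, 67]   # illustrative cylinder requests
print("FCFS:", fcfs_head_movement(53, queue), "cylinders")
print("SSTF:", sstf_head_movement(53, queue), "cylinders")
```

For this queue, FCFS moves the head 640 cylinders and SSTF 236, which shows why arrival-order service tends to waste head movement.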
Beyond scheduling, storage management encompasses low-level formatting that prepares disk surfaces, partitioning that divides physical disks into logical units, and mounting that integrates storage devices into the file system hierarchy. Swap-space management extends physical memory capacity through virtual memory, allowing the system to store less-frequently-used memory pages to disk. Tertiary storage devices including optical media and magnetic tape provide cost-effective archival capabilities for infrequently accessed data. Effective storage management balances speed, reliability, and cost through careful consideration of buffering strategies, caching policies, and hierarchical storage organization.


Related Chapters