Chapter 4: Availability – Ensuring Reliable Systems
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Dive.
We're here to break down complex topics into stuff you can actually use.
And today, we're tackling something really core to building systems.
It's this idea from our sources that, well, technology does not always rhyme with perfection and reliability.
Yeah, that's putting it mildly.
Sometimes things break.
Exactly.
So we're doing a deep dive into software availability, looking at the architectural thinking that accepts that reality and plans for it.
How do architects make systems resilient, not just reliable?
Right.
Our mission today is really to get past just not breaking.
We need to focus on recovery.
Availability, fundamentally, it's the property that the software is there and ready to carry out its task when you need it to be.
It definitely builds on reliability.
You need that foundation, but it adds this critical piece, fast, often automatic recovery.
The goal is really to mask faults so they don't bubble up into failures users actually see.
Masking faults.
I like that.
Okay, so our plan today: first, let's get the language right.
The core definitions, the math behind uptime.
Then we'll look at the structure that architects use, the availability scenario.
And finally, we'll dig into the practical side, the tactics, the patterns, things like redundant spares, circuit breakers that engineers use to actually hit those, you know, demanding five nines availability targets.
Sounds good.
So let's start with that chain of events, the path towards things going wrong.
Architects have specific terms here.
A fault, that's the root cause.
Could be hardware, like a bad chip, or a software bug, maybe even something external.
Right.
The initial problem.
Exactly.
That fault can lead to an error.
That's an intermediate state where the system's internal condition deviates from what it should be.
Okay, so it's wrong internally, but maybe not visible yet.
Precisely.
The final stage is the failure.
That's when the deviation becomes externally visible, the system isn't behaving according to its specification, and someone or something notices.
So the whole game in availability architecture is stopping that progression.
Preventing a fault from causing an error that becomes a failure.
That's the core idea, yeah.
Intervention.
And, you know, while we're mostly talking about internal issues, availability definitely touches on other areas, like a denial of service attack.
That's a security problem, but its goal is an availability failure.
Makes sense.
Or think about performance.
If your system is incredibly slow, like egregiously slow, as the source puts it, it might as well be down for the user, right?
So there's a link there too.
Which brings us nicely to measuring this stuff.
The uptime promises.
High availability means hitting specific numbers, doesn't it?
Using that famous formula.
The steady state availability formula, yeah.
It's mean time between failures, MTBF, divided by the sum of MTBF plus mean time to repair, MTTR.
MTBF over MTBF plus MTTR.
Simple looking, but.
But it holds the whole challenge.
You need to maximize the time between things breaking, that's MTBF.
And you need to minimize the time it takes to fix it or recover, that's MTTR.
And for really high availability, that MTTR has to be tiny, like near zero.
Recovery has to be automatic and super fast.
Exactly.
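Written out, the steady-state formula described above looks like this. The worked numbers (an MTBF of 1,000 hours and an MTTR of 1 hour) are illustrative assumptions, not figures from the source.

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},
\qquad \text{e.g.}\quad
\frac{1000\ \text{h}}{1000\ \text{h} + 1\ \text{h}} \approx 0.9990
```

With MTBF held fixed, the only way to push A toward 1 is to shrink MTTR, which is why the discussion keeps returning to fast, automatic recovery.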
And I love the contrast in the source material about MTTR.
It's not always just code, is it?
It could be almost imperceptible.
Or it could be the time it takes to fly to a remote location in the Andes to repair a piece of mining machinery.
Yeah, that really drives home the variability.
MTTR isn't just about algorithms.
It can be a huge physical logistical problem.
And it highlights why observability matters so much.
If the system was down for hours, but nobody noticed.
Did it fail?
Architecturally, yes.
Because it could have been observed failing.
Right.
Okay, so let's talk targets.
People throw around five nines, that's 99.999% availability.
What does that actually mean in practice?
Over a whole year, five nines means you can only have, let's see, five minutes and 15 seconds of unscheduled downtime total.
Wow.
That's not a lot of wiggle room.
Not at all.
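The five-nines figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are made up for illustration, and it assumes a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_seconds(avail):
    """Unscheduled downtime per year allowed by a given availability target."""
    return (1.0 - avail) * SECONDS_PER_YEAR


budget = annual_downtime_seconds(0.99999)  # the five-nines target
minutes, seconds = divmod(round(budget), 60)
print(f"{minutes} min {seconds} s")  # prints "5 min 15 s"
```

The same function shows how quickly the budget grows as nines are dropped: each nine removed multiplies the allowed downtime by ten.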
And that expectation gets formalized in the service level agreement, the SLA.
But here's a crucial bit of detail.
SLAs often exclude scheduled downtime.
Ah, so maintenance windows don't count against your five nines?
Often, no.
That's how companies can take systems offline for planned work and still claim very high availability.
It's defined out of the equation contractually.
Okay.
That's a key distinction.
So if the math is that demanding, how do architects even begin to plan for this?
You can't just start coding redundancy, right?
There must be a structured way.
There is.
It's called the availability general scenario.
It's a formal template really for capturing the specific availability requirements for your system.
A template.
Okay.
How does it work?
It forces you to think through potential faults systematically.
You start with the source, where could the fault come from?
Hardware, software, people, even the environment.
Right.
Then you define the stimulus.
What is the fault?
Is it an omission, like a message not sent?
Yeah.
A crash, incorrect timing, wrong data.
Okay.
Source and stimulus.
What else?
You specify the artifact.
What part of the system is actually affected?
Processors, memory, storage, network links, then the environment.
When does this happen?
During normal operation or maybe during startup, shutdown, or even repair.
And the most important part, I guess, is the response.
Absolutely.
The response.
What must the system do when this fault occurs?
Should it prevent the fault entirely?
Detect it?
Log it?
Notify someone?
Or recover from it?
Maybe mask it completely?
Maybe degrade gracefully?
That structure seems really valuable.
The example they gave was good.
A server in a server farm fails during normal operation.
And the system informs the operator and continues to operate with no downtime.
Yeah.
That's a perfect concrete scenario.
It maps directly.
Source is hardware server.
Stimulus is a crash.
Environment is normal operation.
And the response is crucial: notify, but keep running with no service impact.
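One way to keep scenarios like this from staying informal is to record them as structured data. The sketch below uses the five parts named in the conversation (source, stimulus, artifact, environment, response). The class name and field names are illustrative assumptions, and the artifact value is a reading of the server farm example rather than something stated outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityScenario:
    """Hypothetical record type mirroring the parts of the general scenario."""
    source: str       # where the fault comes from
    stimulus: str     # what kind of fault it is
    artifact: str     # which part of the system is affected
    environment: str  # operating mode when the fault occurs
    response: str     # what the system must do about it


# The server farm example, filled in.
server_crash = AvailabilityScenario(
    source="hardware (a server)",
    stimulus="crash",
    artifact="a server in the server farm",  # assumption: inferred from the example
    environment="normal operation",
    response="inform the operator and continue to operate with no downtime",
)

print(server_crash.response)
```

Writing requirements this way makes gaps obvious: a scenario with an empty response field is not yet a requirement.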
So that planning step, using the scenario, justifies all the technical choices that come next.
Precisely.
Once you have those requirements defined, then you start selecting the tools, the availability tactics.
These are the building blocks architects use.
And they generally fall into three buckets.
Fault detection, fault recovery, and fault prevention.
Okay.
Let's take detection first.
Got to spot the problem fast to keep that MTTR low.
What are the common tactics?
The watchdog comes to mind.
Yeah.
The watchdog timer is a classic.
A component, the watchdog, has a timer.
The process it's monitoring has to periodically reset that timer.
They call it petting the watchdog.
Petting the watchdog.
Okay.
If the monitored process hangs or crashes, it stops petting, the timer expires, and bang, the watchdog signals a fault.
It's great for detecting processes that just stop: omission faults.
Simple but effective.
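The watchdog idea can be sketched in a few lines. This is a simplified illustration, not a production watchdog: real ones are often hardware timers or separate processes. The class name and the injectable clock parameter are assumptions added so the demo runs without real waiting.

```python
import time


class Watchdog:
    """Signals a fault if the monitored process stops resetting the timer."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_pet = clock()

    def pet(self):
        """Called periodically by the monitored process to reset the timer."""
        self._last_pet = self._clock()

    def expired(self):
        """True once the process has been silent for longer than the timeout."""
        return self._clock() - self._last_pet > self._timeout_s


# Demo with a fake clock so no real time has to pass.
now = [0.0]
dog = Watchdog(timeout_s=2.0, clock=lambda: now[0])

now[0] = 1.5
dog.pet()                    # the process checks in at t = 1.5
now[0] = 3.0
healthy = not dog.expired()  # 1.5 s of silence: still within the timeout
now[0] = 4.0
fault = dog.expired()        # 2.5 s of silence: the watchdog fires
```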
What about active checks, like pinging?
Right.
So you have ping/echo.
That's an active, asynchronous request/response exchange.
You send a ping, you expect an echo back.
Good for checking if something is reachable and measuring round trip time.
And how's that different from a heartbeat?
Heartbeat is more passive, usually.
It's a periodic message sent by the component saying, I'm still alive.
Sometimes it's a dedicated message.
Sometimes it's piggybacked on other traffic to save bandwidth.
Ping/echo is asking, are you there?
Heartbeat is saying, I'm here.
Okay.
Active probe versus passive announcement.
Got it.
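The difference in direction between the two tactics shows up clearly in code. In this sketch the monitor for heartbeats only listens, while the ping function initiates the exchange. All names here are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the transport is stubbed out with plain callables.

```python
class HeartbeatMonitor:
    """Passive side: components announce themselves, the monitor only listens."""

    def __init__(self, max_silence_s):
        self.max_silence_s = max_silence_s
        self.last_seen = {}

    def beat(self, component, now):
        """Record an 'I'm here' message from a component."""
        self.last_seen[component] = now

    def suspects(self, now):
        """Components that have been silent for longer than allowed."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.max_silence_s
        )


def ping(send_echo_request):
    """Active side: ask 'are you there?' and expect an echo back."""
    try:
        return send_echo_request() == "echo"
    except TimeoutError:
        return False


monitor = HeartbeatMonitor(max_silence_s=3.0)
monitor.beat("a", now=0.0)
monitor.beat("b", now=0.0)
monitor.beat("a", now=4.0)           # "a" keeps announcing itself, "b" goes quiet
silent = monitor.suspects(now=5.0)   # ["b"]


def unreachable():
    raise TimeoutError("no echo received")


reachable_ok = ping(lambda: "echo")  # True
reachable_bad = ping(unreachable)    # False
```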
Now things get more complex with voting.
This implies redundancy, right?
Multiple components doing something similar.
Exactly.
Voting needs multiple sources of results to compare.
And there are different ways to get that redundancy.
The simplest is replication.
Just run exact copies of the same component.
Good against random hardware failures, I suppose.
If one server dies, the clone takes over.
Perfect for that.
But completely useless if there's a bug in the software itself, because all clones will have the same bug.
They'll fail together.
The common mode failure problem.
So how do you fight that?
That's where functional redundancy comes in.
Design diversity.
You have different teams build different implementations based on the same specification.
Different code, same job.
Right.
The idea is they're unlikely to make the exact same mistakes.
It's much more expensive, obviously, but it protects against those systematic design flaws.
Okay.
So replication handles hardware faults.
Functional redundancy handles implementation bugs.
What about analytic redundancy?
It sounds like it goes even deeper.
Are we talking about mistrusting the specification itself?
That's pretty much it.
Yeah.
Analytic redundancy aims to tolerate errors even in the requirements or the interpretation of them.
Instead of different code doing the same calculation, you use fundamentally different ways to arrive at the result.
Like the avionics example.
Yeah.
Calculating altitude multiple ways.
Exactly.
Use radar altitude, barometric pressure altitude, maybe GPS altitude, or calculate it geometrically.
They all measure altitude, but use totally different physics, different potential error sources.
The voter then has to be smart enough to fuse these potentially conflicting results.
Fascinating.
Yeah.
Very complex, but robust.
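A very small illustration of fusing independent altitude estimates follows. Taking the median and flagging readings that stray too far from it is just one simple fusion rule chosen for this sketch; the source does not say how a real avionics voter works, and the sensor names, readings, and tolerance are invented.

```python
from statistics import median


def fuse_altitude(readings_m, tolerance_m):
    """Fuse independent altitude estimates and flag the ones that disagree.

    readings_m maps a method name to its estimate in metres.
    """
    fused = median(readings_m.values())
    outliers = sorted(
        name for name, value in readings_m.items()
        if abs(value - fused) > tolerance_m
    )
    return fused, outliers


# Invented readings: three methods, one of them clearly off.
readings = {"radar": 1502.0, "barometric": 1498.0, "gps": 1710.0}
fused, outliers = fuse_altitude(readings, tolerance_m=50.0)
print(fused, outliers)  # prints "1502.0 ['gps']"
```

Because each estimate comes from different physics, a single bad assumption is unlikely to corrupt all of them in the same way, which is the point of the tactic.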
Okay.
Back to the more specific detection tactics.
I remember the parameter fence.
That sounded clever.
It is pretty neat.
You put a known data pattern, often something memorable like 0xdeadbeef, in memory right after where a variable length argument should end.
If you check later and that pattern has been overwritten.
You know, you had a buffer overflow or some kind of memory corruption.
Precisely.
It's a way to detect memory issues proactively.
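The fence idea can be simulated even in a memory-safe language. In the sketch below a bytearray stands in for raw memory, and the overrun is written deliberately; in C or C++ it would be an accidental out-of-bounds write. The helper names are hypothetical.

```python
FENCE = (0xDEADBEEF).to_bytes(4, "big")  # the known pattern placed after the data


def make_buffer(capacity):
    """Allocate `capacity` bytes for the argument, followed by the fence."""
    return bytearray(capacity) + FENCE


def fence_intact(buf):
    """Check whether the trailing pattern is still what we put there."""
    return bytes(buf[-len(FENCE):]) == FENCE


buf = make_buffer(8)
buf[0:8] = b"12345678"          # exactly fills the 8-byte region
ok_before = fence_intact(buf)   # True: the fence is untouched

buf[0:10] = b"123456789A"       # simulated overrun: 10 bytes into an 8-byte region
ok_after = fence_intact(buf)    # False: the first two fence bytes were clobbered
```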
And of course, there's timeout.
Super common.
If you expect a response within X seconds and don't get it, you raise a timeout exception.
Critical in any distributed system.
Okay.
So we've detected the fault.
Now we need to bounce back.
Yeah.
Recovery tactics.
This starts with being prepared, right?
Like having a redundant spare waiting.
Yep.
Redundancy is key.
A common recovery tactic is rollback.
If things go bad, you revert the system's state to a previously saved checkpoint.
A known good state, the rollback line.
Makes sense.
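Rollback to a checkpoint can be sketched with an in-memory copy of the state. Real systems persist checkpoints and have to decide how often to take them; the class name and the account example here are illustrative assumptions.

```python
import copy


class Checkpointed:
    """Holds mutable state plus one saved known-good copy of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save the current state as the new known-good point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and return to the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


svc = Checkpointed({"balance": 100})
svc.state["balance"] = 120
svc.checkpoint()              # 120 is now the known-good state
svc.state["balance"] = -999   # simulated corruption after a fault
svc.rollback()
print(svc.state)              # prints "{'balance': 120}"
```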
And what about upgrades?
Doing upgrades without downtime seems like a major recovery challenge.
Huge challenge.
The goal is in-service upgrades.
Ideally, a hitless ISSU, an in-service software upgrade.
This often involves loading the new software onto a redundant spare while the old version is still running live traffic.
Then carefully switching traffic over to the newly upgraded component without dropping connections or losing state.
Exactly.
It's complex choreography, usually leveraging redundancy patterns.
And then there's the compromise option.
Graceful degradation.
What's the idea there?
It means if some part of the system fails, you don't just crash the whole thing.
You keep critical functions running, but maybe drop less important ones.
Like that streaming service example.
Maybe you lose 4K quality and drop to HD or SD, but the video keeps playing.
Better than a black screen.
Right.
Maintain the core service, even if it's diminished.
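The streaming example boils down to walking a preference list and taking the best option that still works. A tiny sketch, with an assumed quality ladder and a hypothetical function name:

```python
QUALITY_LADDER = ["4K", "HD", "SD"]  # best first; an assumed ordering


def choose_quality(healthy_tiers):
    """Return the best quality tier that is still available, or None."""
    for tier in QUALITY_LADDER:
        if tier in healthy_tiers:
            return tier
    return None  # nothing left: only now does playback actually stop


normal = choose_quality({"4K", "HD", "SD"})   # "4K"
degraded = choose_quality({"HD", "SD"})       # "HD": 4K path failed, video continues
```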
Okay.
What about bringing a fixed component back into service?
Critical step.
You don't just throw it back in.
The shadow tactic is common.
You bring the repaired component online, feed it live input, let it run in parallel with the active one, but its output is ignored.
So you watch it for a while, make sure it's behaving correctly before you actually rely on it.
Exactly.
You monitor it in shadow mode until you're confident it's stable, then you make it active.
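Shadow mode can be sketched as running both components on the same input while only the active one's output is used. The function below is a simplified illustration with made-up names; a real system would also compare timing and resource use, not just results.

```python
def run_with_shadow(active, shadow, inputs):
    """Serve results from `active` while silently checking `shadow` against it."""
    outputs = []
    mismatches = 0
    for item in inputs:
        result = active(item)
        outputs.append(result)       # only the active component's output is used
        try:
            if shadow(item) != result:
                mismatches += 1
        except Exception:
            mismatches += 1          # a crash in the shadow never affects callers
    return outputs, mismatches


def active(x):
    return x * 2


def repaired(x):
    return x + x                     # behaves the same as the active component


def still_buggy(x):
    return x * 2 if x < 3 else 0     # wrong for larger inputs


good_outputs, good_mismatches = run_with_shadow(active, repaired, [1, 2, 3, 4])
bad_outputs, bad_mismatches = run_with_shadow(active, still_buggy, [1, 2, 3, 4])
```

A component would be promoted to active only after its mismatch count has stayed at zero for long enough to trust it.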
And closely related is state resynchronization.
If you have a standby spare, especially a warm or cold one, you need a way to update its state to match the active component before or during failover.
So the spare isn't starting from scratch.
Right.
And if you do need a restart, you want to minimize the impact.
That's escalating restart.
Don't just reboot the whole server immediately.
Try smaller things first.
Like restart just a process thread.
Yeah.
Maybe level zero is just restarting some child threads.
Level one might be restarting the process, but keeping its state.
Level two might reload the process image.
Level three is the full reboot.
You escalate only as needed, trying to keep that MTTR as low as possible.
Smart.
Minimize the blast radius.
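The escalation ladder described above maps naturally onto a loop that tries the cheapest action first. In this sketch the restart actions are stand-in callables and the health check is a simple flag, both assumptions for the sake of a runnable example.

```python
def escalating_restart(levels, is_healthy):
    """Try each restart level in order and stop at the first one that works.

    `levels` is a list of (name, action) pairs ordered from least to most
    disruptive. Returns (level_index, name), or None if nothing helped.
    """
    for level, (name, action) in enumerate(levels):
        action()
        if is_healthy():
            return level, name
    return None


# Demo: pretend the fault is only cleared by reloading the process image.
state = {"healthy": False}
log = []


def reload_image():
    log.append(2)
    state["healthy"] = True


levels = [
    ("restart child threads", lambda: log.append(0)),
    ("restart process, keep state", lambda: log.append(1)),
    ("reload process image", reload_image),
    ("full reboot", lambda: log.append(3)),
]

outcome = escalating_restart(levels, lambda: state["healthy"])
print(outcome)  # prints "(2, 'reload process image')"
```

The full reboot at level three is never attempted in this run, which is exactly the saving in recovery time the tactic is after.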
Okay.
Finally, the third category, fault prevention.
Stopping things before they even happen, like taking something offline deliberately.
Right.
Removal from service.
This sounds counterintuitive for availability, but it's preemptive.
Think software rejuvenation or the therapeutic reboot.
If you know a component has, say, a slow memory leak that will eventually cause a crash.
You reboot it proactively during off-peak hours, like 3 a.m., before it actually fails.
Exactly.
That's why your phone or router wants to restart occasionally at night.
It's preventative maintenance.
Okay.
Another big one here is transactions.
Using ACID properties.
Yeah.
Atomic, consistent, isolated, durable.
Using transactional semantics, especially for messaging between components, ensures the operations either complete fully or not at all.
Maintaining consistency, even if failures occur mid-process.
And the common way to do that across systems is two-phase commit, 2PC.
2PC is a standard protocol for distributed transactions.
It works, it guarantees consistency, but the trade-off is significant.
It forces components to wait for each other, locking resources.
It can dramatically increase latency and kill throughput.
So architects have to weigh that consistency guarantee against a performance hit.
It's not a free lunch.
Definitely not.
A major architectural decision.
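The shape of two-phase commit can be shown with an in-process sketch. It deliberately leaves out what makes real 2PC hard (network calls, timeouts, durable logs, coordinator failure), and the class and function names are hypothetical.

```python
class Participant:
    """A resource that can vote on, then commit or abort, a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote yes (and hold resources) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit only if every participant votes yes; otherwise abort everywhere."""
    # Phase 1: collect votes. Everyone who voted yes now waits for the outcome,
    # which is where the latency and locking cost comes from.
    votes = [p.prepare() for p in participants]

    # Phase 2: broadcast the decision.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


happy = [Participant("orders"), Participant("payments")]
committed = two_phase_commit(happy)

mixed = [Participant("orders"), Participant("payments", can_commit=False)]
aborted = not two_phase_commit(mixed)
```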
Okay.
So we've got the language, the math, the scenarios, the individual tactics.
Now let's zoom out.
How are these tactics typically bundled together into larger architectural patterns for availability?
Starting with that redundant spare family.
The choice seems to be about cost versus recovery speed.
Absolutely.
It hinges on how tightly synchronized the state is between the active and spare components.
At one end, you have active redundancy, also called a hot spare.
Hot spare.
Meaning?
All nodes, active and spares, process all the inputs simultaneously in lockstep.
They maintain virtually identical state all the time.
If the active fails, a spare takes over almost instantly.
We're talking milliseconds.
Fastest recovery.
But sounds expensive and complex to keep them perfectly synced.
Extremely.
High cost, high complexity.
Needed when even a tiny interruption is unacceptable.
Okay.
What's the middle ground?
Passive redundancy or a warm spare.
Here, only the active node processes the live input.
It then sends periodic state updates, checkpoints, to the standby.
So the standby is ready, but maybe slightly behind.
Exactly.
Its state is loosely coupled.
If the active fails, the warm spare takes over.
But it might need a few moments, maybe seconds, to process the last checkpoint and fully catch up.
It's a balance between cost and recovery time.
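The "ready, but maybe slightly behind" behaviour of a warm spare can be seen in a toy model. Here the active node handles every input and copies its state to the standby every few inputs. The class names, the counter state, and the checkpoint interval are assumptions for illustration.

```python
import copy


class Node:
    """A trivial stateful component: it just accumulates a counter."""

    def __init__(self):
        self.state = {"count": 0}

    def handle(self, n):
        self.state["count"] += n


class WarmSparePair:
    """Active node plus a standby that is refreshed by periodic checkpoints."""

    def __init__(self, checkpoint_every):
        self.active = Node()
        self.standby = Node()
        self.checkpoint_every = checkpoint_every
        self._since_checkpoint = 0

    def handle(self, n):
        self.active.handle(n)  # only the active node sees live input
        self._since_checkpoint += 1
        if self._since_checkpoint >= self.checkpoint_every:
            self.standby.state = copy.deepcopy(self.active.state)
            self._since_checkpoint = 0

    def fail_over(self):
        """Promote the standby; it resumes from the last checkpoint it received."""
        self.active, self.standby = self.standby, Node()
        self._since_checkpoint = 0
        return self.active.state


pair = WarmSparePair(checkpoint_every=2)
for n in [1, 1, 1]:
    pair.handle(n)               # active count is 3, last checkpoint was at 2

recovered = pair.fail_over()
print(recovered)                 # prints "{'count': 2}"
```

The one update missing after failover is the gap a warm spare has to close, for example by replaying recent inputs, and it is the gap a hot spare avoids by processing everything in lockstep.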
So when would you absolutely need hot spare over warm spare?
That extra cost and complexity?
Think real-time systems.
High frequency trading.
Maybe critical industrial controls.
Places where even a second of lost data or control during failover is catastrophic.
For many web services, a few seconds of potential disruption during failover might be acceptable, making warm spare a better fit.
Right.
And the last one.
The cold spare.
Spare, or cold spare.
The standby is completely offline, maybe even powered down.
If the active fails, you have to power on the spare, load the software, initialize everything.
The MTTR is much higher: minutes, maybe longer.
It's really only suitable if your availability requirements aren't super strict.
Definitely not five nines.
Okay, clear tradeoff.
Speed versus cost versus complexity.
Another pattern mentioned is triple modular redundancy, TMR.
This sounds like the voting tactic in action.
It is.
TMR is a direct implementation of voting.
You have three identical components, all receiving the same input.
They all compute the output.
And a voter component compares the three results.
And decides the correct one.
Usually just by majority rules.
Typically, yeah.
If two agree and one is different, the voter goes with the majority and flags the dissenter as potentially faulty.
The appeal is its simplicity, conceptually, and it works regardless of why a component failed.
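A majority voter for three results fits in a handful of lines. This sketch assumes the results are directly comparable values; the unit labels A, B, and C and the decision to raise an error when all three disagree are illustrative choices, not something the source specifies.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority result and the names of any dissenting units."""
    results = {"A": a, "B": b, "C": c}
    value, count = Counter(results.values()).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three units disagree")
    dissenters = [name for name, result in results.items() if result != value]
    return value, dissenters


value, dissenters = tmr_vote(42, 42, 41)
print(value, dissenters)  # prints "42 ['C']"
```

Note that the voter itself is a single piece of code here, which is precisely the single-point-of-failure concern raised in the conversation.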
But the downside seems obvious.
Cost and complexity again.
You've instantly tripled your hardware or software instances, and now you have to manage deployment, patching, and monitoring across three units plus the voter itself.
Making sure the voter doesn't become a single point of failure is critical.
It's a heavy architectural lift.
Right.
Managing three of everything isn't trivial.
Okay, last pattern.
The circuit breaker.
This sounds essential for distributed systems.
Oh, it's become incredibly important.
Its job is defensive.
Imagine service A calls service B, but B is failing or super slow.
Without a circuit breaker, A might just keep hammering B with retries.
Potentially overwhelming B even more.
Or exhausting resources in A itself.
Exactly.
The circuit breaker watches for failures in calls to B.
If failures exceed a threshold, the breaker trips or opens.
Now, any subsequent calls from A to B don't even go over the network.
The breaker immediately returns an error to A.
Protecting both A and B.
It stops A from wasting time and resources and gives B breathing room to potentially recover.
After a timeout period, the breaker might allow a single test call through.
If it succeeds, the breaker closes and normal operation resumes.
If it fails, it stays open.
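The trip, fail-fast, and trial-call behaviour described above can be sketched as a small class. It is a single-threaded illustration under assumed names and parameters: under concurrency it would not limit the trial to exactly one call, and production libraries handle that and much more.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_timeout_s, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast; service presumed down")
            # Timeout elapsed: fall through and let one trial call go out.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or stay open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result


# Demo with a fake clock so the timeout can be skipped instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("service B is down")


for _ in range(2):
    try:
        breaker.call(failing)      # these calls reach B and fail
    except ConnectionError:
        pass

try:
    breaker.call(failing)          # breaker is open: B is not contacted at all
    failed_fast = False
except CircuitOpenError:
    failed_fast = True

now[0] = 11.0                      # the reset timeout has passed
recovered = breaker.call(lambda: "ok")  # trial call succeeds, breaker closes
```

The two constructor parameters are the tuning knobs: the failure threshold and the reset timeout.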
So it prevents cascading failures, where one service failure brings down others.
That's the primary benefit.
Preventing those chain reactions.
The main tradeoff is tuning it correctly.
Set the failure threshold too low or the timeout too short, and you might trip the breaker for temporary glitches, false positives, taking a dependency offline unnecessarily.
Makes sense.
Careful tuning required.
Okay, that feels like a good tour of the landscape.
Yeah.
Really brings home that achieving high availability isn't magic.
It's a very disciplined process.
It really is.
It starts with understanding the math, MTBF, MTTR, and setting realistic goals.
Then using that structured approach, the general scenario, to define exactly what availability means for your system.
And then choosing the right mix of tactics.
Detection, recovery, prevention, and packaging them into robust patterns, like the different spare types or circuit breakers.
It's a systematic engineering effort.
So for you listening, as you think about applying this, here's something to chew on.
A question prompted by the source material.
How does that relentless drive for 24/7 availability trade off against other important qualities, like, say, how easy it is to change the system, its modifiability, or to deploy new updates?
That's a tough one.
Yeah.
If a system absolutely cannot go down, even for a second, how do you roll out a major new feature or a fundamental architectural change without breaking that five-nines promise?
That's a really interesting system design puzzle.
It definitely is.
Requires careful planning, often leveraging those advanced upgrade tactics we talked about.
Well, thank you for joining us for this deep dive into software availability and resilience.
Hopefully this gives you a solid framework for thinking about it.
We'll see you on the next deep dive.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.