Chapter 10: Safety – Designing Safe & Reliable Architectures

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's unpack this.

We are diving into one of the most sobering and absolutely critical topics in all of software engineering: safety.

Absolutely.

This is the quality attribute where the stakes move beyond data loss or a slow system to the very real ability to cause death, injury, or physical destruction.

When zeros and ones are connected to a physical actuator, a valve, a brake, a turbine,

how do architects make sure the code doesn't cause catastrophe?

It sounds dramatic, doesn't it?

But it's, well, it's the reality of modern software architecture.

The underlying mission statement, really, for anyone building these cyber-physical systems has to include, pretty explicitly, don't kill anyone.

We're kind of moving past the theoretical fear of AIs like, you know, HAL 9000.

Yeah, the science fiction stuff.

And confronting real-world failures where code simply issues a bad command, a fatally bad command sometimes.

And that's really the mission of this deep dive, isn't it?

We're gonna look at the foundational concepts laid out by the experts in the field, the frankly terrifying examples, the structured analysis technique.

And most importantly,

the specific concrete architectural tactics you must employ to deliberately design and crucially prove safety into a system.

Exactly, this is how you shift from just sort of hoping your system is safe.

Right, hope is not a strategy here.

To actually engineering its resilience, guaranteeing it as much as possible.

Okay, let's start with the stakes, because honestly, they inform everything that follows.

The bridge between the digital world and the destructive potential is that mechanical actuator.

That physical connection.

Yeah, we're talking about disasters like the 2009 Sayano-Shushenskaya hydroelectric power station, a faulty control system apparently triggered by an errant keystroke.

Just a keystroke.

Caused a destructive water hammer effect that literally flooded and destroyed the plant.

Software dictated and the hardware obeyed with fatal consequences.

And we know the classic examples, unfortunately, the Therac-25 radiation machine, overdosing patients because of a really complex race condition in the software.

Horrifying. Or, more recently, the fatal flaw in the Boeing 737 MAX.

It relied on software, the MCAS system, which in turn relied on just one single sensor input.

That complexity.

A single point of failure.

Exactly.

It shows the damage doesn't require malicious intent.

It often just requires a design flaw,

maybe exacerbated by bad information or incomplete testing.

But sometimes the hardware is actually fine, right?

And the failure is purely informational.

I remember reading about Soviet Lieutenant Colonel Stanislav Petrov back in 83.

Oh, the Petrov incident, yes.

His early warning system mistook unusual sunlight reflection off clouds for five incoming US missiles.

The computer was screaming, launch, basically.

High alert.

But Petrov used his human judgment, his intuition, to ignore the system.

He potentially saved countless lives.

That was a failure of the system to correctly classify its input data.

Precisely.

And that inability to handle bad or confusing data is often the core problem.

We saw that with Air France Flight 447 in 2009.

Right, the Atlantic crash.

Yeah.

The plane's engines, the controls, they were working perfectly fine.

But icing caused unreliable airspeed readings.

So the sensors were giving bad data.

Bad data, yeah.

And the pilots misinterpreted the stall warnings, which were apparently quite confusing and intermittent because of the sensor issue.

And they ended up pushing the plane into an irrecoverable stall.

The system gave confusing data and the human couldn't recover.

So if we're gonna define safety formally, it's really about a system's ability to avoid straying into these states that cause injury or loss.

Architects have to anticipate very specific ways the system can fail.

We probably don't need to list all five failure modes mentioned, but maybe we can highlight the, say, three most critical categories.

Sounds good.

Well, the most basic one is omission.

This is just the failure of an event to occur.

It was supposed to happen, but didn't.

Right.

The valve never closes, the critical value never arrives from the sensor, simple absence.

Okay.

Then you've got the opposite, which is commission.

Uh-huh.

This is the spurious occurrence of an undesirable event.

The system does something it absolutely should not have done.

Like the Therac-25 delivering that lethal dose.

Exactly.

Or just a function performing incorrectly, maybe sending the wrong command.

Okay.

Omission, commission.

What's the third big one?

Well, arguably the most dangerous or maybe the most insidious failure mode involves system values.

You get coarse incorrect values, maybe way outside the expected range, which are usually pretty easy to detect.

Like a temperature reading of absolute zero or something.

Clearly wrong.

Right.

You could usually check for those boundary conditions, but the real danger is subtle incorrect values.

This is the stealth killer, you might say.

The value is just slightly wrong, maybe still within the acceptable bounds defined in the spec.

So simple checks don't catch it.

Exactly.

It's undetectable by simple means.

And this is often what enables a larger failure down the line.

Especially if a crucial calculation relies on that subtly wrong, but seemingly valid piece of data.
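As a rough illustration of that difference (not from the chapter), here is a minimal Python sketch. The sensor limits, the `MAX_STEP_C` threshold, and both function names are assumptions made up for the example. It shows a boundary check catching a coarse error, and a rate-of-change check as one possible extra heuristic for values that are wrong but still in range.

```python
# Hypothetical limits for a temperature sensor; the numbers are illustrative only.
VALID_RANGE_C = (-40.0, 150.0)
MAX_STEP_C = 5.0  # assumed largest plausible change between consecutive readings


def in_range(value: float) -> bool:
    """Boundary check: catches coarse incorrect values such as absolute zero."""
    low, high = VALID_RANGE_C
    return low <= value <= high


def plausible_step(previous: float, value: float) -> bool:
    """Rate-of-change check: an extra heuristic for in-range but suspicious values."""
    return abs(value - previous) <= MAX_STEP_C


print(in_range(-273.15))           # False: coarse error, rejected by the range check
print(in_range(71.0))              # True: in range, so the range check alone accepts it
print(plausible_step(60.0, 71.0))  # False: an 11-degree jump is flagged as implausible
```

Even the second check has limits: a reading that is off by only a degree or two passes both, which is why subtle incorrect values usually call for redundancy or an independent model rather than a single sanity check.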

Okay, so we know the risks.

We know the types of failures.

Before you even start coding, how do architects proactively find these hidden risks, these subtle errors?

Right.

You need structured analysis.

You can't just guess.

First step is usually identifying the safety critical functions.

What parts of the system absolutely cannot fail catastrophically.

And how do you identify those?

We use techniques like fault tree analysis, or FTA.

It's a top-down deductive method.

You start with the really bad thing you want to avoid, the top event, like the patient received an overdose, and you work backward.

Okay.

You identify all the specific component failures, the conditions, the environmental factors that either alone or in combination could lead to that catastrophic top event.

It maps out the failure paths.

That analysis gives you the potential failure points and the components involved.
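To make the "alone or in combination" idea concrete, a fault tree can be sketched as AND and OR gates over basic events. The Python below is only a minimal sketch: the `BasicEvent` and `Gate` classes and the event names are hypothetical, and the tree is invented for illustration rather than taken from any real analysis.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class BasicEvent:
    """A leaf of the tree: a single component failure or condition."""
    name: str

    def occurs(self, active: Set[str]) -> bool:
        return self.name in active


@dataclass
class Gate:
    """An intermediate or top event that combines its inputs with AND or OR."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]]

    def __post_init__(self) -> None:
        if self.kind not in ("AND", "OR"):
            raise ValueError(f"unknown gate kind: {self.kind}")

    def occurs(self, active: Set[str]) -> bool:
        results = [child.occurs(active) for child in self.inputs]
        return all(results) if self.kind == "AND" else any(results)


# Hypothetical tree for the top event "patient receives an overdose".
bad_dose_command = Gate("dose command too high", "OR", [
    BasicEvent("race condition in operator input"),
    BasicEvent("corrupted dose table"),
])
top_event = Gate("patient receives an overdose", "AND", [
    bad_dose_command,
    BasicEvent("interlock fails to block beam"),
])

print(top_event.occurs({"corrupted dose table"}))  # False: the interlock still holds
print(top_event.occurs({"corrupted dose table",
                        "interlock fails to block beam"}))  # True
```

Working backward from the top event, the AND gate shows that in this made-up tree a bad command alone is not enough; it takes a second, independent failure to reach the overdose.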

And then you use this structured framework called the safety general scenario to actually design the response, right?

Exactly.

Think of it like filling out a detailed template or table for each potential failure.

It acts as a blueprint for mitigation.

Can you walk us through that structure?

What's in the table?

Sure.

So if you imagine this scenario structure, it starts with a source.

Where did the problem originate?

Was it a sensor, a software component, maybe a user action?

Okay, source then.

Then the stimulus.

What was the specific failure mode?

Was it an omission, like we talked about?

Incorrect data.

A timing failure, too early, too late.

Got it.

Source, stimulus, what else?

Next, you define the environment.

What was the system doing at the time?

Was it in normal operation, maybe maintenance mode, or already in some kind of recovery state?

Context matters.

Right.

And then the artifacts.

That's just the specific safety critical components that are affected by the stimulus in this environment.

The parts you identified earlier as critical.

Okay, so that describes the problem.

What about the solution part?

Right, the crucial final two parts define the solution.

The response is what the system must do once the problem is detected.

Does it transition to manual?

Shut down safely?

Engage a backup system?

Notify someone?

And the last part?

The response measure.

This is critical.

It's the quantifiable deadline for achieving that safe response.

How quickly must it happen?

Let's maybe use a quick example, maybe not a catastrophic one, just to show how specific this gets, like that heart monitor example.

Oh, right.

So the scenario might state, the source is a primary heart rate sensor.

The stimulus is that it fails to report a life-critical value for more than 100 milliseconds.

Omission with a timing aspect.

Okay.

The environment is normal patient monitoring.

The artifact is the patient vital signs display and alerting system.

The response must be: log the failure, illuminate a warning light, engage a backup sensor, maybe a lower fidelity one, and resume monitoring using the backup.

And the response measure.

The response measure is all of that must happen within say 300 milliseconds of detecting the omission.

That level of detail, that quantification is absolutely necessary if you want to actually test and certify that the system meets its safety requirements.
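One way to picture the six-part scenario is as a small record that a test harness could check against. This Python sketch is an assumption-laden illustration: the `SafetyScenario` class, its field names, and the `met_by` helper are hypothetical, while the values come from the heart monitor example just described.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SafetyScenario:
    """The six parts of a safety scenario, with the deadline as a number."""
    source: str
    stimulus: str
    environment: str
    artifact: str
    response: Tuple[str, ...]
    response_deadline_ms: int

    def met_by(self, measured_response_ms: float) -> bool:
        """True if a measured response time satisfies the response measure."""
        return measured_response_ms <= self.response_deadline_ms


heart_monitor = SafetyScenario(
    source="primary heart rate sensor",
    stimulus="fails to report a life-critical value for more than 100 ms",
    environment="normal patient monitoring",
    artifact="patient vital signs display and alerting system",
    response=(
        "log the failure",
        "illuminate a warning light",
        "engage the backup sensor",
        "resume monitoring using the backup",
    ),
    response_deadline_ms=300,
)

print(heart_monitor.met_by(250.0))  # True: within the 300 ms deadline
print(heart_monitor.met_by(350.0))  # False: too slow, the requirement is not met
```

Because the response measure is a number, a pass or fail verdict falls straight out of a timing measurement, which is what makes the scenario testable.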

Right.

You can't just say it should be fast.

You need the number.

Exactly.

And this clarity allows the system, once it recognizes an unsafe state, to transition through several defensive postures.

It could try to avoid it entirely, recover from it, maybe continue operating, but in a degraded mode.

Like losing some features, but keeping the critical ones.

Yeah.

Or a specific safe mode.

And then there's the ultimate safe response.

Shutting down or what we often call failing safe.

Or maybe just handing control back to a human operator.

Okay.

So we've analyzed the risks, defined the scenarios.

Now we move into the actual engineering solutions, the tactics architects use to make those safe responses actually happen.

There are quite a few listed.

Maybe we can focus on, say, four high-impact ones that really address the kinds of complexity and informational failures we saw in those disaster examples.

Sounds good.

Let's start with unsafe state avoidance.

And maybe the simplest, sometimes overlooked tactic here is substitution.

Substitution.

What does that mean?

It basically means replacing complex software control mechanisms with simpler, often hardware-based protection.

Think physical interlocks, simple watchdog timers.

Things that don't rely on complex code execution.

Wait, hang on.

If modern systems, like you mentioned the 737 MAX, rely on incredible software complexity just to function,

why is substitution, going back to simple hardware, still considered foundational?

Isn't that sort of anti-modern architecture?

Ah, but it's foundational because software is complex.

Complex software is inherently much harder to prove correct, to prove safe under all conditions.

Hardware interlocks or simple, dedicated safety circuits don't consume processing resources in the same way.

They don't have race conditions lurking.

They just work, or they don't.

Pretty much.

They physically stop an operation if a precondition isn't met, regardless of what some potentially buggy software is trying to command.

If you can use a simpler physical or hardware barrier to prevent an unsafe state, you absolutely should consider it.

It's often more robust.

Okay, that makes sense.

A sort of physical guarantee.

Let's move to detection and containment.

If we assume complex software is required, which it often is, one key tactic seems to be comparison.

Right, this usually involves redundancy.

Not just having two components, but typically three or more synchronized, replicated elements doing the same job.

Why three or more?

Why not just two?

Well, with two, if they disagree, you know one is wrong, but you don't know which one.

With three or more, you can implement voting or masking.

Ah, majority rules.

Exactly.

The system can detect an unsafe state because the outputs don't match.

And critically, if one component fails, the other two can outvote it, identify the faulty element, and allow the system to maintain correct operation, or at least transition safely.
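The voting idea can be sketched in a few lines of Python. This is a minimal illustration, not a certified design: `majority_vote` is a hypothetical helper, and it compares outputs for exact equality, whereas real replicated channels often need a tolerance when comparing floating-point values.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Return (agreed value, indexes of dissenting replicas).

    If no value has a strict majority, return (None, all indexes):
    a fault is detected but no replica can be trusted or masked.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


print(majority_vote([42, 42, 17]))  # (42, [2]): replica 2 is outvoted and identified
print(majority_vote([42, 17]))      # (None, [0, 1]): two replicas can only disagree
```

The second call shows why two replicas are not enough: the disagreement is detected, but there is no majority to say which one is right.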

That sounds incredibly expensive though.

Are you really asking architects to like triple or quadruple the component count just for one function?

Is that practical for every safety critical piece?

It is expensive, no doubt.

But it's often the price you pay for high assurance, especially in those level A catastrophic systems we'll talk about later.

But we have to be careful about the type of redundancy.

How so?

If we just use replication, making exact clones of the hardware and software, we protect against random hardware failures: a chip dies, the other two keep working.

Right.

But if the design itself has a flaw, a bug in the logic.

Then all the clones will fail in the exact same way at the exact same time.

Precisely.

That's the dreaded common mode failure.

That brings us right back to the 737 MAX, doesn't it?

Where even if they had redundant computers running MCAS, the fundamental design choice, relying on that single sensor input, was the common flaw.

Exactly right.

So to really combat a common mode failure, you often need functional redundancy, sometimes called design diversity.

The redundant components are built differently.

Like different teams, different algorithms, maybe even different programming languages.

Yes, potentially.

So they perform the same function, but are unlikely to share the same subtle design flaws.

This is absolutely vital for effective containment of software faults.

Okay, let's combine detection and containment into what sounds like a really powerful architectural pattern, the monitor-actuator.

Can you explain that one?

Yeah, this is quite an elegant separation of concerns, I think.

You have the main component, the actuator controller, which does the primary job, performs calculation, figures out the command to send.

Okay, the doer.

Right.

But critically, you also have an independent monitor component.

This monitor performs its own checks.

Maybe it uses a much simpler, highly assured model of the system,

or just does sanity checks on the controller's proposed command before that command is actually released to the physical actuator.

So it's like an independent, skeptical gatekeeper.

Exactly.

If the controller says, open the floodgates,

but the monitor, using its simpler model or checks, knows the reservoir is already full, or the command is unreasonable.

It blocks the command, it says nope.

It blocks the command.

This provides a crucial second line of defense.

It's exactly the kind of mechanism that could potentially have prevented, or at least mitigated, many of the historical failures we discussed earlier.

It decouples the complex calculation from the final safety check.
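Here is a minimal Python sketch of that gatekeeper arrangement, using the floodgate example. Everything in it is assumed for illustration: the `GateCommand` type, the `DOWNSTREAM_FULL_PCT` threshold, and the reading of "the reservoir is already full" as a downstream level check. The point is only that the monitor's logic is far simpler than whatever computed the command, and that nothing reaches the actuator without passing it.

```python
from dataclasses import dataclass
from typing import Callable, List

DOWNSTREAM_FULL_PCT = 95.0  # assumed threshold at which the reservoir counts as full


@dataclass(frozen=True)
class GateCommand:
    opening_pct: float  # requested floodgate opening, 0 to 100


def monitor_allows(command: GateCommand, downstream_level_pct: float) -> bool:
    """Independent sanity checks, deliberately simpler than the controller."""
    if not 0.0 <= command.opening_pct <= 100.0:
        return False  # physically unreasonable command
    if downstream_level_pct >= DOWNSTREAM_FULL_PCT and command.opening_pct > 0.0:
        return False  # reservoir already full: do not open the gates
    return True


def release(command: GateCommand, downstream_level_pct: float,
            actuator: Callable[[GateCommand], None]) -> bool:
    """Forward the command to the actuator only if the monitor approves."""
    if not monitor_allows(command, downstream_level_pct):
        return False
    actuator(command)
    return True


sent: List[GateCommand] = []  # stands in for the physical actuator
print(release(GateCommand(80.0), 40.0, sent.append))  # True: command released
print(release(GateCommand(80.0), 98.0, sent.append))  # False: blocked by the monitor
print(len(sent))                                      # 1
```

In a real system the monitor would typically run on separate hardware with its own sensor path, since a monitor that shares the controller's inputs can share its blind spots.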

That seems really valuable.

Okay, one more tactic, this time under recovery.

Let's touch on rollback.

Right, so if an error is detected, despite avoidance and detection measures, rollback allows the system to revert to a previously known good state, a saved checkpoint.

Instead of just crashing or shutting down?

Ideally, yes.

It's often much better than a sudden shutdown, especially in systems that need continuity.

It allows the system to potentially continue operating from a safe point, and it's often combined with attempts to repair state, maybe actively fixing the erroneous condition, if possible, before continuing execution.
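Rollback can be sketched as saving a copy of the state at a known good point and restoring it when a check fails. The Python below is a minimal sketch under stated assumptions: `CheckpointedState`, the `valve_pct` field, and the `is_safe` invariant are all hypothetical names chosen for the example.

```python
import copy
from typing import Any, Dict


class CheckpointedState:
    """Holds mutable state plus one saved known good copy."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self.state = copy.deepcopy(initial)
        self._checkpoint = copy.deepcopy(initial)

    def checkpoint(self) -> None:
        """Record the current state as the known good state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def is_safe(state: Dict[str, Any]) -> bool:
    """Hypothetical invariant: the valve position must be a valid percentage."""
    return 0.0 <= state["valve_pct"] <= 100.0


system = CheckpointedState({"valve_pct": 20.0})
system.state["valve_pct"] = 35.0
system.checkpoint()                # 35.0 is now the known good state

system.state["valve_pct"] = 140.0  # erroneous update
if not is_safe(system.state):
    system.rollback()

print(system.state["valve_pct"])   # 35.0
```

Deep copies keep the checkpoint isolated from later mutations; a production system would also have to decide when checkpoints are taken and how commands already sent to actuators are handled, which this sketch ignores.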

So building on those individual tactics,

architects also use broader patterns.

One important one is the separated safety pattern.

Separated safety?

Yeah, since safety certification, proving the system meets rigorous standards, is incredibly costly and time consuming.

Right, the paperwork alone must be immense.

Oh, it is.

So this pattern deliberately divides the system architecture into safety-critical portions and non-safety-critical portions.

You build strong firewalls or barriers between them.

And the benefit?

The benefit is you only need to dedicate that super expensive, rigid certification effort to the much smaller, isolated, critical core.

The non-critical parts can be developed using more standard, less costly processes.

It contains the cost and effort.

And that effort, that rigor, it's guided by these formal assurance levels, right?

Like DALs in aviation.

Exactly.

In avionics, the FAA and EASA use design assurance levels, or DALs.

They range from level A, which covers catastrophic outcomes like loss of the aircraft, multiple fatalities.

The absolute worst case.

The worst case, all the way down to level E, which means a failure would have no effect on safety or aircraft operation.

These levels, A through E, dictate the rigor of the development process, the testing, the documentation required.

They tell you exactly where you need to focus those limited, expensive verification resources.

Makes sense.

Focus the effort where the risk is highest.

Precisely.

And you see similar concepts in other domains.

For example, in industrial controls or railway systems, the IEC 61508 standard defines safety integrity levels, or SILs.

SIL 4 represents the highest level of dependability, the lowest probability of dangerous failure, down to SIL 1 for less critical functions.

So it's a common practice across industries dealing with safety.

Very common.

This classification system ensures that everyone involved, designers, developers, testers, regulators, has a common quantifiable understanding of the required safety level and the associated risks.

Okay, so we have the analysis, the scenarios, the tactics, the patterns, the certification levels.

How does an architect pull this all together?

Is there like a checklist?

Pretty much, yeah.

The chapter emphasizes the need for something like the tactics-based questionnaire for safety.

It's not magic, it's just systematic review.

Right.

It forces architects to look at their design decisions critically and ask those specific questions.

Have I considered substitution here?

Is it feasible?

Have I implemented adequate comparison and voting for this critical function?

Is the monitor truly independent from the actuator controller?

So it prompts you to think through all these tactics we've discussed.

That structure helps prevent oversight.

It ensures you've deliberately considered the known ways to build in safety.

And that systematic review, that deliberate design, that seems to be the critical takeaway here.

Software safety isn't something that happens by accident.

Never.

It requires these deliberate structured design choices using things like fault tree analysis, defining clear safety scenarios, and consciously applying these tactics for avoidance, detection, containment, and recovery to anticipate and hopefully mitigate catastrophe.

Which interestingly brings us back to that provocative challenge mentioned right at the end.

Sometimes the very system designed for safety can actively prevent a necessary action in an unforeseen circumstance.

The F/A-18 Hornet example.

The fly-by-wire software was programmed very carefully to prevent the pilot from performing maneuvers that could put the aircraft into an unsafe flight regime, a stall or a spin, for example.

A standard safety feature: protect the pilot and the plane.

But in one specific test case, apparently the aircraft did somehow enter an unforeseen unsafe state, maybe due to external factors or a combination of failures.

And it turned out the only way to potentially recover was to perform exactly those kinds of violent maneuvers the safety system was designed to prevent.

And the system did its job.

The safety system dutifully prevented the pilot from making those inputs and the aircraft crashed.

It raises a really difficult question.

How do we, as architects, design systems that are both maximally safe, protecting against all the known and anticipated failures based on our analysis, and yet somehow maximally responsive, or at least allow for override, when faced with those extreme, unforeseen failure modes that might demand radical, perhaps counterintuitive, intervention?

Yeah, how do you balance mandated safety with the need for exceptional control, a truly high stakes architectural challenge to mull over?

Definitely something to think about.

Well, thank you for diving deep with us into the structured architecture of safety.

It's complex, but absolutely vital.

My pleasure, it's crucial stuff.

We'll catch you on the next Deep Dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Software systems connected to the physical world through actuators and sensors pose distinctive safety challenges that architects must address through systematic design approaches. When software failures translate into physical harm, injury, or property damage, the architectural foundation becomes a matter of life and death, as demonstrated by notorious cases like the Therac-25 radiation therapy machine or the Boeing 737 MAX grounding. Safety-focused architecture begins by recognizing that unsafe states manifest through specific failure modes: timing problems where events occur too early or too late, sequence violations where operations occur out of proper order, and data corruption including both missing information and unwanted spurious signals. Architects employ a methodical progression of safety tactics organized into four strategic categories. Unsafe state avoidance prevents hazardous conditions before they arise by substituting reliable hardware components or deploying predictive models that provide early warning of impending problems. When avoidance proves insufficient, unsafe state detection mechanisms activate, using timeouts to enforce timing constraints, condition monitoring to observe system behavior, and sanity checking to validate incoming data. Should unsafe states nonetheless occur, containment and remediation tactics limit damage through redundancy strategies that counter common-mode failures via functional redundancy or tolerate specification errors through analytic redundancy. Limit consequences tactics, including aborting unsafe operations and graceful degradation that preserves critical functionality, maintain minimum acceptable service during failure. Barrier tactics such as firewalls and interlocks enforce proper sequencing and prevent unauthorized state transitions. Recovery mechanisms like rollback to known good states and reconfiguration of system resources enable resumed operation after failures.
Established patterns like the monitor-actuator approach and separated safety architecture provide proven solutions that architects can adapt. Rigorous hazard analysis methodologies including failure mode and effects analysis and fault tree analysis systematically identify critical functions and potential failure paths. Across industries, safety integrity levels and design assurance levels categorize failure severity and guide the scope of verification and testing resources required to achieve acceptable risk.
