Chapter 4: Availability – Ensuring Reliable Systems
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Availability builds on reliability by adding mechanisms for recovery and self-repair after a deviation occurs. It depends on masking or repairing faults (the internal or external causes of an observable failure) so that cumulative service outage stays within required bounds over a specified period. Steady-state availability is commonly quantified as Mean Time Between Failures (MTBF) divided by the sum of MTBF and Mean Time to Repair (MTTR). Demanding systems target "5 nines" (99.999 percent) availability, which leaves only about five minutes of unscheduled downtime per year.
Architectural analysis of availability is guided by a general scenario with six parts: the source of the fault, the stimulus (such as an omission or crash fault), the affected artifact, the operating environment (for example, normal or degraded mode), the desired response (such as logging or notification), and a response measure (such as time to detect or time to repair).
To keep a system compliant with its specification despite faults, architects employ tactics in three categories: detection, recovery, and prevention. Fault detection relies on continuous monitoring through system monitors, watchdog timers, ping/echo, heartbeat messages, sanity checks, and voting logic, which can compare results from identical replicas, functionally redundant components, or analytically redundant components.
Recovery tactics fall into two groups. Preparation-and-repair tactics include redundant spares (hot, warm, or cold), rollback to a previous known-good checkpoint, exception handling, and retrying operations that failed because of transient faults. Reintroduction tactics return repaired components to service through shadow mode, state resynchronization, or escalating restart, which varies the granularity of the components being restarted.
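To make the steady-state availability formula mentioned earlier concrete, here is a minimal Python sketch. The function names and the choice of hours and minutes as units are assumptions made for illustration; they do not come from the chapter.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return MTBF / (MTBF + MTTR), the steady-state availability."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(availability: float) -> float:
    """Return the expected downtime per year implied by an availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical component: fails about every 999 hours, takes 1 hour to repair.
    a = steady_state_availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.5f}")
    print(f"downtime at that level: {downtime_minutes_per_year(a):.1f} min/year")
    print(f"downtime at five nines: {downtime_minutes_per_year(0.99999):.2f} min/year")
```

With these illustrative numbers the component is available 99.9 percent of the time, which allows roughly 526 minutes of downtime per year. Five nines allows only about 5.3 minutes, so at that level repair time matters as much as failure frequency.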
Finally, fault prevention tactics proactively mitigate risk by removing components from service (therapeutic reboot) to scrub latent faults, leveraging transactions for atomic and consistent state updates (ACID properties), and increasing a program’s competence set to handle more exceptions as part of routine operation. These tactics are often encapsulated within architectural patterns like Active Redundancy, Passive Redundancy, Triple Modular Redundancy (TMR), and the Circuit Breaker, each offering a distinct balance between performance, cost, and recovery speed.
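As an illustration of the Circuit Breaker pattern named above, the following Python sketch shows one common formulation: after a run of consecutive failures the breaker "opens" and rejects calls immediately, then lets a single trial call through once a timeout has elapsed. The class name, parameters, and thresholds are hypothetical and chosen for the example; the chapter does not prescribe this design.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a faulty service.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, allow one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # Success: close the breaker and clear the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote invocation, for example `breaker.call(lambda: fetch_quote(symbol))`, where `fetch_quote` stands in for any operation that may fail. The trade-off is that callers see fast rejections during an outage instead of slow timeouts, at the cost of turning away some calls that might have succeeded.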