Chapter 8: The Trouble with Distributed Systems
Distributed systems are defined by the possibility of partial failure: some components break unpredictably while others continue to function normally. The network is a major source of this volatility. As an asynchronous packet network, it imposes unbounded delays and offers no delivery guarantees, so a sender cannot distinguish a lost request from a slow remote node or a lost response.

Fault detection therefore relies on timeouts, which are hard to configure accurately because network delays are highly variable, typically due to queueing (congestion) in switches, operating system scheduling, or pauses caused by virtualization. This variability is a consequence of dynamic packet switching, which is optimized for bursty traffic; synchronous, circuit-switched networks guarantee bounded delays, but at the cost of resource utilization.

Clocks introduce further confusion: local hardware clocks drift, and synchronization via the Network Time Protocol (NTP) is inherently limited by network latency. The two clock types, monotonic (for measuring durations) and time-of-day (wall-clock), serve distinct purposes, but time-of-day clocks are especially problematic because they can jump, drift, and mishandle leap seconds.

Relying on unsynchronized or inaccurate clocks to order events, as in a last-write-wins (LWW) conflict-resolution strategy, can cause silent data loss: a causally later write may receive an earlier timestamp, violating the expected ordering.

Finally, processes can experience unbounded pauses due to stop-the-world garbage collection, virtual-machine migration, or synchronous disk I/O. Such a pause can lead a node to assume incorrectly that a time-limited resource lock (lease) is still valid, potentially corrupting data.
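The distinction between the two clock types can be illustrated with Python's standard library, which exposes both: `time.time()` is a time-of-day reading, while `time.monotonic()` only ever moves forward. A minimal sketch:

```python
import time

# time.time() is a wall-clock (time-of-day) reading: NTP may step it
# forwards or backwards, so the difference between two readings is not
# a reliable duration. time.monotonic() never jumps backwards, which is
# why it is the right choice for measuring elapsed time and timeouts.

start = time.monotonic()
time.sleep(0.01)                      # simulate some work
elapsed = time.monotonic() - start

# Guaranteed by the monotonic clock, regardless of NTP adjustments:
assert elapsed >= 0
```

Using `time.time()` in the same pattern could yield a negative "duration" if NTP stepped the clock backwards between the two readings.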
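The danger of a pause between checking a lease and acting on it can be sketched as follows (the `Lease` class and the timing constants are illustrative assumptions, with `time.sleep` standing in for a stop-the-world garbage-collection pause):

```python
import time

class Lease:
    """Illustrative lease that expires after a fixed duration."""
    def __init__(self, duration):
        self.expires_at = time.monotonic() + duration

    def still_valid(self):
        return time.monotonic() < self.expires_at

lease = Lease(duration=0.05)      # 50 ms lease (hypothetical numbers)

if lease.still_valid():           # the check passes...
    time.sleep(0.1)               # ...but a GC-like pause intervenes
    # By the time the node acts, the lease has expired and another node
    # may already hold it -- yet this node proceeds, believing it is safe.
    assert not lease.still_valid()
```

The check-then-act gap is the whole problem: no amount of checking beforehand protects against a pause that happens after the check.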
To cope with these issues, reliable distributed algorithms must be built to function correctly under explicit system models, most realistically the partially synchronous model with crash-recovery faults, and must assume that nodes are honest but unreliable (non-Byzantine).

Since a node cannot trust its own isolated view, system-wide decisions, such as declaring a node dead, require agreement from a quorum (a majority of nodes).

To prevent "zombie" nodes that still believe they hold an expired lock from corrupting data, the system must employ fencing tokens: monotonically increasing numbers issued by the lock service, which the resource server checks so that it can reject any request carrying an older (smaller) token.

The formal correctness of these algorithms is defined by safety properties (nothing bad ever happens), which must always hold, and liveness properties (something good eventually happens), which require the system to eventually make progress and return to a stable state.
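The quorum rule for a decision such as "node X is dead" can be sketched in a few lines (the function name and vote representation are illustrative assumptions):

```python
def quorum_agrees(votes, cluster_size):
    """A decision stands only if a strict majority of the cluster votes for it."""
    return sum(votes) > cluster_size // 2

# Five-node cluster: three nodes consider node X dead, two still see it alive.
assert quorum_agrees([True, True, True, False, False], cluster_size=5)

# Only two of five agree: no quorum, the decision does not stand.
assert not quorum_agrees([True, True, False, False, False], cluster_size=5)
```

Requiring a strict majority guarantees that two conflicting decisions cannot both reach a quorum, since two majorities of the same cluster must overlap in at least one node.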
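The fencing-token check on the resource-server side can be sketched as follows (the class and token values are illustrative assumptions; the key point is that the server, not the client, enforces the token order):

```python
class FencedResource:
    """Resource server that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # Reject any request whose token is older than one already seen:
        # this is what blocks a "zombie" client holding an expired lock.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.value = value

res = FencedResource()
res.write(token=33, value="from client A")   # A held the lock first
res.write(token=34, value="from client B")   # lock service reissued to B

try:
    # Client A wakes up after a long pause, still believing it holds the lock.
    res.write(token=33, value="zombie write from A")
except PermissionError:
    pass                                      # the stale write is rejected

assert res.value == "from client B"
```

Because the lock service issues tokens in strictly increasing order, the server needs no clock and no knowledge of lease durations; comparing integers is enough to fence off the zombie.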