Chapter 5: Replication in Distributed Data Systems
The inherent difficulty in replication is managing changes to data across multiple machines. The chapter thoroughly examines three fundamental architectural approaches: single-leader replication, multi-leader replication, and leaderless replication. In the single-leader model (master–slave), one designated node handles all writes, which are then propagated to followers through a log or change stream. This architecture requires careful consideration of synchronous replication (guaranteed consistency but sensitive to node failures) versus asynchronous replication (faster performance but potential data loss during failover).

Handling a leader failure involves a tricky failover process—detecting the failure, choosing a new leader (a consensus problem), and reconfiguring the system—which risks issues like losing recent writes or creating a split-brain scenario where two nodes mistakenly believe they are the leader.

Implementation details include statement-based replication (which suffers from non-deterministic functions), Write-Ahead Log (WAL) shipping (which tightly couples the storage engine to the replication protocol), and the preferred logical log replication (which decouples the log from the storage engine).

Since asynchronous followers may lag behind the leader, the system exhibits eventual consistency. This lag introduces application-visible anomalies that require mitigation, specifically ensuring read-after-write consistency (users seeing their own submitted data), monotonic reads (preventing users from seeing data move backward in time), and consistent prefix reads (ensuring causally related writes appear in the correct order).

The text then introduces multi-leader replication (master–master) as a solution for multi-datacenter operations or offline clients, providing superior local write performance and fault tolerance.
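One common way to achieve the monotonic-reads guarantee described above is to route each user's reads consistently to the same replica. The sketch below illustrates this idea; the replica names, hashing scheme, and function names are illustrative assumptions, not details prescribed by the text.

```python
import hashlib

# Hypothetical replica set; in a real deployment these would be
# follower addresses discovered from cluster metadata.
REPLICAS = ["replica-0", "replica-1", "replica-2"]

def replica_for_user(user_id: str) -> str:
    """Deterministically map a user to one replica, so successive reads
    by the same user never land on a more-stale node (barring failover)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLICAS)
    return REPLICAS[index]

# The same user is always routed to the same replica:
assert replica_for_user("alice") == replica_for_user("alice")
```

Note that this only prevents a single user from seeing time move backward; it does not make the chosen replica any less stale, and the mapping must be recomputed if that replica fails.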
The major drawback is the requirement for conflict resolution, since concurrent writes can occur; resolution techniques range from Last Write Wins (LWW), which risks silently discarding data, to advanced structures such as Conflict-Free Replicated Datatypes (CRDTs).

Finally, leaderless replication (Dynamo-style) abandons the concept of a single leader, allowing any replica to accept writes. This model relies on quorums for read and write success, defined by n (total replicas), w (write acknowledgements required), and r (read responses required); a read quorum is guaranteed to overlap the most recent successful write quorum as long as w + r > n. Stale data is repaired via read repair or anti-entropy processes. To maintain availability during network partitions, sloppy quorums and hinted handoff are employed, though they weaken these consistency guarantees. To accurately identify and merge concurrent writes (siblings), leaderless systems utilize per-replica counters known as version vectors (or causal contexts).
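The quorum overlap condition is simple arithmetic and can be sketched directly; the function name here is illustrative, not from the text.

```python
def quorum_ok(n: int, w: int, r: int) -> bool:
    """Strict quorum condition: every set of r replicas read must share
    at least one replica with every set of w replicas successfully written,
    which holds exactly when w + r > n."""
    return w + r > n

# A typical Dynamo-style configuration: n=3, w=2, r=2 guarantees overlap.
assert quorum_ok(3, 2, 2)
# With n=3, w=1, r=1 there is a gap: a read may miss the latest write.
assert not quorum_ok(3, 1, 1)
```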
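The version-vector comparison used to detect siblings can also be sketched. The following is a minimal illustration, assuming a version vector is represented as a dict mapping replica IDs to per-replica counters; the names `Order` and `compare` are this sketch's own, not from the text.

```python
from enum import Enum

class Order(Enum):
    BEFORE = "happens-before"
    AFTER = "happens-after"
    CONCURRENT = "concurrent"  # siblings: the application must merge them
    EQUAL = "equal"

def compare(a: dict, b: dict) -> Order:
    """Compare two version vectors (replica ID -> counter).
    A missing replica entry is treated as counter 0."""
    replicas = set(a) | set(b)
    a_le_b = all(a.get(rep, 0) <= b.get(rep, 0) for rep in replicas)
    b_le_a = all(b.get(rep, 0) <= a.get(rep, 0) for rep in replicas)
    if a_le_b and b_le_a:
        return Order.EQUAL
    if a_le_b:
        return Order.BEFORE   # b saw a's write, so b supersedes a
    if b_le_a:
        return Order.AFTER
    return Order.CONCURRENT   # neither dominates: keep both as siblings

# {A:2, B:1} dominates {A:1, B:1}: the later write had seen the earlier one.
assert compare({"A": 1, "B": 1}, {"A": 2, "B": 1}) is Order.BEFORE
# Neither {A:2} nor {B:1} dominates the other: concurrent siblings.
assert compare({"A": 2}, {"B": 1}) is Order.CONCURRENT
```

When `compare` returns `CONCURRENT`, a store in this style would return both values to the client (as siblings) rather than pick one arbitrarily, which is exactly the data loss that LWW risks.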