Chapter 6: Design a Key-Value Store
The chapter provides a foundational exploration of designing a distributed key-value store, a type of non-relational database in which unique keys map to values that are often treated as opaque objects. The design goals prioritize support for big data with high availability, low latency, automatic scalability, and tunable consistency, while keeping individual key-value pairs small (less than 10 KB).

Since a single server quickly reaches capacity, a distributed hash table approach is necessary. This architecture is governed by the CAP theorem, which states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance; because network partitions are unavoidable in practice, real-world systems must tolerate them. Designers must therefore choose between a CP system (prioritizing strong consistency over availability, common in financial applications) and an AP system (prioritizing availability over consistency, potentially returning stale data).

Data is partitioned via Consistent Hashing, a technique that spreads keys evenly across many servers and minimizes data movement when nodes are added or removed. To ensure reliability and high availability, data is replicated asynchronously across N distinct servers, ideally placed in multiple data centers.

Consistency across these replicas is managed using Quorum Consensus: out of N total replicas, W is the number of acknowledgements required for a successful write and R is the number of responses required for a successful read. Strong consistency is achieved when W + R > N, because every read quorum then overlaps every write quorum, so a read always sees at least one up-to-date replica. The recommended consistency model for high-availability systems is eventual consistency. To resolve the inconsistencies that result from concurrent writes, versioning with Vector Clocks (lists of (server, version) pairs) is employed to distinguish conflicting versions (siblings) from versions that are simply ancestors of one another.
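Consistent hashing as described can be sketched with a hash ring and virtual nodes. This is a minimal Python illustration, not the book's implementation; the class name, MD5 as the ring hash, and the virtual-node count are all assumptions:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a key to a point on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Hash ring with virtual nodes so keys spread evenly across servers."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._keys = []    # sorted virtual-node positions on the ring
        self._owner = {}   # position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns many points on the ring.
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            bisect.insort(self._keys, pos)
            self._owner[pos] = node

    def remove_node(self, node):
        # Only keys whose successor vnode belonged to this node move.
        self._keys = [p for p in self._keys if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        """Walk clockwise to the first virtual node at or after the key's hash."""
        idx = bisect.bisect_left(self._keys, _hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]
```

Removing a node reassigns only the keys that node owned; every other key keeps its server, which is exactly the "minimal data movement" property the chapter highlights.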
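The quorum overlap condition can be stated directly in code. A one-line sketch (the function name is my own):

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Strong consistency requires every read quorum to intersect every
    write quorum, which holds exactly when W + R > N."""
    return w + r > n
```

With N = 3, the common configuration W = 2, R = 2 satisfies the condition, while a fast-but-weak W = 1, R = 1 does not.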
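The sibling-versus-ancestor test on vector clocks follows from element-wise comparison. A Python sketch, assuming clocks are represented as server-to-counter dicts (the representation and function name are my own):

```python
def compare_clocks(vc_a: dict, vc_b: dict) -> str:
    """Classify vc_a relative to vc_b.

    'ancestor'   -> vc_a happened before vc_b (no conflict)
    'descendant' -> vc_a happened after vc_b (no conflict)
    'sibling'    -> neither dominates: a genuine conflict to reconcile
    """
    a_le_b = all(v <= vc_b.get(server, 0) for server, v in vc_a.items())
    b_le_a = all(v <= vc_a.get(server, 0) for server, v in vc_b.items())
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"
    if b_le_a:
        return "descendant"
    return "sibling"
```

Two writes that each advanced a different server's counter (e.g. {s1: 2, s2: 1} versus {s1: 1, s2: 2}) come back as siblings, signalling a conflict the client must resolve.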
The system handles failure detection through the decentralized Gossip Protocol: each node periodically increments its own heartbeat counter and gossips its membership list to a random set of peers, and a node whose heartbeat has not advanced within a predefined period is marked offline.

Temporary failures are mitigated with Sloppy Quorum (choosing the first W healthy servers for writes and the first R for reads, skipping offline ones) and Hinted Handoff, where a substitute server processes requests on behalf of an offline server and hands the data back once it recovers. Permanent failures trigger the anti-entropy protocol, which uses Merkle Trees to compare replica data efficiently: replicas exchange root hashes and, only when those differ, descend to find and synchronize the specific inconsistent data buckets, greatly reducing the volume of data transferred.

The overall architecture is decentralized, with a coordinator node acting as a proxy between clients and the replicas. The write path logs the request to a commit log, saves it to a memory cache, and eventually flushes the cache to SSTables (Sorted-String Tables) on disk. The read path checks the memory cache first and then uses a Bloom Filter to determine efficiently which SSTables on disk might contain the requested key.
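The heartbeat-and-merge bookkeeping behind gossip-based failure detection can be sketched as follows; the class, timeout value, and tuple layout are illustrative assumptions, not the book's code:

```python
import time

class GossipMembership:
    """Per-node membership list: node -> (heartbeat counter, last-update time).
    A node is suspected down when its counter stops advancing for `timeout`."""

    def __init__(self, timeout: float = 10.0):
        self.timeout = timeout
        self.members = {}  # node -> (counter, last_seen)

    def heartbeat(self, node, counter, now=None):
        now = time.monotonic() if now is None else now
        current = self.members.get(node)
        # Only a strictly higher counter counts as fresh information.
        if current is None or counter > current[0]:
            self.members[node] = (counter, now)

    def merge(self, peer_members, now=None):
        """Gossip round: fold in a peer's list, keeping the higher counter."""
        for node, (counter, _) in peer_members.items():
            self.heartbeat(node, counter, now)

    def suspected_down(self, now=None):
        now = time.monotonic() if now is None else now
        return [n for n, (_, seen) in self.members.items()
                if now - seen > self.timeout]
```

Because every node both sends heartbeats and relays what it has heard, stale entries age out on all members without any central monitor.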
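The Merkle-tree comparison step can be illustrated with a flat bucket scheme: hash each bucket of keys, roll the bucket hashes up to a root, and only compare buckets when the roots disagree. Bucket count, hash choice, and all names below are my own simplifications:

```python
import hashlib

def _h(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def bucket_hashes(data: dict, buckets: int = 4) -> list:
    """Leaf hashes: each bucket covers a deterministic slice of the key space."""
    parts = [[] for _ in range(buckets)]
    for key in sorted(data):
        parts[int(_h(key), 16) % buckets].append(f"{key}={data[key]}")
    return [_h("|".join(p)) for p in parts]

def merkle_root(leaves: list) -> str:
    """Pairwise-hash levels upward until a single root remains."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]       # duplicate last hash if odd
        level = [_h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def differing_buckets(a: dict, b: dict, buckets: int = 4) -> list:
    """Cheap root comparison first; descend to leaves only on mismatch."""
    ha, hb = bucket_hashes(a, buckets), bucket_hashes(b, buckets)
    if merkle_root(ha) == merkle_root(hb):
        return []                              # replicas already in sync
    return [i for i in range(buckets) if ha[i] != hb[i]]
```

Identical replicas are confirmed in-sync by a single root-hash exchange; when they diverge, only the buckets whose leaf hashes differ need to be shipped across the network.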
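The commit-log / memory-cache / SSTable write path can be sketched in miniature. The flush threshold, in-memory structures, and class name are simplifications for illustration, not a real storage engine:

```python
class Memtable:
    """Write path sketch: log first, cache second, flush to a sorted table."""

    def __init__(self, flush_threshold: int = 4):
        self.flush_threshold = flush_threshold
        self.data = {}         # the memory cache
        self.commit_log = []   # stand-in for the on-disk commit log
        self.sstables = []     # flushed, immutable sorted key-value lists

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. persist to the commit log
        self.data[key] = value                 # 2. apply to the memory cache
        if len(self.data) >= self.flush_threshold:
            self.flush()                       # 3. spill the cache to disk

    def flush(self):
        # SSTables are written sorted by key, hence "Sorted-String Table".
        self.sstables.append(sorted(self.data.items()))
        self.data.clear()

    def get(self, key):
        """Read path: memory cache first, then SSTables newest to oldest."""
        if key in self.data:
            return self.data[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Writing the commit log before the cache is what makes the memory cache safe to lose: after a crash, replaying the log rebuilds it.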
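The Bloom filter's role on the read path rests on one guarantee: a negative answer is definitive, a positive answer is only "maybe". A minimal bit-array sketch, with the size and hash-count parameters chosen arbitrarily rather than tuned:

```python
import hashlib

class BloomFilter:
    """Integer-as-bit-array Bloom filter; parameters are illustrative."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive k independent bit positions by salting the key.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits & (1 << p) for p in self._positions(key))
```

On a read miss in the memory cache, the store consults each SSTable's filter and skips any table whose filter answers False, so only a few candidate SSTables are actually searched on disk.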