Chapter 5: Design Consistent Hashing
Loading audio…
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
The core motivation for this approach is solving the rehashing problem encountered in traditional load-balancing methods, such as using the modulo operation, where changing the server pool size causes almost all existing keys to be remapped, resulting in widespread cache misses and system instability. Consistent hashing drastically minimizes this redistribution burden; when the system resizes, only k/n keys, where k is the number of keys and n is the number of slots, need remapping on average. The mechanism utilizes a continuous hash space mapped onto a circular hash ring (such as one generated by SHA-1 hash functions). Servers, based on their identifiers, and data keys are mapped onto this ring. Key lookups are performed by traversing the ring clockwise from the key's position until the first server node is encountered. Although this basic method significantly reduces data movement when servers are added or removed, it suffers from issues like unequal partition sizes between adjacent servers and potential non-uniform key distribution, leading to the hotspot key problem. These weaknesses are mitigated through the use of virtual nodes (or replicas), where each physical server is represented by numerous virtual identifiers placed strategically around the ring. Increasing the number of virtual nodes leads to a smaller standard deviation, ensuring more balanced data distribution, a critical trade-off against the increased space needed to store node data. The chapter also outlines how to find the specific range of affected keys during server changes: when adding a server, keys are redistributed from the newly added node anticlockwise to the next server found; when removing a server, the affected keys lie anticlockwise between the remaining server and the removed server, and these keys are remapped to the next clockwise remaining server. Consistent hashing is utilized in major large-scale systems, including Amazon's Dynamo, Apache Cassandra, Discord, and Akamai CDN.