Chapter 9: S3-like Object Storage System Design

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Key design requirements include storing 100 petabytes of data, achieving six nines (99.9999%) of data durability, and supporting core functionality such as bucket creation, object upload and download, and object versioning. The architecture separates responsibilities into an API Service, an Identity and Access Management (IAM) component, a Data Store for object contents, and a Metadata Store for management information.

Data is written to a primary node, which replicates the object to secondary data nodes; how many replicas must confirm the write before it is acknowledged sets the trade-off between consistency and latency. To survive large-scale disasters, replicas are distributed across different Availability Zones (AZs). The chapter compares two approaches to durability: standard 3-copy replication, which carries 200% storage overhead and yields six nines of durability, and erasure coding, such as an 8+4 setup, which achieves higher durability (up to 11 nines) with a much lower storage overhead (50%). To detect data corruption, checksums are computed and verified throughout the data transfer process.

Optimizations are discussed for handling various object sizes, including merging many small objects into larger files so that disk blocks and inode capacity are not wasted. The Metadata Store is scaled by sharding the object table with a hash of the combined bucket name and object name.

More complex workflows are also detailed. Multipart upload transfers very large files efficiently by slicing them into smaller parts, and garbage collection and compaction automatically reclaim storage space from deleted objects and abandoned uploads. Object versioning treats each newly uploaded object as a new version: a new metadata entry (identified by a TIMEUUID) is inserted rather than overwriting the previous record.
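The storage overhead figures quoted for replication and erasure coding follow from simple arithmetic, sketched below. The function names are hypothetical and chosen for illustration only; the durability figures (six and 11 nines) come from the chapter and are not derived here.

```python
def replication_overhead(copies: int) -> float:
    """Extra storage beyond the original data, as a fraction (2.0 means 200%)."""
    return float(copies - 1)


def erasure_coding_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Extra storage for a (data + parity) erasure-coding layout, as a fraction."""
    return parity_chunks / data_chunks


def raw_capacity_pb(logical_pb: float, overhead: float) -> float:
    """Physical capacity needed to hold `logical_pb` of user data."""
    return logical_pb * (1 + overhead)


if __name__ == "__main__":
    rep = replication_overhead(3)          # 3 full copies
    ec = erasure_coding_overhead(8, 4)     # 8 data chunks + 4 parity chunks
    print(f"3-copy replication overhead: {rep:.0%}")   # 200%
    print(f"8+4 erasure coding overhead: {ec:.0%}")    # 50%
    print(f"Raw capacity for 100 PB, replication: {raw_capacity_pb(100, rep)} PB")
    print(f"Raw capacity for 100 PB, erasure coding: {raw_capacity_pb(100, ec)} PB")
```

For the 100 PB target, this works out to 300 PB of raw capacity with 3-copy replication versus 150 PB with an 8+4 layout.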
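Sharding the object table by a hash of the bucket name and object name can be sketched as follows. This is a minimal illustration under stated assumptions: the shard count, the SHA-256 hash, and the null-byte separator are all choices made for this example, not details from the chapter.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count for illustration


def shard_for(bucket_name: str, object_name: str, num_shards: int = NUM_SHARDS) -> int:
    """Map (bucket_name, object_name) to a shard index in [0, num_shards)."""
    # A null-byte separator keeps ("ab", "c") and ("a", "bc") from colliding.
    key = f"{bucket_name}\0{object_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    print(shard_for("photos", "2024/cat.png"))
    print(shard_for("photos", "2024/dog.png"))
```

Because the shard is a pure function of the two names, a lookup for a specific object goes to exactly one shard. Listing every object in a bucket, by contrast, has to touch all shards, since a bucket's objects are spread across them.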
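The multipart upload flow, combined with checksum verification, might look roughly like the sketch below. `InMemoryMultipartUpload` is a hypothetical in-memory stand-in for the data store, and SHA-256 is an assumed checksum algorithm; the chapter summary specifies neither.

```python
import hashlib


def checksum(body: bytes) -> str:
    """Checksum of a part (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(body).hexdigest()


def split_into_parts(data: bytes, part_size: int) -> list[bytes]:
    """Slice a large object into fixed-size parts; the last part may be shorter."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


class InMemoryMultipartUpload:
    """Hypothetical stand-in for the server side of a multipart upload."""

    def __init__(self) -> None:
        self.parts: dict[int, bytes] = {}

    def upload_part(self, part_number: int, body: bytes, expected: str) -> str:
        # Verify the checksum sent by the client before accepting the part.
        actual = checksum(body)
        if actual != expected:
            raise ValueError(f"checksum mismatch on part {part_number}")
        self.parts[part_number] = body
        return actual

    def complete(self) -> bytes:
        # Reassemble the object from its parts in part-number order.
        return b"".join(self.parts[n] for n in sorted(self.parts))


if __name__ == "__main__":
    data = bytes(range(256)) * 10          # 2560 bytes of sample data
    upload = InMemoryMultipartUpload()
    for number, part in enumerate(split_into_parts(data, 1000), start=1):
        upload.upload_part(number, part, checksum(part))
    print(upload.complete() == data)
```

Parts are keyed by number, so they can arrive in any order and a failed part can be retried without resending the rest of the file.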
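Versioning by insertion rather than overwrite can be sketched with a small in-memory metadata table. `VersionedMetadataStore` and its method names are hypothetical; Python's `uuid.uuid1()` (a time-based UUID) is used here as a stand-in for a TIMEUUID.

```python
import uuid
from typing import Optional


class VersionedMetadataStore:
    """Hypothetical metadata table that appends a row per uploaded version."""

    def __init__(self) -> None:
        # (bucket, object_name) -> list of (version_id, data_location), oldest first
        self._rows: dict[tuple[str, str], list[tuple[uuid.UUID, str]]] = {}

    def put(self, bucket: str, object_name: str, data_location: str) -> uuid.UUID:
        # Insert a new entry instead of overwriting the previous record.
        version_id = uuid.uuid1()  # time-based UUID standing in for a TIMEUUID
        self._rows.setdefault((bucket, object_name), []).append((version_id, data_location))
        return version_id

    def get(self, bucket: str, object_name: str,
            version_id: Optional[uuid.UUID] = None) -> str:
        rows = self._rows[(bucket, object_name)]  # KeyError if the object is unknown
        if version_id is None:
            return rows[-1][1]  # the most recently inserted version is current
        for vid, location in rows:
            if vid == version_id:
                return location
        raise KeyError(f"no such version: {version_id}")

    def list_versions(self, bucket: str, object_name: str) -> list[uuid.UUID]:
        return [vid for vid, _ in self._rows.get((bucket, object_name), [])]


if __name__ == "__main__":
    store = VersionedMetadataStore()
    v1 = store.put("docs", "report.txt", "data-node-1/blob-a")
    v2 = store.put("docs", "report.txt", "data-node-2/blob-b")
    print(store.get("docs", "report.txt"))        # location of the newest version
    print(store.get("docs", "report.txt", v1))    # older version is still readable
```

Since old rows are never overwritten, earlier versions stay retrievable by version ID until they are explicitly deleted and their space is reclaimed.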