Chapter 3: Storage & Retrieval Systems
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
For OLTP workloads, the discussion focuses on indexing structures, starting with simple log-structured storage, which relies on append-only sequential writes (highly efficient) and uses an index, such as the in-memory hash map in systems like Bitcask, to locate each value by its offset in the log. Simple hash-index implementations have clear limitations: every key must fit in memory, range queries are inefficient, and segments must periodically be compacted to discard overwritten records and reclaim space.

An advancement of this model is the Log-Structured Merge-Tree (LSM-tree), which stores data in Sorted String Tables (SSTables), segment files sorted by key. Sorting enables sparse in-memory indexes (one entry per block of keys rather than one per key) and efficient mergesort-style background compaction, making LSM-trees favorable for high write throughput, despite suffering from write amplification (the same data is rewritten multiple times during compaction).

In contrast, the most widely used index, the B-tree, employs a page-oriented approach, organizing data into fixed-size blocks (pages) in a balanced tree structure, so any key can be found in a predictable, small number of disk seeks. Because B-trees update pages in place, they require an additional Write-Ahead Log (WAL, also called a redo log) to ensure durability and crash recovery, as well as concurrency-control mechanisms such as latches to protect pages being modified.

Moving beyond transactional systems, the chapter explores the need for data warehousing, where Extract–Transform–Load (ETL) processes move data from OLTP systems into separate databases optimized for OLAP analytics. These systems often employ schemas like the star schema, which links a central fact table to surrounding dimension tables, to organize large datasets.

Crucially, analytic workloads benefit significantly from column-oriented storage, which stores all values of a single column together instead of row by row. This layout drastically reduces the I/O bandwidth needed, because queries read only the relevant columns, and it allows for superior compression through techniques like bitmap encoding and run-length encoding, which in turn facilitates efficient vectorized processing on modern CPUs.
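To make the log-and-hash-index idea concrete, here is a minimal Python sketch in the spirit of Bitcask (the class name `HashIndexedLog` and the comma-separated record format are illustrative choices, not from the text):

```python
import os

class HashIndexedLog:
    """Append-only log with an in-memory hash index, Bitcask-style.
    The index maps each key to the byte offset of its newest record,
    so a read costs one seek plus one line read. Keys and values are
    assumed not to contain ',' or newlines (a toy framing format)."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> byte offset of latest record
        open(path, "ab").close()   # make sure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)   # sequential append
            f.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset              # newest offset wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode("utf-8").rstrip("\n")
            return line.split(",", 1)[1]

    def compact(self):
        """Rewrite only the latest value of each key, dropping
        overwritten records to reclaim disk space."""
        live = {k: self.get(k) for k in list(self.index)}
        os.remove(self.path)
        self.index.clear()
        for k, v in live.items():
            self.set(k, v)
```

Overwriting a key simply appends a new record and updates the hash map, which is why compaction is eventually needed to discard the stale records left behind.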
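The SSTable machinery can likewise be sketched: merging two key-sorted segments (with the newer segment's value winning on duplicate keys) and a sparse index that records only every Nth key. Function names such as `merge_sstables` are hypothetical, and in-memory lists of `(key, value)` pairs stand in for segment files:

```python
import bisect
from heapq import merge
from itertools import groupby

def merge_sstables(newer, older):
    """Merge two key-sorted segments in one sequential pass.
    Each entry is tagged with its segment's age so that, for duplicate
    keys, groupby sees the newer segment's value first and keeps it."""
    tagged = merge(((k, 0, v) for k, v in newer),
                   ((k, 1, v) for k, v in older))
    return [(k, next(g)[2]) for k, g in groupby(tagged, key=lambda e: e[0])]

def sparse_index(sstable, every=2):
    """Because the segment is sorted, indexing every Nth key suffices."""
    return [(sstable[i][0], i) for i in range(0, len(sstable), every)]

def lookup(sstable, index, key):
    """Binary-search the sparse index, then scan one short range."""
    keys = [k for k, _ in index]
    j = bisect.bisect_right(keys, key) - 1
    if j < 0:
        return None
    end = index[j + 1][1] if j + 1 < len(index) else len(sstable)
    for k, v in sstable[index[j][1]:end]:
        if k == key:
            return v
    return None
```

The merge is the same sequential pass that background compaction performs on disk, which is why it stays cheap even for large segments.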
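A small sketch of the column-oriented ideas, assuming in-memory lists stand in for column files (`to_columns` and the RLE helpers are illustrative names, not from the text):

```python
from itertools import groupby

def to_columns(rows, names):
    """Column-oriented layout: each column's values are stored
    contiguously, so a query touching one column never reads the rest."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def rle_encode(column):
    """Run-length encoding: runs of repeated values (common in sorted,
    low-cardinality columns) collapse into (value, count) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(column)]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]
```

Long runs of identical values compress dramatically, and operating directly on the compressed runs is one of the things that makes vectorized execution effective.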
Because writes into sorted column files are expensive, purely column-oriented systems often handle recent changes LSM-style: new writes accumulate in an in-memory structure and are merged into the column files in bulk. Finally, the chapter notes that performance can be boosted further using materialized aggregates, such as materialized views and data cubes, which precompute sums, counts, or averages along different dimensions to accelerate common reporting queries.
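A data cube can be sketched as a dictionary of precomputed aggregates; the fact-row shape `(date, product, amount)` and the function names are assumed examples, not from the text:

```python
from collections import defaultdict

def build_cube(facts):
    """Precompute a two-dimensional data cube: total amounts grouped
    by (date, product). Each fact row is (date, product, amount)."""
    cube = defaultdict(float)
    for date, product, amount in facts:
        cube[(date, product)] += amount
    return cube

def total_for_product(cube, product):
    """Roll up along the date dimension, reading only the small set
    of precomputed cells instead of rescanning every fact row."""
    return sum(v for (_, p), v in cube.items() if p == product)
```

Reports that repeatedly ask the same grouped questions read these tiny cells instead of the full fact table; the trade-off is that the cube must be refreshed when the underlying facts change.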