Chapter 12: The Future of Data Systems
This final chapter shifts perspective from current practice toward future architectural improvements, asking how data systems should be built so that they are reliable, correct, evolvable, and ultimately beneficial to humanity.

The core architectural challenge is data integration: modern applications inevitably combine specialized tools (such as an OLTP database alongside a separate full-text search index) because no single piece of software suits all data access patterns. The most promising integration approach is log-based derived data: primary information is written to a system of record, and derived views (materialized views, indexes, caches) are updated asynchronously from an ordered stream of immutable events, such as those produced by change data capture (CDC) or event sourcing.

This architecture supports unbundling the database: treating a database as a collection of loosely coupled components (replication logs, index maintenance, and so on) that can be distributed across specialized services, offering better fault tolerance than costly synchronous distributed transactions.

Reprocessing large datasets with batch jobs, alongside stream processing for low-latency updates, is key to application evolution and gradual schema migrations. Where the traditional lambda architecture ran parallel batch and streaming systems, modern designs aim for unified engines that handle both seamlessly.

The chapter characterizes this philosophy as the "database inside-out" approach: dataflow programming in which application logic consists of deterministic derivation functions that react to state changes, extending the data's write path all the way to stateful, offline-capable client devices through end-to-end event streams.
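The log-based approach can be illustrated with a minimal sketch. Here, a single ordered, append-only event log is the system of record, and two derived views (a key-value materialized view and a tiny inverted index) are each maintained by deterministically applying the same events in log order. The names (`EventLog`, `apply_to_kv_store`, `apply_to_index`) are illustrative, not from any real library:

```python
from collections import defaultdict

class EventLog:
    """Append-only system of record; each consumer tracks its own offset."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # the event's log offset

def apply_to_kv_store(view, event):
    """Derived view 1: a key-value materialized view."""
    if event["op"] == "put":
        view[event["key"]] = event["value"]
    elif event["op"] == "delete":
        view.pop(event["key"], None)

def apply_to_index(index, event):
    """Derived view 2: an inverted index from words to keys."""
    if event["op"] == "put":
        for word in str(event["value"]).split():
            index[word].add(event["key"])

log = EventLog()
log.append({"op": "put", "key": "doc1", "value": "stream processing"})
log.append({"op": "put", "key": "doc2", "value": "batch processing"})

# Each consumer replays the log independently and asynchronously;
# because both apply immutable events in the same order, the views
# stay consistent with the log even if one consumer lags the other.
kv, index = {}, defaultdict(set)
for event in log.events:
    apply_to_kv_store(kv, event)
    apply_to_index(index, event)
```

Because the log is immutable and ordered, a derived view can always be rebuilt from scratch by replaying the log, which is exactly what makes reprocessing and schema migration tractable in this architecture.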
To ensure robust behavior in the face of faults, the chapter distinguishes timeliness (how quickly derived data is brought up to date) from integrity (the absence of data corruption or loss), and argues that stream processing systems can achieve strong integrity guarantees (via exactly-once semantics and idempotence based on end-to-end request identifiers) without resorting to costly synchronous coordination. This emphasis on correctness echoes the end-to-end argument: fault tolerance must ultimately be implemented at the application layer, not solely by low-level system components.

Finally, the chapter urges engineers to consider the ethical dimensions of data systems. It warns against the harms of predictive analytics, which can reinforce systemic bias ("machine learning is like money laundering for bias") and impose social constraints dubbed the "algorithmic prison." It characterizes pervasive corporate data collection as surveillance that transfers control of private data from the individual to the corporation, and it calls for a culture of "trust, but verify": continuous auditing and integrity checks to cope with the inevitable reality of hardware faults and software bugs.
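The end-to-end request identifier idea can be sketched briefly. A client attaches a unique ID to each request, and the processor durably records which IDs it has applied, so a message redelivered by lower layers (after a retry or crash recovery) takes effect exactly once. The `Account` class and `deposit` operation here are hypothetical illustrations, not from the book's text:

```python
class Account:
    """Toy stateful processor that applies each request at most once."""
    def __init__(self, balance=0):
        self.balance = balance
        self.processed_ids = set()  # would be stored durably in practice

    def deposit(self, request_id, amount):
        # Deduplicate end-to-end: no matter how many times lower layers
        # deliver this message, the effect is applied exactly once.
        if request_id in self.processed_ids:
            return self.balance
        self.processed_ids.add(request_id)
        self.balance += amount
        return self.balance

acct = Account()
acct.deposit("req-42", 100)  # first delivery: applied
acct.deposit("req-42", 100)  # retried delivery of the same request: ignored
```

The key point of the end-to-end argument is that this check lives in the application, keyed by an identifier that flows all the way from the client: reliable delivery at the transport or broker layer alone cannot suppress a duplicate caused by a client-side retry.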