Chapter 11: Stream Processing & Real-Time Dataflows

Loading audio…

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

If there is an issue with this chapter, please let us know → Contact Us

Stream Processing & Real-Time Dataflows details methods for transmitting event streams via messaging systems, distinguishing between direct network communication and centralized message brokers. A crucial comparison is drawn between traditional message brokers (such as those using JMS or AMQP, which typically delete messages upon acknowledgment and are optimized for transient communication) and log-based message brokers (like Apache Kafka), which leverage append-only, partitioned logs and offsets to provide message durability, ensure ordering within a partition, and enable multiple independent consumers to read the history of events. This durable log structure is instrumental in implementing Change Data Capture (CDC), which transforms database writes into a verifiable, ordered stream of events, solving the concurrency and inconsistency issues inherent in dual writes when synchronizing derived data systems like caches or search indexes. Furthermore, the paradigm of Event Sourcing is introduced, where application state is explicitly derived from an immutable history of application-level events, cementing the idea that mutable state is merely an integrated, read-optimized view of the changelog. The text then examines three primary applications of stream processing: Complex Event Processing (CEP), which focuses on detecting patterns in event sequences; Stream Analytics, which typically involves calculating statistical metrics or aggregations over time; and Materialized View Maintenance, which ensures derived data systems remain current. A substantial challenge in this domain is reasoning about time, requiring careful distinction between event time (when the event occurred) and processing time (when the system handled it), especially when defining windowing operations—such as Tumbling, Hopping, Sliding, and Session windows—and dealing with delayed or straggler events. The chapter concludes by classifying stream joins into stream-stream, stream-table (often used for event enrichment via CDC), and table-table joins, acknowledging the complexity of time-dependence in join results. Finally, methods for fault tolerance are discussed, including microbatching and checkpointing, alongside the reliance on idempotence or atomic commits to ensure exactly-once semantics (more descriptively, effectively-once) when processing events and managing external side effects.