Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome back to the Deep Dive.
So last time we really got into the weeds on batch processing, you know, where we're dealing with bounded data, data that has a clear start and a clear end.
Think like a finished report or a huge, but ultimately finite archive.
Okay, but let's unpack this a little.
What happens when the data just never stops coming?
I mean, when you're dealing with the constant firehose of sensor readings or user clicks or global market updates, you can't just wait for the input to be complete because, well, it's never gonna be complete.
Exactly, and that's where we enter the world of stream processing.
Most real world data, you know, whether it's user activity or financial transactions, it just arrives incrementally over time.
This kind of data flow is fundamentally unbounded.
And if you try to force that continuous flow into a batch model, you have to artificially chop it up, you know, run a job every hour or maybe every day.
The biggest problem with that, the huge architectural drawback, is latency.
You're only seeing the state of the world as it was an hour ago.
Stream processing just abandons those fixed slices.
It's all about processing every single event as it happens, continually.
The number one priority is low delay.
And our mission for this deep dive is what to understand how these streams are represented, how they're reliably transmitted, and I guess most importantly, how we process them in real time to build all these responsive applications.
That's it, and how we keep all their different systems like caches and search indexes perfectly in sync.
Okay, so let's start with the absolute basic building block, the event, what is it?
An event is just a small, self -contained, immutable message.
It describes something that happened at a very specific point in time, so it has to have a timestamp.
Like a user adding an item to their cart, or maybe a server reporting its EPE usage.
Perfect examples.
And in any stream setup, you have producers, which are the things generating the events, and consumers, which subscribe to them and do the processing.
Related events get grouped into what we call a topic or a stream.
And the old way of doing this was polling, right?
Just constantly asking a database, hey, got anything new for me, anything new now?
Right, which is super slow and really inefficient.
For low delay, you need immediate notification, and that's where messaging systems come in.
But this leads to two really critical design questions.
Goliability and pacing.
Exactly, first, pacing.
How do you handle back pressure?
What if your producers are just way faster than your consumers?
So do you drop the messages,
or buffer them?
You could drop them, which is sometimes fine for, say, non -critical sensor data.
You could buffer them in a queue, but you risk running out of memory.
Or you could apply flow control, basically telling the producer to slow down until the consumer catches up, kind of like how a Unix pipe works.
And the second question is reliability.
What happens if a server crashes?
Do messages get lost?
And the answer to that really determines your whole architecture.
We see two main approaches.
First is direct communication.
This is the broker -less approach.
Like UDP multicast, which you see in finance for stock feeds,
or webhooks.
Yep, super fast, super low latency.
The downside is your application code has to handle everything.
Message loss, recovery,
and producers and consumers generally have to be online at the same time.
So the more common way is using a message broker.
Right, these are centralized servers that are just optimized for this.
They handle all the messy stuff, like client connectivity and durability.
But the traditional brokers, the ones that follow standards like JMS or AMQP, they often treat messages as transient.
Meaning they're just held temporarily.
Exactly, they get buffered, a consumer acknowledges that it received the message, and then the broker deletes it.
Gone forever.
Which means you can't ever replay old messages if you need to debug something or rebuild a system.
That seems like a pretty big limitation.
It's a huge limitation.
And with these brokers, you see two common patterns for consumers.
There's fan -out, where every message goes to all the consumers.
Simple enough.
And the other is load balancing, where each message goes to only one consumer in a group.
That's for parallel processing.
Yes, but this is where it gets really interesting.
This is where those traditional brokers just,
they hit a wall.
When you combine load balancing with the need for re -delivery, which you always need in case the consumer crashes mid -process, you completely break message ordering.
Okay, you have to explain that.
That sounds like a disaster for anything that depends on cause and effect.
It is.
So imagine a producer sends message M3, then right after, message M4, consumer two gets M3 and starts processing it, but then it crashes before it can acknowledge it.
In the meantime, the broker gives M4 to a different machine, consumer one.
Consumer one finishes with M4.
Now the broker realizes M3 was never acknowledged, so it re -delivers it to consumer one.
So consumer one ends up processing M4, then M3.
The order is completely inverted.
Wow.
Okay, so that whole messages are fleeting and deleted on read mindset is the fundamental problem here.
It is.
So the breakthrough idea was what if we combine the durability of a database with the low latency notification of a messaging system?
And that's the log -based paradigm.
It's the log -based paradigm.
Systems like Apache Kafka or Amazon Kinesis, they just use an append -only sequence of records on disk.
We call it a log.
It's a beautifully simple idea.
And to scale it out, you divide the log into partitions.
Each partition is its own totally ordered log, and they can live on different machines.
Okay, so a topic is just like a conceptual label for a bunch of these physical logs.
Precisely.
And the genius is that messages within a single partition are strictly ordered.
They each get a sequence number called an offset.
So that preserves causality.
And load balancing is just assigning an entire partition to one consumer.
Yep.
It's a coarser -grained parallelism, but it works.
And this leads to the biggest architectural win, the consumer offset.
Instead of the broker having to track acknowledgments for millions of individual messages, which is a huge coordination nightmare, the consumer just periodically saves its current offset number.
Ah, I see.
So you avoid all that bookkeeping.
If a consumer fails, a new one doesn't have to figure out what was processed.
It just looks up the last saved offset and starts reading from there.
It's a radical simplification.
It cuts down on so much overhead.
And because reading the log is a non -destructive read -only operation, fan -out is trivial.
Multiple consumers can all read the same stream without interfering with each other.
But what about disk space?
I mean, the log is always growing.
It's segmented.
So after a certain amount of time, or once a segment hits a certain size, old segments just get deleted or archived off to cheaper storage.
It's basically a big, durable rolling buffer.
So if a consumer gets really slow and falls behind, it might miss messages, but it doesn't break anything for the other consumers.
Exactly.
And this gives you the killer operational feature, the ability to replay old messages.
Because the data is still there on disk.
Right.
If you find a bug in your code, you can just reset that consumer's offset back to the beginning of the week and reprocess everything.
You get that same kind of deterministic reproducibility that we loved about batch processing.
And this connection between streams and databases goes even deeper, doesn't it?
It does.
A database replication log is a stream of write events.
A follower replica is just a stream processor.
Which becomes incredibly important when you're trying to solve that classic painful problem of keeping systems in sync.
Your database, your cache, your search index.
And the sailor mode here, the one you absolutely have to avoid, is something called dual writes.
Right, where your application tries to write to the database and the search index at the same time and just sort of hopes for the best.
And hope is not a strategy, right?
You get a race condition because of network latency and the systems just, they fall out of sync permanently.
You need a human to go in and fix it.
And trying to use something like two -phase commit across different systems, it's just, it's a nightmare.
Prohibitively complex.
So the elegant solution is change data capture or CDC.
Instead of writing to two places at once, you make your database the single source of truth.
You just observe all the changes happening in the database and publish them as an ordered stream.
All the other systems, the cache, the search index, they just subscribe to that stream and apply the changes.
So the database is the leader and everything else becomes a follower.
And that ordered stream prevents the dual write inconsistency.
Correct.
Okay, but hang on.
If the whole idea of streams is immutability, that we don't rewrite history,
how does log compaction fit in?
Because it sounds like it's actively deleting history.
That is a fantastic question.
It really gets to the heart of it.
When you apply compaction to a database change log, it throws away older updates for the same key.
It only keeps the most recent state.
So it does delete history.
It does.
But what you're left with, the compacted log, is effectively a full reconstructable copy of the current state of the database.
So it acts like a continuously updated snapshot of the database, but in stream form.
You could use it to bootstrap a brand new search index instantly without a slow manual data dump.
Precisely.
And this idea is formalized in a pattern called event sourcing where you explicitly design your application around writing these immutable high level events like student canceled course, instead of just low level database updates.
And if you zoom out a bit, you see that the mutable state in your database right now is just the mathematical integral of that immutable stream of events.
There are two sides of the same coin.
Exactly.
Storing the log durably means your state is always reproducible.
And this enables a really powerful pattern called CQRS or command query responsibility segregation.
Where you separate the right model, the event log from all the different read models.
Yes.
Your data analytics team can build their own specialized views from the stream without ever touching or slowing down the main production database.
It gives you incredible flexibility.
Okay, so we have these reliable streams.
How do we actually process them?
Two major applications.
One is complex event processing or CDP.
That's like a search engine for streams looking for specific complex patterns of events.
Any other?
Stream analytics.
This is more about computing stats and aggregations over lots and lots of events.
Which means you need to define boundaries, you need a window.
You do, and that immediately raises a really tricky question.
What clock do you use?
Because events are flying across a network, so clocks won't be perfectly synced.
Right.
Your first option is processing time.
You just use the local clock on the machine doing the processing.
It's simple, but it's very, very fragile.
How so?
Imagine your processor gets bogged down and builds up a 30 minute backlog of events.
When it finally catches up, it processes all those events in a very short burst.
Your analytics dashboard will show a massive artificial spike in activity that never actually happened.
Your metrics are just wrong.
Okay, so that leads us to the more complex, but more correct approach.
Event.
Time.
Right, event time uses the timestamp that was embedded in the event itself back when it actually occurred.
This is the only way to get accurate deterministic results.
But it means you have to deal with events arriving out of order.
The classic Star Wars analogy, right?
An event for a user clicking an ad arrives before the event for the search that showed them the ad in the first place.
Exactly, and that's the non -determinism you have to manage.
So we use a few common window types to handle this.
You have the tumbling window fixed length, non -overlapping.
Like processing sales for every minute on the minute?
Yep.
Then a hopping window, which is also fixed length, but it overlaps, like a five minute window that slides forward every one minute.
And the other two.
A sliding window, which covers all events within a certain time interval of each other.
And maybe most useful, the session window.
This one groups events by user activity and only closes after a certain period of inactivity.
It's perfect for tracking user journeys.
But with any of these, you still have the straggler problem.
A late event that arrives after you've already closed the window and published the result.
And your options are basically to either ignore it or publish a correction, which can get complicated.
Okay, last big topic,
joins.
They're hard enough in a static database.
How do they work in a world of unbounded streams?
They're tricky, there are three main types.
First is a stream stream join.
Like matching a search event with a click event.
Exactly, based on a common key, like a session ID, and within a certain time window.
The processor has to maintain state, basically a buffer of recent events, waiting for its counterpart to arrive.
Then there's the stream table join,
stream enrichment.
This is where you join an activity stream with a database change log.
You're basically augmenting the raw event with more context.
For example, adding user profile information from the database to an activity event.
For that to be fast, the processor probably needs a local copy of that database table, right?
It does, a local in -memory copy that's kept up to date by that CDC stream we talked about.
And the last one, table table join.
Think of it as materialized view maintenance.
You're joining two database change logs to constantly maintain a derived view.
The classic example is joining a stream of tweets with a stream of follower changes to keep every user's timeline cache perfectly up to date.
This all leads to the biggest headache in data engineering,
fault tolerance.
With batch, if a job fails, you just restart it, you get the same result.
That's exactly once semantics.
Or effectively once.
But since streams never end, you can't restart from the beginning.
So frameworks use a few tricks.
Some use micro -batching, breaking the stream into tiny one -second chunks and treating each one like a mini -batch job.
And others use checkpointing.
Right, like checkpointing in Flink, where you're periodically taking consistent snapshots of all the internal state and saving it to durable storage.
Okay, but wait, that sounds good for the internal state, but what about the output?
The side effects.
I mean, how do you stop it from, say, sending the same email twice if a task restarts?
That is the million -dollar question.
And you're right.
Checkpointing alone doesn't solve that external problem.
True, exactly.
Once requires an atomic commit across everything.
The input, the state, and the output.
It's like two -phase commit, and it's very hard to do.
So the more practical engineering solution is often idempotence.
Exactly.
An idempotent operation is when you can run 100 times and get the same result as running it once.
Setting a value is idempotent.
Incrementing a counter is not.
So how do you make a non -idempotent operation safe?
You make it effectively once by including the unique message offset in the output transaction.
Ah, so if the task fails and restarts when it tries to write the output again, the downstream system can see that it has already processed the result for that specific message offset, and it just ignores the duplicate.
That's the secret sauce.
The message offset acts as a deduplication key for your output.
So wrapping this all up, what's the big picture here?
We've seen that stream processing is really the engine for today's world of continuous unbounded data.
We're shifting from processing static files to processing continuous logs of change.
And the real success story here is that log -based model.
It's this beautiful blend of durability and low -latency messaging that lets us treat even our mutable databases as reproducible, immutable streams.
And that powers everything from CDC to real -time materialized views.
We also covered the real complexities of timing.
Why event time is so crucial for accuracy, even if it's a headache to manage, and the different ways we can join data that's constantly in motion.
And a final provocative thought for you to shoe on.
Even though immutability is the architectural goal we strive for with these logs, the real world sometimes gets in the way.
Things like privacy regulations might actually force us to introduce mechanisms to go back and truly rewrite history to excise or shun specific data from the log.
It just goes to show that absolute immutability is an ideal that we sometimes have to compromise on Thank you for joining us for this deep dive into the world of stream processing.