Chapter 12: The Future of Data Systems

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

We've spent a lot of time unpacking the, let's say, complex and often messy realities of data systems as they are today.

You know, how they scale, how they fail, all the gritty details.

Exactly.

But today we're taking a huge philosophical leap.

We're diving into the final chapter of the material, which asks a totally different question.

How should we build these systems?

It's a fundamental shift, isn't it?

Yeah.

Moving beyond just survival, meaning reliability, scalability, and maintainability, and toward, well, purpose.

The goal here is to design applications that aren't just technically sound, but are robust, correct, evolvable, and ultimately, you know, actually beneficial to people.

It reminds me of that quote the material cites from St. Thomas Aquinas.

Oh, the one about the captain.

If the highest aim of a captain was to preserve his ship, he would keep it in port forever.

It's just perfect.

It really is.

It sets the challenge.

If all you do is optimize for preservation, for not sinking, you never actually go anywhere.

You have to push the boundaries a bit.

You have to leave the port.

Yeah.

And to chart that course, we're going to navigate four big themes from the reading.

Data integration, then this really radical idea of unbundling databases, followed by the pursuit of correctness, and finally...

The big one.

The big one.

The ethical side of things, what the material calls doing the right thing.

Okay.

Let's start with data integration, because in reality, no single tool does everything well.

You might have your main OLTP database, but then you need a special tool for full text search or another for geospatial queries.

Right.

So you're always having to cobble together these different pieces of software.

And this is where so many systems fall apart.

The old way, the anti-pattern, is something called dual writes.

Where the application code tries to write to the database and update the search index at the same time.

Exactly.

And the material is so clear on this, it's almost a commandment.

Never use dual writes.

It is a guaranteed path to inconsistency.

Because the two systems might process concurrent writes in a different order.

Precisely.

You get into a state of permanent, perpetual inconsistency.

So, if dual writes are forbidden, what's the alternative?

It has to be this idea of deriving data.

That's it.

You write everything to a single system of record first.

That system establishes the one true order of events.

Everything else, your caches, your search indexes, all of it is just derived from that authoritative source.

Usually via change data capture or event sourcing.
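
To make the idea of derived data concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not code from the material: `EventLog`, `derive_store`, and `derive_search_index` are hypothetical names, and an in-memory list stands in for a real change data capture or event sourcing log.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only system of record; list position is the total order."""
    events: list = field(default_factory=list)

    def append(self, key, value):
        self.events.append((key, value))
        return len(self.events) - 1  # offset of the new event


def derive_store(log):
    """Key-value view: the last write for each key wins, in log order."""
    store = {}
    for key, value in log.events:
        store[key] = value
    return store


def derive_search_index(log):
    """Inverted index derived from the same events in the same order."""
    index = {}
    for key, text in derive_store(log).items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(key)
    return index


log = EventLog()
log.append("doc1", "hello world")
log.append("doc1", "goodbye world")  # competing writes are serialized by the log

store = derive_store(log)
index = derive_search_index(log)
```

Because both views consume the same ordered log, they cannot disagree about which write to `doc1` came last, which is exactly the failure mode dual writes allow.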

Now, this log-based approach, it sounds a little bit like traditional distributed transactions.

Maybe using two-phase commit or XA.

They also try to guarantee consistency.

It's a good comparison, but they have a very different philosophy.

Distributed transactions, well, they give you strong guarantees like linearizability, but they're brittle.

XA has terrible performance and awful fault tolerance.

If one part fails, the whole thing grinds to a halt.

And the log -based approach.

It's asynchronous.

It prioritizes success over immediate synchronous failure.

So, you might trade a little bit of timeliness, but what you gain is this incredible robustness.

But that total ordering, forcing everything through a single log, that can only scale so far, right?

Once you have to partition your data, you lose that global order.

You do.

You lose it when you partition.

You lose it with multi-leader replication across different continents.

And you definitely lose it in a microservices world where services own their own state.

And when you lose total order, you invite these subtle problems with causality.

Give us a concrete example of that.

What does a causality bug look like?

The book has a great one.

Imagine a user unfriends their ex-partner on a social network.

Then, right after, they send a rude message to their remaining friends.

Okay.

Now, what if the unfriend event goes to one partition and the message event goes to another?

If the system processes the message before the unfriend event, well, the ex-partner might get a notification with that rude message.

The causal link, that the unfriend must happen before the message, was lost.

Ouch.

That is a memorable bug.

So, the tools we use to manage all this are batch and stream processing.

Right.

Batch for your historical finite data and stream for the unbounded real-time stuff.

Now, historically, people tried to solve this with something called the lambda architecture.

The dual pipelines.

Right.

A slow batch one for correctness and a fast stream one for speed.

The material is not a fan.

Not at all.

And for good reason.

It was double the work.

You had to write and maintain the same logic in two completely different systems.

And merging the outputs was a total mess.

The goal now is to unify them.

Use a stream processor that can also replay historical events.

And this whole idea of using external stream processors to build things like indexes and views, it leads to this bigger concept, right?

Unbundling databases.

Yes.

Think about the philosophy of Unix versus a traditional database.

Okay.

Unix is all about small, composable tools that do one thing well, connected by pipes.

A database is a giant monolith: storage, indexing, queries, all in one box.

Exactly.

And the paradox is that today we're taking all those integrated database features, secondary indexes, materialized views, replication logs, and we're building them externally with these stream processors.

So, the data flow across the whole organization starts to look like one enormous, unbundled database.

It's a powerful mental model.

And it's a shift from a synchronous CRUD model (create, read, update, delete) to a reactive, event-driven one. You're not polling a passive database anymore.

Your application code subscribes to streams of state changes and reacts to them.

Which also makes the system more robust, I imagine, if your search index goes down.

The event log just buffers the writes.

The rest of the system is completely unaffected.

It's that loose coupling.

This also deepens the separation between application code and state.

The code becomes stateless, and the state lives in these durable external systems.

And that has huge performance benefits.

Remember the currency conversion example?

Oh, yeah.

The synchronous approach, where you make an RPC call to an exchange rate service for every single transaction.

Versus the reactive approach, where your service just subscribes to the stream of exchange rate updates, keeps the latest rate locally, and serves requests from its own memory.

You've completely eliminated the network round trip.

And as the source says, the fastest network request is no network request at all.
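
The reactive version of the currency conversion example can be sketched roughly as follows. This is a simplified illustration: `RateCache` and `on_rate_update` are hypothetical names, and a plain loop stands in for a real subscription to a stream of exchange rate updates.

```python
class RateCache:
    """Holds the latest exchange rate per currency pair, fed by a stream."""

    def __init__(self):
        self._rates = {}

    def on_rate_update(self, pair, rate):
        # Called once per event on the (hypothetical) rate-update stream.
        self._rates[pair] = rate

    def convert(self, amount, pair):
        # Served from local memory: no network call on the request path.
        return amount * self._rates[pair]


cache = RateCache()

# Stand-in for events arriving from the subscription, oldest first.
for pair, rate in [("USD/EUR", 0.5), ("USD/EUR", 0.25)]:
    cache.on_rate_update(pair, rate)

converted = cache.convert(100, "USD/EUR")  # uses the most recent rate seen
```

The trade-off is the one discussed throughout: the local rate may be slightly stale, but each request avoids a synchronous call to another service.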

This forces us to think really clearly about the boundary between the write path and the read path.

All these derived datasets, caches, and indexes, they exist to manage that trade-off.

They do.

It's about shifting work.

An index shifts work from the read path, which is lazy, to the write path, which is eager.

No index means the read has to scan everything.

Precomputing every possible query means the write path is infinite, and an index is the compromise.
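
A small sketch of that trade-off, assuming a toy in-memory table (the `IndexedTable` class and the sample records are made up for illustration):

```python
records = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Oslo"},
]


def find_by_scan(rows, city):
    # No index: all the work happens lazily, at read time.
    return [row["id"] for row in rows if row["city"] == city]


class IndexedTable:
    def __init__(self):
        self.rows = []
        self.by_city = {}

    def insert(self, row):
        # Eager work on the write path keeps the index current.
        self.rows.append(row)
        self.by_city.setdefault(row["city"], []).append(row["id"])

    def find(self, city):
        # The read path becomes a single dictionary lookup.
        return self.by_city.get(city, [])


table = IndexedTable()
for row in records:
    table.insert(row)
```

Both `find_by_scan(records, "Oslo")` and `table.find("Oslo")` return the same IDs; the difference is only in where the work was done.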

And this whole philosophy can even extend all the way out to the client, like a mobile app.

Absolutely.

The app gets its initial state, sure, but then it can subscribe to a stream of changes from the server.

It makes it incredibly responsive and allows it to work offline.

It just picks up the event stream where it left off, using the same kind of consumer offset we see in systems like Kafka.
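
As a rough sketch of that resume-from-offset behavior, assuming the server exposes its changes as an ordered list of events (the `Client` class and the event shape are hypothetical, and this is not Kafka's actual client API):

```python
class Client:
    """Keeps local state plus the offset of the next event it needs."""

    def __init__(self):
        self.state = {}
        self.offset = 0

    def sync(self, server_log):
        # Apply only the events this client has not seen yet.
        for key, value in server_log[self.offset:]:
            self.state[key] = value
        self.offset = len(server_log)


server_log = [("title", "Draft")]
client = Client()
client.sync(server_log)  # initial catch-up

# The client goes offline while the server keeps appending events.
server_log.append(("title", "Final"))
server_log.append(("author", "Ada"))

client.sync(server_log)  # reconnect: resumes from the stored offset
```

The only thing the client has to remember across the offline period is a single integer, which is what makes the reconnect cheap.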

Okay, let's pivot back to something we touched on earlier, correctness.

If we're abandoning expensive synchronous distributed transactions, how do we possibly guarantee data integrity?

By embracing what's called the end-to-end argument for databases.

The idea is that low-level guarantees, like TCP preventing duplicate packets, they aren't enough.

Why not?

Because the client application itself might retry an operation after a timeout.

If that operation isn't idempotent, like a money transfer, you could end up sending money twice.

The logic has to live in the application, not just the network layer.

So what's the end -to -end solution for that money transfer?

It's actually beautifully simple.

The client generates a unique operation ID, like a UUID, and passes it with the request all the way to the database.

The system then just enforces a uniqueness constraint on that ID.

So if the client retries, the second attempt to insert the same ID just fails the uniqueness check.

And the operation is guaranteed to happen exactly once.

It becomes idempotent end-to-end.
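
One way this could look in practice, sketched with Python's built-in `sqlite3` module. The `transfers` table, its columns, and the `apply_transfer` helper are assumptions made for the example; the material describes the technique, not this schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfers (op_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)


def apply_transfer(conn, op_id, amount):
    """Return True if applied, False if this op_id was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO transfers (op_id, amount) VALUES (?, ?)",
                (op_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        # The uniqueness constraint rejected a duplicate operation ID.
        return False


op_id = str(uuid.uuid4())  # generated once by the client, reused on retry
first = apply_transfer(conn, op_id, 100)
retry = apply_transfer(conn, op_id, 100)  # e.g. after a client-side timeout
total = conn.execute("SELECT SUM(amount) FROM transfers").fetchone()[0]
```

The retry is rejected by the primary key, so the amount is recorded once no matter how many times the client resends the request.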

Okay, that handles a single write.

What about something more complex, like transferring money between two different accounts, which might live on different partitions?

How do you do that without two-phase commit?

This is where stream processing really shines.

It's a three-step dance.

First, the client logs the transfer request with its unique ID.

That's one atomic write.

Step one, log the intent.

Step two, a stream processor picks up that request and deterministically emits the derived operations.

A debit for account A's partition and a credit for account B's partition.

And because it's deterministic, it guarantees that if the request exists, both instructions will be created.

Always.

And step three, the downstream processors that manage the accounts apply those instructions.

But crucially, they use the original request ID to deduplicate.

So even if they see the same instruction twice, they only apply it once.

You get correctness without any synchronous coordination.
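
The three steps can be sketched in a few lines of Python. This is a toy model under stated assumptions: `derive_instructions` and `AccountPartition` are hypothetical names, each partition is an in-memory object, and redelivery is simulated by processing every instruction twice.

```python
def derive_instructions(request):
    """Step 2: deterministically turn one request into a debit and a credit."""
    rid = request["request_id"]
    return [
        {"request_id": rid, "account": request["from"], "delta": -request["amount"]},
        {"request_id": rid, "account": request["to"], "delta": request["amount"]},
    ]


class AccountPartition:
    """Step 3: applies instructions, deduplicating by request ID and account."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.applied = set()

    def apply(self, instr):
        key = (instr["request_id"], instr["account"])
        if key in self.applied:
            return  # duplicate delivery: already applied, so ignore it
        self.applied.add(key)
        self.balances[instr["account"]] += instr["delta"]


# Step 1: the client appends a single request, with its unique ID, to the log.
request_log = [{"request_id": "req-1", "from": "A", "to": "B", "amount": 30}]

partitions = {"A": AccountPartition({"A": 100}), "B": AccountPartition({"B": 5})}

for request in request_log:
    # Multiplying the list by 2 simulates every instruction arriving twice.
    for instr in derive_instructions(request) * 2:
        partitions[instr["account"]].apply(instr)
```

Even with every instruction delivered twice, account A ends at 70 and account B at 35, and the two partitions never had to coordinate with each other.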

This really forces us to clarify what we mean by consistency because it's such an overloaded term.

It is.

The material splits it into two ideas.

First, there's timeliness.

That's about seeing the most up-to-date state.

A violation there is temporary, you know, eventual consistency.

And the second idea is integrity.

That's the absence of corruption.

Violations there are permanent, perpetual inconsistency.

And that distinction is everything.

If my bank balance is five minutes out of date, that's a timeliness issue.

It's annoying.

If my bank balance doesn't actually equal the sum of my transactions.

That's a catastrophic integrity failure.

And integrity is non-negotiable.

Log-based streaming is fantastic at preserving integrity, even if it sometimes sacrifices immediate timeliness.

And that trade-off allows for something called loosely interpreted constraints.

Right, which sounds a bit scary.

But for many businesses, a temporary violation of a rule is okay as long as you have a plan to fix it.

Think about an airline overbooking a flight.

They accept the temporary violation.

And they have a compensating transaction ready, an apology, a voucher, a rebooking.

If the business can handle that, you can avoid the massive performance penalty of synchronous coordination.

So we build these systems with logs and IDs and guarantees, but we still have to assume things can go wrong.

Bugs happen, hardware fails.

So we have to, what, trust but verify?

Always.

Continuous auditing is essential.

And event-based systems are amazing for this because you can always take your raw event log and rerun the entire derivation process from scratch.

If the result doesn't match your live state, you've just found a bug or silent data corruption.
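
A minimal sketch of such an audit, assuming account balances are derived by summing deltas from the event log (the `derive_balances` and `audit` helpers and the sample data are illustrative only):

```python
def derive_balances(events):
    """Recompute every balance from scratch by replaying the raw event log."""
    balances = {}
    for account, delta in events:
        balances[account] = balances.get(account, 0) + delta
    return balances


def audit(events, live_state):
    """Return the accounts whose live value differs from the re-derived one."""
    expected = derive_balances(events)
    accounts = set(expected) | set(live_state)
    return sorted(a for a in accounts if expected.get(a) != live_state.get(a))


events = [("A", 100), ("A", -30), ("B", 30)]
live_state = {"A": 70, "B": 35}  # B has silently drifted from the log

mismatches = audit(events, live_state)
```

Here the audit flags account B, because replaying the log gives 30 while the live state says 35.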

Which brings us to the final and maybe most important part of this whole discussion, moving from the technical to the ethical.

Yeah.

The sheer power of these systems means we, as engineers, have to think about how we treat data about people with humanity and respect.

And this gets very real, very fast when you talk about predictive analytics.

Using it to predict weather is one thing.

Using it to predict if someone will default on a loan or reoffend, that's another world.

It is.

The source warns about creating an algorithmic prison where people are systematically shut out of opportunities by opaque models with no recourse.

And these models, they're not neutral.

They amplify bias.

If your historical data reflects societal discrimination.

The machine learning model will just learn to discriminate, but with a veneer of mathematical objectivity.

It's been called money laundering for bias.

It's a powerful phrase.

Algorithms can only extrapolate from the past.

It takes human moral imagination to build a better future.

And this ties directly into privacy and tracking.

When an ad-funded service is free, you're not the customer, the advertiser is.

And the data collection becomes, well, it becomes surveillance.

The material suggests a simple but chilling thought experiment.

Every time you see the word data, replace it with surveillance.

Are you building a surveillance warehouse or a real-time surveillance stream?

It forces you to confront what's actually happening.

And the idea of user consent feels weak when these services are basically essential for social participation.

You can't really opt out of Google or Facebook without cutting yourself off from society.

Exactly.

Privacy is supposed to be the individual's right to decide about their own data.

But we've transferred that right wholesale to corporations, giving them this immense unchecked power.

The analogy in the text is to the industrial revolution.

Yeah, that data is the pollution problem of the information age.

Privacy is our environmental challenge.

We need a culture shift, just like we needed factory regulations and an end to child labor.

We have to purge data we don't need and build systems that earn trust.

So let's try to recap this whole journey.

The technical future seems to be this unbundled, log-based architecture for data integration.

It is.

It gives you scalability while preserving data integrity, using idempotence and asynchronous flow instead of brittle distributed transactions.

And we learned correctness has to be an end-to-end concern, handled with unique IDs and compensating transactions, which lets you build these incredibly fault-tolerant, evolvable systems.

But that power comes with massive responsibility.

We have to actively fight against algorithmic bias and the normalization of surveillance and ensure that what we build actually serves humanity.

Which leaves us with a final provocative thought for you to take away from this deep dive.

The data systems we are engineering today are literally building the and social landscape of tomorrow.

So the critical question is never just can we collect this data, but should we?

Especially given the risk that one day that data could become a toxic asset, a tool for control in the hands of a future government or an unscrupulous leader.

A vital thought to end on for every single person building these systems.

Thank you for joining us.

We'll see you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers
Data systems that achieve reliability, correctness, and long-term evolvability require architectural approaches fundamentally different from traditional monolithic database designs. The central challenge in modern application development is data integration, since no single tool optimally handles all data access patterns, forcing practitioners to combine specialized systems like transactional databases with dedicated search indexes and caches. Log-based derived data architecture addresses this fragmentation by establishing a system of record where primary writes flow through an ordered, immutable stream of events generated via Change Data Capture or event sourcing, with all secondary views—including materialized views, indexes, and caches—constructed asynchronously from this authoritative event log. This approach enables database unbundling, decomposing monolithic databases into loosely coupled specialized services that communicate through event streams, improving fault isolation and eliminating the scalability constraints of synchronous distributed transactions. The integration of batch processing and stream processing represents a critical evolution in data system design; whereas the traditional Lambda Architecture ran parallel batch and streaming pipelines, modern unified engines combine both capabilities to support flexible schema migrations and application evolution while maintaining low-latency updates. The "database inside-out" philosophy inverts conventional thinking by treating applications as deterministic derivation functions that respond to state changes, extending immutable event streams all the way to stateful client devices, enabling offline-capable systems with complete end-to-end event provenance. Strong integrity guarantees emerge through exactly-once processing semantics and idempotence mechanisms using request identifiers, allowing stream systems to achieve data correctness at scale without sacrificing performance for synchronous coordination. 
The end-to-end argument emphasizes that fault tolerance must be primarily implemented at the application layer rather than relying solely on low-level infrastructure components. Beyond technical correctness, data systems design carries profound ethical dimensions: predictive analytics systems risk embedding and amplifying systemic bias through their training data, and widespread corporate data collection represents a form of surveillance capitalism that transfers individual data sovereignty to corporations. Addressing these challenges requires engineering cultures committed to continuous auditing, integrity verification, and transparency—recognizing that hardware faults and software bugs will inevitably occur, and that trust must be earned through rigorous verification practices rather than assumed.
