Chapter 4: Encoding, Schemas & Data Evolution

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are cracking open one of the most foundational and I

evolve ability.

We're jumping into the nuts and bolts of data encoding and asking a really simple but profound question.

How do we build systems that are easy to change?

Especially when the only constant is that everything is changing.

It's the essential friction point in any system design.

As your application's features change, the data structures, the formats you use, they have to change too.

But you can't just stop the world, upgrade everything all at once, and then restart.

That means downtime, which is a non -starter.

Exactly.

That leads to this premise that in any large robust system, old code and new code are going to exist side by side running at the same time.

And that coexistence is usually managed through something called a rolling upgrade, right?

That's the one.

You deploy the new version of your service incrementally.

A few nodes at a time, then the next batch and so on.

During that whole period, which could be minutes or even hours, you have processes running version A talking to processes running version B.

And that forces us to have rules, compatibility rules.

If a server running the shiny new version B writes some data, a server running the older version A has to be able to read it without crashing.

And that requires two forms of agreement, really.

The first one is backward compatibility.

The newer code has to be able to read data written by the older code.

That one always seems easier to me.

It generally is because when you're writing version B, you know exactly what version A's data looks like.

You can account for it.

But the real architectural headache is the other way around.

Forward compatibility.

Yes.

Can the older code read data written by the newer code?

If your version B adds a brand new field, version A has to be able to just safely ignore it, keep processing, and not corrupt anything.

So that's our mission for this Deep Dive, to break down how some of these dominant coding formats from the ubiquitous JSON to these highly engineered binary systems like protocol buffers, thrift, and Avro handle these demands.

Right.

Across both data storage and just sending messages over a network.

Okay.

Let's unpack this with the fundamentals.

Programs handle data in two very different ways.

First, you've got the data inside the running program, which is optimized for the CPU.

Yeah.

Things like objects, structs, using pointers, lists, trees, all the complex in -memory stuff.

And second, you have the data that you want to write to disk or send over the network, which has to be just simple contiguous sequence of bytes.

You can't send a memory pointer over the network.

It's meaningless on another machine.

So this conversion from that in -memory structure to bytes is called encoding.

You'll also hear it called serialization.

And the reverse process, turning those bytes back into useful in -memory structures, is decoding.

Or deserialization.

Right.

Now, speaking of decoding, many programming languages have these built -in formats like Java's Serializable or Python's Pickle.

They're super convenient.

They are, but the source material is deeply critical of them and for very good reason.

Security.

Massive security problems.

They're tied to a single language, which is bad enough, but the security flaws can be catastrophic.

When you decode an arbitrary byte sequence with these tools, you are basically trusting that the sender isn't trying to instantiate malicious classes in your environment.

That's the gadget chain attack, right?

Where an attacker crafts a payload that executes code on your machine just by being deserialized.

Exactly.

You've essentially given them the keys to the kingdom.

Plus, they handle versioning terribly and are often, well,

notoriously inefficient.

So any serious system moves beyond them pretty quickly.

Which brings us to the big language -independent textual standards.

JSON, XML, and CSV.

The internet's lingua franca, really.

And they are for a good reason.

Being human -readable is a massive advantage for debugging.

But that text -based nature introduces some subtle and frustrating flaws, especially with data types.

Oh, absolutely.

The biggest one is ambiguity around numbers.

So this is where it gets wild.

If you're using JSON, you can't really tell the difference between the string 42 and the number 42 without some extra logic, but it gets worse with big integers.

Like IDs.

Right.

Think about Twitter's tweet IDs.

They're 64 -bit integers.

That's a number larger than 2 to the power of 53.

And why does 2 to the 53 matter?

Because JavaScript, which is what runs in every browser and processes most of this JSON,

represents all numbers as floating point values.

They only have 53 bits of precision.

Any integer bigger than that gets rounded.

It becomes inaccurate.

So your unique ID is no longer unique.

Exactly.

So to get around this, the Twitter API has to do this crazy hack.

They include the 64 -bit ID twice in the JSON payload, once as the potentially inaccurate number, and a second time as a precise string.

And the client has to know to ignore the number and parse the string.

Right.

It's a mess.

It signals a fundamental weakness in the format itself.

And we see the same kind of problem with binary data.

If you need to embed, say, a little photo thumbnail,

JSON and XML can't do it natively.

So you're forced into using Base64 encoding.

Which is just so inefficient.

You take three bytes of your actual data, and it becomes four characters of text.

That's a 33 % size increase just to cram bytes into a text format.

Because of all that verbosity, we started seeing these binary variants pop up, right?

Things like message pack or BSON.

Yeah.

There are attempts to make JSON more compact while keeping that sort of schemels feel.

But the games are marginal at best.

The example in the source material shows a record that's 81 bytes in plain JSON.

Message pack only gets it down to 66 bytes.

And for that small gain, you give up the single biggest advantage of JSON, which is human readability.

It just doesn't seem worth it.

So the conclusion seems inescapable.

If you want real performance, real compactness, and those strong compatibility guarantees we talked about, you have to use a schema.

You have to commit to an explicit schema, yes.

Which brings us to the heavy hitters.

The schema -driven binary encodings.

We're talking about thrift from Facebook and protocol buffers or protobuf from Google.

And with these, you define your data structures up front, using what's called an interface definition language or IDL.

Then you run a tool that generates code for you in whatever language you need.

And the efficiency jump is massive.

That same 81 -byte JSON record.

It's just 33 or 34 bytes with the compact versions of protobuf or thrift.

That's a huge difference.

And they achieve that incredible compactness by getting rid of the field names in the encoded data entirely.

Wait, okay.

If they omit verbose field names, like username, how does the decoder know which value corresponds to which field?

That is the core innovation.

They use small integer field tags.

These are just numbers one, two, three that you assign to each field in your IDL schema.

When the data is encoded, it's just the tag and the value, no names.

Those tags are the key to compatibility, right?

Once I assign tag number five to, say, email address, that tag is bound to that field forever.

Absolutely.

The tags are immutable identifiers.

You can delete a field, but you can never ever reuse its tag number because old data might still be floating around using it.

Okay.

So let's walk through forward compatibility.

My old code is running.

It gets a message from the new code and it sees a tag it's never seen before, like tag number 12.

How does it safely skip that field without getting totally lost?

This is the really clever part of the design.

When the data is serialized, the encoder writes not just the tag number, but also a little hint called a wire type next to it.

That wire type tells the parser how the data is structured.

Is it a fixed four -byte number or is it a variable length string?

Ah, so even if it doesn't know what tag 12 is, it knows from the wire type how many bytes to jump forward to get to the start of the next field.

Precisely.

It can safely and efficiently skip the unknown data.

That guarantees your forward compatibility.

And for backward compatibility, the main role is that any new field you add has to be optional or have a default value.

Right.

Because if you add a required field and the new code tries to read old data that doesn't have it, the parse will fail.

And your whole rolling upgrade is broken.

Now, this is where Avro comes in.

It came out of the Hadoop world and it's even more compact.

It gets our sample record down to just 32 bytes.

But it does it by throwing out the whole idea of tags.

Avro is fundamentally different.

There are no tag numbers.

The values are just concatenated one after another in the binary stream, which means the raw data is completely unintelligible on its own.

So if there are no hints in the data itself, how on earth does it handle schema evolution?

It does it by strictly separating the writer schema from the reader schema.

The whole idea is that these two schemas don't have to be identical.

They just have to be compatible.

And the Avro library is responsible for resolving the differences.

How does it do that?

It matches fields by name.

So let's say a new writer adds a field called Z.

An old reader gets that data.

The Avro library compares the writer schema to the reader schema.

It sees the reader doesn't know about a field named Z, so it just ignores that value.

Forward compatibility.

Okay.

And the other way.

A new reader gets old data that's missing field Z.

If field Z was defined in the reader schema with a default value, the Avro library sees that the field is missing from the incoming data and it just injects the default value for you.

Backward compatibility.

So you still need default values.

That seems to be a common theme.

It's the key to safe evolution.

But Avro's approach using names instead of tags gives a really unique advantage, especially for, say, data warehousing.

How so?

It's incredibly friendly to dynamically generated schemas.

Imagine you're pulling data from a database.

You can generate an Avro schema directly from the SQL table definition.

If someone adds a new column to the database tomorrow, Avro just handles it because it's matching by column name.

With protobuf or thrift, you'd have to go manually assign a new unique integer tag.

It's much more brittle.

Okay.

So to recap the schema -driven formats, they give you better documentation, compile time safety from the generated code, and really clear, robust guarantees about compatibility.

Exactly.

Now let's look at the three main ways this data actually flows through a system and where these rules become absolutely critical.

First up is data flow through databases.

This is where data lives the longest.

And this is totally governed by that principle that data outlives code.

Data you write today might need to be read five years from now in its original encoding.

Migrations are painful, so you avoid them.

So you obviously need backward compatibility, but you also need forward compatibility, right?

During that rolling upgrade, an old version of your app might have to read a row that a newly upgraded instance just wrote.

Yes.

And there's a really subtle danger here.

It's called the read, modify, write hazard.

Okay.

What's that?

Imagine your old code reads a record written by the new code.

It has a field, Z, the old code doesn't know about.

Because of forward compatibility, it ignores Z, no problem.

But then it modifies some other field, field A, and writes the entire record back to the database.

If the old code isn't designed carefully, it will write back only the fields it knows about A, B, C, and it will have silently deleted field Z.

Oh, wow.

Data loss.

Exactly.

So any robust system has to make sure that old code preserves unknown fields when it does a rewrite.

Okay.

Next up is data flowing through services.

This is the microservices world rest APIs, RPC calls.

And the big architectural debate here really centers on the remote procedure call, or RPC.

The source material argues its fundamental flaw is trying to create location transparency.

The illusion that calling a function on a remote server is just like calling a local function in your own code.

It's a leaky abstraction, because the network is nothing like a local function call.

The network call is completely unpredictable.

The request could get lost.

The response could get lost.

The server could crash halfway through.

And if you get time out, you have no idea what happened.

Did the request even make it?

Did it work, but the reply got lost?

It's totally ambiguous.

So you have to build all this extra logic for retries, for idempotence to avoid doing the same action twice, and you still have all the encoding problems we talked about, like that 2 to the 53 number issue.

It seems like for services, the compatibility needs are a little simpler, though.

You usually update all your servers before you update all your clients.

That's true.

So you mostly need backward compatibility on requests.

So old clients can talk to new servers and forward compatibility on responses.

So new servers can send data back that old clients can still parse.

And our final mode of data flow is asynchronous message passing through a message broker like Kafka or RabbitMQ.

Here, you're decoupling the sender from the recipient.

It acts as a buffer.

It handles re -delivery if a consumer crashes.

It's a very powerful pattern.

And since messages can sit in that queue for a long time, and the consumers and producers are all going through their own rolling upgrades, both forward and backward compatibility are, again, absolutely critical.

It all comes back to the same principles.

To summarize, the choice you make for your encoding format.

It's not just about speed or size.

It's a direct statement about your system's ability to evolve and maintain service during updates.

And you get that compatibility through meticulous use of metadata, either those integer field tags in protobuf or the name -based schema resolution in Avro.

The core lesson, really, across all these patterns, is that your data is going to outlive your current version of the code.

So prioritizing forward and backward compatibility is probably the single most important thing you can do to enable frequent low -risk deployments.

Which brings us to a final thought for you to chew on.

Considering that fundamental mismatch between a local function call and a network request and all these complex compatibility rules, how should that influence your choice for a new API?

Do you lean toward the simplicity and tooling of REST and JSON?

Or is the raw efficiency of a binary RPC system like gRPC worth the added complexity for long -term evolvability?

Something to ponder.

That's all the time we have for this steep dive.

We'll catch you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Handling change in distributed systems requires deliberate strategies for managing data format evolution while maintaining system stability across concurrent code versions. Rolling upgrades—where old and new application versions operate simultaneously—demand bidirectional compatibility: newer code must interpret data written by older versions, while older code must gracefully handle data from newer versions. This fundamental challenge centers on encoding, the process of transforming in-memory data structures into byte sequences suitable for network transmission or persistent storage. Language-specific serialization mechanisms carry significant drawbacks including security risks, poor cross-platform support, and inflexibility around schema changes, making them unsuitable for production distributed systems. Human-readable formats like JSON, XML, and CSV offer language independence and transparency but suffer from type ambiguity—integers exceeding 2 to the 53rd power lose precision when stored as floating-point numbers, and binary data requires encoding workarounds such as Base64. Schema-driven binary formats overcome these limitations through compact representation and strict type semantics. Thrift and Protocol Buffers employ numeric field tags for identification, allowing field names to change without affecting compatibility; new fields must be optional or carry default values to maintain backward compatibility. Avro takes a different approach by using field names for matching and requiring both the writer's schema (for encoding) and reader's schema (for decoding) to be available during parsing, making it particularly effective for dynamically generated schemas. Data flows through systems via three primary pathways: databases, where data persistence outlasts code versions and necessitates careful handling of unknown fields during read-modify-write cycles; services communicating through REST or specialized RPC frameworks, which must account for fundamental differences between local function calls and network requests; and message-passing systems using brokers or distributed actor frameworks that enable asynchronous, decoupled communication patterns. Understanding these mechanisms and dataflow patterns is essential for designing systems that evolve gracefully without disrupting operations or losing information.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥