Chapter 4: Encoding, Schemas & Data Evolution


A core architectural requirement during application updates, particularly rolling upgrades where different code versions coexist, is maintaining both backward compatibility (newer code reading data written by older code) and forward compatibility (older code reading data written by newer code). Meeting it requires translating complex, CPU-optimized in-memory data structures into a self-contained sequence of bytes for network transfer or storage, a process known as encoding (also serialization or marshalling).

Language-specific encoding mechanisms (e.g., Java's built-in serialization) are generally discouraged due to security vulnerabilities, inefficiency, and poor support for schema evolution. Standard textual interchange formats like JSON, XML, and CSV are language-independent and popular, but they introduce ambiguity around datatypes: JSON does not distinguish integers from floating-point numbers, so integers greater than 2^53 lose precision in parsers that decode numbers as IEEE 754 doubles, and binary strings require workarounds such as Base64 encoding.

For superior efficiency and strict compatibility semantics, schema-driven binary formats like Apache Thrift, Protocol Buffers, and Apache Avro are used instead. Thrift and Protocol Buffers identify each data field by a unique numeric field tag defined in the schema, which makes it possible to change field names but not field tags without invalidating old data; backward compatibility requires that any newly added field be marked optional or carry a default value. Avro, in contrast, identifies fields by name and resolves compatibility by comparing the writer's schema (used for encoding) against the reader's schema (used for decoding), both of which must be present during parsing. This yields the most compact encoding of the three and is particularly friendly to dynamically generated schemas.
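The 2^53 boundary can be demonstrated directly. A minimal Python sketch (the same rounding happens in any JSON parser that decodes numbers as IEEE 754 doubles, as JavaScript's JSON.parse does):

```python
# IEEE 754 doubles have a 53-bit significand, so not every integer
# above 2**53 is exactly representable. A parser that decodes JSON
# numbers as doubles silently rounds such integers.
n = 2**53 + 1            # 9007199254740993
as_double = float(n)     # rounds down to 9007199254740992.0
print(int(as_double) == n)   # False: the +1 was lost
```

This is why APIs that must carry large identifiers (e.g., 64-bit database keys) often transmit them as JSON strings rather than numbers.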
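The role of field tags in Thrift and Protocol Buffers can be illustrated with a hand-rolled sketch of Protocol Buffers' varint wire encoding (a simplified illustration, not a full implementation of the format):

```python
def encode_varint(n):
    # Protocol Buffers varint: 7 bits of payload per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_field(tag, value):
    # Each field is keyed by (field_tag << 3) | wire_type; wire type 0
    # means varint. The field *name* never appears on the wire, which is
    # why renaming a field is safe but renumbering it corrupts old data.
    return encode_varint((tag << 3) | 0) + encode_varint(value)

record = encode_field(1, 1337) + encode_field(2, 42)
# A decoder maps the tags back to whatever names its schema currently uses.
```

Because only tags travel over the network, a new field with a fresh tag is simply skipped by old readers (forward compatibility), and new readers fill in the default for fields absent from old data (backward compatibility, provided the field is optional or has a default).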
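Avro's resolution by field name can be sketched as a toy helper (a hypothetical `resolve` function for illustration, not the real avro library's API):

```python
def resolve(writer_fields, reader_fields, values):
    # writer_fields: field names in the order the writer encoded them.
    # reader_fields: (name, default) pairs from the reader's schema.
    # Fields known only to the writer are ignored; fields known only
    # to the reader take their default value.
    decoded = dict(zip(writer_fields, values))
    return {name: decoded.get(name, default)
            for name, default in reader_fields}

writer = ["userName", "favoriteNumber"]
reader = [("userName", None), ("favoriteColor", "red")]
print(resolve(writer, reader, ["Alice", 7]))
# {'userName': 'Alice', 'favoriteColor': 'red'}
```

Since matching is by name, the writer's and reader's schemas only need to be compatible, not identical, which is what makes Avro well suited to schemas generated dynamically from, say, a database's column names.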
Finally, the chapter examines three modes of dataflow. In dataflow through databases, the reality that data outlives code necessitates robust compatibility and the preservation of unknown fields during read-update-write cycles. In dataflow through services, whether using REST (a design philosophy built on HTTP) or specialized RPC (Remote Procedure Call) frameworks, the fundamental differences between local function calls and network calls must be accounted for. In message-passing dataflow, message brokers or distributed actor frameworks provide asynchronous communication and decoupling between sender and recipient.
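The database hazard above can be made concrete with a small sketch (hypothetical record and field names, using plain dicts to stand in for a decoded record):

```python
# A newer code version added photo_url; older code doesn't know about it.
stored = {"name": "Alice", "level": 3,
          "photo_url": "http://example.com/alice.jpg"}

known_fields = {"name", "level"}   # what the older code's model includes
model = {k: v for k, v in stored.items() if k in known_fields}
model["level"] = 4                 # the actual update

naive_write = model                # writes back only known fields:
                                   # photo_url is silently lost!
safe_write = {**stored, **model}   # merge update into the original record,
                                   # preserving fields this code doesn't model
print("photo_url" in safe_write)   # True
```

The safe pattern corresponds to decoders and ORMs that carry unknown fields through a read-update-write cycle instead of dropping everything outside the current object model.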