Chapter 10: Batch Processing & Data Pipelines

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

So when we think about large scale data systems, I think our mind naturally jumps to that immediate online experience.

Right.

You send a request, you wait a few milliseconds, you get a response back.

We call those services.

Yep, online services.

But behind every major company, there's just this tremendous amount of work being done that, well, it doesn't need that immediate user response at all.

And that work is absolutely foundational.

I mean, if you only focus on the online stuff, you're missing two other critical categories of data processing.

OK, so let's break that down.

We usually see three types.

First, like you said, are the services or the online systems.

They're driven by client requests, and we measure them by response time and high availability.

Makes sense.

Second, you've got stream processing, which is more near real time.

It operates on events very shortly after they happen.

Think latencies of seconds, maybe less.

OK.

But the third category, and this is what we're really digging into today, is batch processing.

Offline processing.

Batch processing.

This is the real heavy lifting, isn't it?

It is.

It's defined by taking these massive input data sets, running jobs that might last for minutes or hours, I mean, even days.

Right, and it produces a complete new output data set at the end.

And since there's no user just staring at a loading screen, we don't care about response time.

The metric we care about is throughput.

Exactly.

How fast can the system just crunch through petabytes of data from the very start to the very finish?

So our mission for this deep dive is to give you that shortcut to understanding distributed batch processing.

We're going to focus on the historical roots, the foundational ideas that came out of MapReduce, to really get the core logic that defines these massive scheduled jobs and why they're still so essential, even with all the new tech out there.

And it's funny, because batch processing feels like this incredibly old idea.

It is.

It's basically pre -digital computing, right?

Like the Hollerith punch cards for the 1890 US census.

That was mechanical batch processing.

A perfect example.

And the principles are perfectly captured by something that engineers still use every single day.

The standard Unix command line pipeline.

Ah, yes.

The original beautiful example of batch processing elegance.

It really is.

I mean, imagine you've got this massive web server log file, maybe terabytes large, and you just want to find the five most popular pages.

Right.

You could go write some big custom program, sure.

But the Unix way is just it's simpler, and a lot of the time it's more robust.

You'd just string a few commands together with pipes.

You would.

You'd cat the log file, pipe it into awk to pull out just the URL from each line.

Right, just the seventh column or whatever it is.

Then you pipe that whole stream of URLs to a sort command.

It alphabetically sorts the entire list.

OK, now everything is sorted.

So you pipe it to Unique Nice C, which just counts all the adjacent identical lines.

And boom, you have your view counts for every URL.

And then a final sort to get the top ones.

A final sort, naashin, to put them in numerical reverse order.

And then head naan5 just grabs your top five.

Simple.

That simplicity, just linking these tiny specialized tools,

it perfectly illustrates a huge architectural trade -off that really plagues distributed systems.

It does.

It's the whole sorting versus in -memory aggregation process.

Just precisely.

A custom program might try to build, say, an in -memory hash table to count the URLs.

A simple frequency map.

And that's blindingly fast.

But it only works if the number of distinct URLs is small enough to fit in your RAM.

Right.

If you have 100 million unique URLs, you're out of luck.

Your process will crash or just thrash the disk trying to swap.

But the Unix sort utility is built for this.

It can handle data sets much, much larger than memory.

How does it do that?

It avoids the RAM problem completely.

It just divides the data into small, manageable chunks, sorts each chunk in memory, and then writes those sorted chunks sequentially to disk.

And then it merges them back together.

And then it merges them.

The key is that it's leveraging sequential disk I -O, which is way more efficient than random access.

And that's how a single machine can crunch through gigabytes of data with no memory issues.

And the scalability, it's all rooted in the core tenets of the Unix philosophy.

It is.

There are, what, four big ones.

One,

make each program do one thing and do it well.

The sort tool only sorts.

It doesn't count.

It doesn't filter.

It just sorts.

Two,

expect the output of one program to become the input of another.

That's composability.

Three is the uniform interface, right?

Files and pipes are just sequences of bytes.

Usually ASCII text.

And the fourth one, this is the crucial one for architects, the separation of logic and wiring.

Meaning the program just uses standard input and standard output.

Exactly.

It doesn't care where the input comes from.

A file, another program, the network, it doesn't matter.

The user, through the shell, gets to wire everything up dynamically.

That loose coupling is what we're always trying to replicate in these huge distributed frameworks.

Always.

But here's the wall you hit.

Unix tools run on a single machine.

Yep.

To handle data that's scaling up to petabytes, you need a distributed version of that simple command pipeline.

And that brings us to MapReduce.

The architecture Google popularized back in the mid -2000s designed specifically to run across thousands

of cheap commodity machines.

To do that, the first thing you need is a distributed file system.

Hadoop's version is HDFS, the Hadoop distributed file system.

And it uses a shared nothing architecture.

Right.

Just commodity machines, each with its own CPU, memory, and disk, all connected by a normal network.

Nothing special.

So how does that work?

How does it create one big file system out of all those separate machines?

Well, a critical piece is the name node.

You can think of it like the central directory.

It tracks where all the different file blocks are stored across all the machines, which are called data nodes.

And for fault tolerance, if a machine goes down.

The files are broken into blocks.

And each block is replicated, usually across three or more different machines.

Some newer systems use something called a razor coding, which is kind of like RAID, but over a network.

It gives you the same fault tolerance, but with less storage overhead.

OK, so with that distributed fault tolerance storage in place, we can finally run a MapReduce job.

Let's walk through the conceptual steps using our log analysis example.

First, the framework reads the input files from HDFS and breaks them into records.

So one line from the log file is one record.

Exactly.

Step two is the mapper function.

The mapper takes each record one at a time, processes it, and spits out a key and a value.

For us, the URL would be the key.

And the value could just be an empty placeholder, or the number one.

And mappers are stateless, which is critical.

They can run independently in parallel on any machine in the cluster.

Then comes step three.

And this is where the distributed magic and the complexity comes in.

All those key value pairs from thousands of mappers are implicitly sorted by key across the entire cluster, a global sort.

And that global sort enables the final step, the reducer function.

Because the data is sorted, every single request for the same URL, no matter which machine originally processed it, is now grouped together and delivered to the same reducer.

So the reducer just iterates over those groups and combines them.

It's the distributed version of our unique niches command.

Exactly.

And that whole process that connects the distributed mappers to the distributed reducers is called the shuffle.

The shuffle.

The framework is smart.

It tries to run the map tasks near the data they need to process.

That's the principle of computation near the data.

You move the code to the data, not the data to the code.

You got it.

And when a mapper finishes, it partitions its output based on a hash of the key.

That hash determines which reducer partition will get that piece of data.

So all keys that hash to the same value go to the same reducer.

Correct.

The mapper writes these sorted partitions to its own local disk.

Then the reducers, who know which partitions they're responsible for, they go and actively fetch.

They download those sorted files from every single mapper.

Wow, okay, so that's a lot of network traffic.

A ton.

They then merge all those files they fetched, making sure to maintain the sort order.

That entire dance partitioning, local sorting, network copying, merging, that is the shuffle.

And because one mapper produced job is often pretty limited, they're usually chained together.

Right, the output directory of job A, which is just a folder in HDFS, becomes the input directory for job B.

Which means you need something to manage all that?

Yep, you need external workflow schedulers.

Things like Uzi or Apache Airflow to manage the complex dependencies and make sure that if one job fails, the downstream stuff doesn't start by mistake.

So one of the most common things you'd use this for is a join operation, right?

Oh, absolutely, one of the most crucial.

We rarely work with isolated data.

You need to correlate records.

You need to join user activity events with their user profiles.

And since MapReduce doesn't really have the indexes you'd find in a normal database, a join often means you have to read all the data.

A full table scan, which is why you need distributed processing to do it efficiently.

So what's the most flexible way to do that join?

That would be the reduce side sort merge join.

You put both data sets, the activity events and the profiles into HDFS.

Mappers go through both.

And for each record, they pull out the common field, the user ID and use that as the key.

And then the shuffle does its magic.

The shuffle is the genius part.

It guarantees that all records for one user ID, the user's profile and all of their activity events end up on the exact same reducer.

Okay, that's powerful.

You can even use a technique called secondary sort to control the order within that key group.

So you can make sure the reducer always sees the small user profile record first and then all the bigger activity events in chronological order.

So the reducer can be really efficient, only holding one user profile in memory at a time.

Exactly.

But this pattern, this bring all the data together pattern, it has a major vulnerability.

Let me guess, skew.

Skew, you got it.

If you have a few hot keys, say all the activity related to a global celebrity or a massive bot account, the single reducer assigned to that key just gets overwhelmed.

It becomes a straggler and slows down the entire cluster.

So how do you deal with that?

You have to implement some compensation logic.

You might sample the data first to find the hot keys.

Then instead of deterministically sending all records for that hot key to one reducer, you, you randomly send them to one of several reducers.

Spreading the load.

Spreading the load.

But it means you now have to replicate the other side of the join the celebrities profile data to all of those random reducers.

So it adds complexity.

But if you can make some assumptions about your data, you can skip the whole expensive sorting and shuffle process with something called a map side join.

Yes.

And the fastest kind is the broadcast hash join.

Which is when?

It's when one of your data sets like the user profiles is small enough to fit entirely into the memory of every single machine.

So you just load it everywhere.

You literally broadcast it.

The framework loads that small data set into a local hash table on every single mapper.

Then the mappers just stream through the huge activity log and do the join lookup locally right there in memory.

No network transfer, no sorting,

maximum speed.

Which brings us away from just the mechanics and more toward the philosophy of the output.

I mean, what does this all mean for operations?

The output of these batch jobs, it's often not just a report.

It's a fully structured new thing.

It could be a complete search index or the entire file structure for a key value store.

And this is where we get to the core philosophy that makes these systems so robust.

Batch jobs should not write directly into a live production database.

No, never.

That just causes performance problems, consistency nightmares.

The better way is to build the entire database file immutably as the jobs output.

Write it to HDFS and then just load it in bulk onto your read -only query servers.

And that concept of immutability, your inputs are unchanged and your output completely replaces the old output.

That's the source of this incredible human fault tolerance.

Meaning if an engineer pushes buggy code that creates bad data, they don't have to scramble to fix a corrupted live database.

They just roll back the code and rerun the job.

The bad output is completely replaced by the good output.

Guaranteed correctness.

And the framework's ability to automatically retry failed tasks safely.

That depends entirely on this immutable output principle.

This open, flexible approach is really what separated Hadoop from the older monolithic MPP databases.

Massively parallel processing.

Yeah.

Yeah.

Those systems required you to do all this meticulous, upfront data modeling.

A schema -on -write philosophy.

Whereas Hadoop basically let you just dump raw data into HDFS.

Creating what we now call a data leak.

You worry about the schema later at query time.

That's schema on read.

It's often called the sushi principle.

Raw data is better.

You can process it however you want later.

But MapReduce, for all its resilience, it has one major performance drag.

Oh, yeah.

The materialization of intermediate state.

It forces you to write results to HDFS between every map and reduced stage.

Which means huge IO and replication latency.

It's like writing a temporary file between every single Unix command instead of just using a pipe.

And that inefficiency really drove the next wave of systems.

It did.

The data flow engines.

Things like Apache Spark, Tez, and Flink.

They moved beyond that rigid MapShuffleReduce pattern.

Instead, they treat the whole workflow as one single job.

And they modeled it as a graph.

They directed a cyclic graph, or DFS.

A DAG of operators, yeah.

And the advantages are huge.

They can avoid unnecessary sorting.

They keep intermediate state in memory or on local disk, which drastically cuts down on that expensive HDFS IO.

Plus they allow for pipeline execution.

Right, so operators can start processing data as soon as the previous operator produces it instead of waiting for the entire stage to finish.

How do they handle fault tolerance without writing everything down?

They rely less on materialization.

If a node fails, the framework doesn't have to start from the very beginning.

The lost data is just recomputed from the nearest available prior stage.

And that's only safe if all the operators are deterministic.

Exactly.

They have to produce the exact same output for the same input every single time.

It's also probably worth mentioning that for certain algorithms, like PageRank, standard MapReduce is just awful.

It's terrible for iterative algorithms.

You'd have to read and rewrite the entire graph state on every single iteration, just massive, massive IO.

So new models emerge for that.

For those graph problems, yeah, you got models like Pregel or Bulk Synchronous Parallel, BSP.

They're implemented in systems like Giraffe.

In that model, the vertices, they remember their state between iterations and they communicate by passing messages only along the edges.

It's way more efficient for graph traversal.

And the final big evolution here is the move toward high -level APIs.

Tools like Hive or Spark DataFrames.

By letting engineers specify their jobs declaratively, often with SQL -like queries,

you empower the framework to use a cost -based query optimizer.

And that optimizer is smart.

It analyzes your query, it analyzes your data, and it automatically picks the best join algorithm, broadcast hash, sort, merge, whatever.

And it can use modern CPU tricks, like vectorized execution.

Which is where the CPU processes records in tight batches instead of one by one, which is just great for CPU cache performance.

It lets these flexible systems get performance that's comparable to those specialized MPP databases, but on commodity hardware.

So to wrap this all up, distributed batch processing, at its heart, it solves two massive problems that are essential for any data -intensive system.

First is partitioning, the sophisticated way of bringing all the related data together in one place.

And second is fault tolerance, that guarantee of correct immutable output, even when individual tasks fail over and over again.

And the essential characteristic that defines all of this that makes it batch is that it deals with bounded input data.

A fixed set of logs,

a database snapshot from a point in time.

It means the job eventually finishes, and it produces a final complete output.

But what happens when the input is unbounded?

If the data never ever stops flowing, the job can never finish.

And if the job never finishes, the entire architecture of the system has to change.

And that, that's the conceptual hurdle that leads us directly into the world of stream processing.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Batch processing represents an offline computational approach optimized for high throughput when working with large, bounded datasets, fundamentally different from low-latency online services and near-real-time stream processing. The conceptual foundation traces back to Unix philosophy, where simple, composable tools such as awk and sort interact through a uniform interface of pipes and files, enabling powerful analysis of datasets exceeding available memory by efficiently spilling data to disk through sorting. This principle scales to MapReduce, a distributed batch framework operating on filesystems like HDFS using shared-nothing architecture to achieve massive parallelism. MapReduce execution proceeds through stages where mappers transform input records into key-value pairs that undergo partitioning, sorting, and shuffling before reaching reducers, which aggregate all values sharing a single key. Common data manipulation tasks including joins and sessionization typically employ reduce-side joins such as sort-merge joins, which exploit the framework's sorting to group related records by key and perform computation locally while minimizing expensive external network requests. Data skew and hot keys—situations where certain keys receive disproportionate volumes—create performance bottlenecks that specialized algorithms address through randomized reducer assignment. Faster alternatives like map-side joins, including broadcast hash joins for smaller datasets and partitioned hash joins for pre-organized data, eliminate the shuffle and sort overhead when data structure permits. Batch systems embrace immutability following Unix conventions: inputs remain unmodified while outputs are completely replaced, a design choice enabling human fault tolerance and safe automatic task retries. Although massively parallel processing databases preceded MapReduce historically, MapReduce achieved prominence through flexibility in handling diverse data formats via schema-on-read and accommodating heterogeneous workloads without upfront schema constraints. Its architecture specifically addressed environments with high task termination rates by prioritizing task-level recovery. Modern dataflow engines including Spark, Tez, and Flink treat entire workflows as unified jobs, reducing intermediate materialization through directed acyclic graphs and techniques like resilient distributed datasets or state checkpointing. Graph-specific problems leverage iterative computation models such as Pregel and bulk synchronous parallel processing, where vertices maintain state across message-passing rounds. High-level declarative interfaces including Hive and Spark SQL introduce query optimization and vectorized execution, merging general-purpose framework flexibility with database-like performance.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥