Chapter 6: Ad Click Event Aggregation Design

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive, the shortcut to being instantly well -informed about the systems that power the digital world.

Today, we're taking a look inside the massive machinery required to make digital advertising work.

Specifically, how you would design an ad -click event aggregation system at the scale of a Facebook or a Google.

It's a really critical topic.

You, the listener, you're probably familiar with the front end of digital ads, the real -time bidding or RTB process.

Right.

But the back end, this aggregation system we're designing today is what makes sure every single dollar spent is accounted for.

And that's what we'll be focusing on.

Okay, so let's start right there with the context.

What is the core difference between the speed needs of RTB and our click aggregation system?

Well, RTB is that lightning fast auction.

You have an advertiser, they go through a demand -side platform or a DSP that talks to an ad exchange, which hits a supply -side platform, SSP, and then finally reaches the publisher.

And that whole chain has to happen.

In milliseconds.

Yeah.

It's all about placement speed, getting the ad on the page.

And that speed is what lets the ad show up, but once the ad is there and a user clicks, that's where our system finally kicks in.

Exactly.

Our system is focused on measurement and more importantly, accuracy, not milliseconds.

We need to aggregate those clicks to calculate all the vital metrics like click -through rate, CTR, conversion rate, and crucially to control campaign budgets and handle the final billing.

So if our data is off, even by a little bit, the financial impact could be immediate and pretty massive.

Huge.

Okay, let's unpack the constraints then.

Before we even start drawing boxes, what's the sheer volume of data we have to ingest every single day?

We have to plan for just a staggering scale.

The input comes from log files spread across thousands of servers.

Each click event is a log entry.

It's got attributes like the added, the precise click timestamp, user at IP, country, if we need to get the idea.

We're required to handle one billion clicks a day, per day, across about two million active ads.

And we have to factor in 30 % year over year growth.

It's always growing.

Let's do some quick back of the envelope math on that, because that volume is gonna dictate the entire architecture.

It really does.

If you take a billion events daily, you're looking at an average query rate of about 10 ,000 queries per second, QPS average.

But peep traffic, say during a major event, that can spike to 50 ,000 QPS.

And if each clicks about 0 .1 KB, that's 100 gigabytes of raw data daily, or around three terabytes a month.

So the takeaway is pretty clear.

We are designing a highly write -heavy system.

But the good news, unlike RTB, is that our latency requirement is a lot softer.

Yes, that is our crucial breathing room.

Since this is for billing and reporting, not for serving the ad in real time, a few minutes of end -to -end latency is totally acceptable.

And that margin lets us prioritize correctness.

Correctness and reliability over just raw speed.

Absolutely.

So functionally, what does the system actually have to do?

What are the key features?

We have three key requirements.

First, it has to aggregate the click count for any specific added over the last, say, M minutes.

So a continuous count.

A running total.

Second, we need a separate way to return the top 100 or top -endmost clicked ads every single minute.

A leaderboard.

Basically, yeah, a leaderboard.

And third, the system must support filtering on all of this by IP, by user ID, by country.

And of course, the non -functionals are non -negotiable.

Correctness for billing, robustness, and handling those tricky delayed or duplicate events.

Which brings us to the API.

What's the contract with the client,

the data scientist or PM looking at a dashboard?

The API surface is actually kept pretty clean and simple.

There are two main endpoints.

You can query for counts using getv1ads .addedaggregatedcount.

And you just specify the time window with from and to parameters.

And for that popular list, the leaderboard.

That's getv1adspopularads.

It returns the top n ads over a defined moving window, M.

And crucially, filtering is supported on both APIs just by adding query parameters.

Okay, now let's talk about the data model.

We need the raw data, the original logs for backup and ML training.

But we also need aggregated data for fast queries.

This brings up the star schema concept, which is a really critical trade -off.

Can you walk us through it?

The star schema is all about supporting fast filtering.

Let's say a user wants to filter by clicks from Germany.

Right.

If we tried to filter on the fly for every request across three terabytes of data, the query would just time out.

It'd be way too slow.

So what's the fix?

We introduce a filtered field.

We pre -calculate results for specific criteria.

These are our dimensions.

And we store them under a unique ID.

So for example, all clicks for added X in country Germany get aggregated and stored together.

Ah, so the trade -off is massive.

You're dramatically increasing the total number of records you store.

Dramatically.

Because we're storing the same click aggregated under maybe 10 different filter dimensions.

But the benefit is performance.

The user doesn't hit the raw data.

They hit a pre -computed count, making the query basically instantaneous.

That's a classic systems design trade -off favoring read speed over storage.

Precisely.

And that leads us right to our database choice.

With 50 ,000 peak writes per second, relational databases are,

well, they're just not gonna cut it.

So we have to go NoSQL.

Oh yeah.

We'd look at specialized time series databases like InfluxDB or wide column stores like Cassandra.

For this design, let's go with Cassandra.

It offers fantastic horizontal scaling, which is perfect for sharding our time series data.

Okay, let's structure the high level design.

Given the spiky traffic, we absolutely cannot afford synchronous processing.

We have to decouple everything.

Absolutely.

The whole architecture has to be asynchronous.

We use a message queue like Kafka to decouple producers and consumers.

If the aggregation service slows down, the queue just buffers the load.

No back pressure failure.

So let's visualize the flow for everyone listening.

We've got a log watcher feeding raw events into the first message queue.

Let's call it MQ1.

The data aggregation service reads from MQ1.

Now here's the critical bit you mentioned.

It doesn't write results directly to the database.

It writes them into a second message queue, MQ2.

Why is that second queue so vital?

That second queue, MQ2, is the key enabler for achieving end to end exactly once semantics.

No.

And that is mandatory because of the financial stakes.

Exactly once.

So every click is counted once and only once.

By routing the aggregated results through MQ2, we can implement an atomic commit.

The writer has to confirm the result was successfully placed in MQ2 before the consumer from MQ1 updates its position.

This prevents data duplication and loss.

That makes perfect sense.

You're building a checkpoint right into the middle of the workflow to guarantee integrity.

So let's dive into the data aggregation service itself.

You said it's powered by the Mac produced paradigm modeled as a DA, a directed acyclic graph.

Right.

The DAG structure lets us break the logic into smaller managed computing units.

You have a map node, which reads from MQ1, cleans the data and partitions it, maybe based on a dead modulo two or something.

So the map node is like a traffic cop, making sure similar data sticks together.

Exactly.

Then an aggregate node takes these partitions and counts the clicks per aided in its local memory.

Finally, a reduced node takes results from multiple aggregate nodes to produce the final outcome.

How does that work for the top -end list?

For the top -end adds, each aggregate node calculates its local top -end list using a min -heap structure.

Then the reduced node just has to efficiently merge those smaller pre -sorted lists into the final global top -end result every minute.

That's a clever way to avoid a massive global sort.

Okay, let's shift gears now to some of the bigger architectural choices.

Our design so far sounds a bit like the Lambda architecture.

You know, a stream path for real time and a batch path for historical data.

It does, but Lambda has a serious operational drawback.

You have to maintain two completely different code bases, one for batch, one for streaming.

The preferred, more modern approach we should use here is the Kappa architecture.

What makes Kappa so much better for this?

Kappa relies on a single stream processing engine for everything, for both real -time aggregation and historical reprocessing.

If we need to fix a bug, we don't switch to a separate batch system, we just replay the raw event stream from MQ1 through the same aggregation logic.

It simplifies maintenance enormously.

But you still need to do that historical replay sometimes.

If there's a major bug in the counting logic, how do you recalculate months of data without crushing the live 50 ,000 QPS system?

You use a dedicated recalculation service.

It's a batch job that pulls raw data from archival storage, but it feeds this historical data into a dedicated, isolated aggregation service instance.

The results are then pushed through MQ2.

That isolation is key.

So the intense computation of the replay doesn't degrade the latency of the live system.

Let's talk about time itself, because it's always deceptive in distributed systems.

We have event time versus processing time.

Since accuracy for billing is everything, I'm assuming we have to stick with event time when the click actually happened.

Correct.

Event time gives us the true business accuracy, but that forces us to deal with events that arrive late because of network delays.

We solve this using a concept called a watermark.

Okay, explain the watermark trade -off.

The watermark defines how long we keep an aggregation window open after the event time has passed.

Let's say we set a watermark of 15 seconds.

We'll wait that extra 15 seconds before closing the window just to let slightly delayed events catch up.

So it improves accuracy.

At the cost of latency.

The longer the watermark, the more accurate the data, but the longer your end -to -end delay.

It's a classic trade -off.

And we use two different types of aggregation windows, depending on the job.

For the per -minute click count, we use a tumbling window fixed, non -overlapping chunks of time.

But for the top -end and the last M minutes query, where the window is constantly moving, we use a sliding window.

Okay, let's spend some real time on the trickiest part, which you mentioned earlier.

Once delivery.

This is where the engineering complexity really spikes.

It is by far the most critical feature.

We just can't afford the financial risk of standard, at least once delivery, which can duplicate data if a server fails at just the wrong moment.

Right, if it processes the event, but crashes before it can commit its progress.

Exactly.

So how do we engineer that final robust, Exactly.

what guarantee using the two message queues and external storage?

We use a mechanism that's like a distributed transaction.

There are three atomic steps.

First, the aggregation service processes the data and sends the result to the downstream queue, MQ2.

Okay.

Second, the service waits for an acknowledgement from MQ2 that the write was successful.

Only after getting that acknowledgement does it do the third step, finally saving its upstream offset, where it left off an MQ1 to durable external storage like HDFS or S3.

And if any of those three steps fail?

The whole unit of work is rolled back.

If the aggregator crashes after writing to MQ2, but before saving the offset, when it restarts, it reads the old offset from S3 and reprocesses the event.

No duplication, because the previous attempt was never fully committed.

The operational overhead to manage that three -way commit must be immense, but it's non -negotiable for billing.

It is.

It's complex, but you have to do it.

Let's talk about scaling for that 30 % yearly growth, starting with the message queue itself.

For Kafka, the producers are easy.

Just add more servers.

For consumers, scaling means rebalancing partitions.

To optimize this, we have to use the add as the hashing key to keep related events together.

And we should preallocate a very generous number of partitions.

And to scale the aggregation service, the actual MapReduce engine?

Horizontal scaling is the foundation.

Just add more nodes.

To increase throughput within a node, we can use multi -threading, dedicating threads to different addeds, or even multi -processing managed by something like Apache Yarn.

And since we chose Cassandra, the database scaling seems more straightforward, right?

Relatively, yes.

Cassandra is built for this.

It handles horizontal scaling natively with consistent hashing.

You add a new node, and data just automatically rebalances across the cluster.

No manual sharding.

This brings up the inevitable what if question, the hotspot issue.

What happens if one massive advertiser buys a Super Bowl ad and clicks for that single adage to swamp the one node assigned to it?

That's a huge problem.

That one server will fail or suffer catastrophic latency.

The mitigation is a resource manager that's always watching for overload.

If a node detects it's getting a crazy volume for one added, it requests help.

The resource manager then dynamically allocates extra dedicated nodes just to process the traffic for that one popular ad.

See, dynamic adaptive partitioning to fight fire with fire.

Very cool.

Okay, finally, fault tolerance.

Since the aggregation state is in memory for speed, if a node fails, you lose that state.

How do you recover quickly?

We use a system status snapshot.

Periodically, we save the critical in -memory state like the partial click counts or the current top -end lists and the current upstream offset altogether.

If a node fails, a new one just loads the latest snapshot and only needs to replay events that arrived after that point.

It minimizes the replay time dramatically.

And to ensure correctness over the long haul, we rely on reconciliation.

Reconciliation is our final data integrity check.

We run a low priority daily batch job that does a definitive count on all the raw events.

We then compare that to the real -time aggregation output.

It ensures long -term consistency and provides that final layer of confidence for billing.

This was a deep dive into some serious complexity.

We went from the initial constraints at one clicks a day to the API, the star schema, the move to Kappa architecture, the incredibly tricky business of exactly once delivery and strategies for growth, hotspots, and failures.

What really stands out here is that this kind of system is a quintessential big data pipeline.

Understanding these trade -offs, accuracy versus latency, the difficulty of guaranteed delivery, that's really the key takeaway for mastering distributed systems design.

Yeah, if you just consider the sheer operational difficulty of implementing that distributed transaction, making three separate systems commit atomically in a high -throughput environment, that's where theory meets some very complex real -world application.

That complexity is what makes the system so valuable.

We hope this deep dive provided the knowledge shortcut you were looking for.

Thank you for joining us.

And a warm thank you from the last -minute lecture team.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Building a scalable ad click event aggregation system requires managing massive data flows while ensuring precise measurement of advertising performance metrics like click-through rate and conversion rate, which directly influence real-time bidding and billing accuracy. Systems operating at internet scale must process approximately one billion daily events with average throughput of 10,000 queries per second and peak demand reaching 50,000 queries per second, all while maintaining end-to-end latency measured in minutes rather than seconds. The architecture achieves this through asynchronous message-driven design, where Kafka decouples log collection from downstream aggregation services, allowing each component to scale independently and tolerate failures without cascading failures across the system. NoSQL databases like Cassandra provide the write-heavy throughput and horizontal scalability required for time-series event data, with virtual node configurations enabling seamless node addition as load grows. Aggregation logic itself leverages the MapReduce paradigm organized as a Directed Acyclic Graph, enabling flexible data transformations across multiple processing stages. Two window types serve different analytical needs: tumbling windows aggregate counts over fixed, non-overlapping intervals, while sliding windows compute rolling metrics such as the top N most-clicked advertisements within recent time periods. Achieving exactly-once processing semantics—critical for billing where duplicate charges are unacceptable—demands careful coordination of Kafka offset commits only after successful downstream processing completion. Processing events by their original timestamp rather than arrival time requires implementing watermark techniques that gracefully handle delayed data arriving after window closure. The Kappa architecture unifies real-time aggregation with historical replay capabilities using a single stream processing engine, eliminating the need for separate batch systems. Supporting infrastructure includes state snapshotting for rapid recovery from failures and dynamic resource allocation that detects and mitigates hotspots when certain advertisements generate disproportionate traffic volumes.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 6: Ad Click Event Aggregation Design

Related Chapters