Chapter 6: Ad Click Event Aggregation Design

Loading audio…

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

If there is an issue with this chapter, please let us know → Contact Us

The primary objective of such a system is to measure the effectiveness of digital advertising, supporting key business metrics like click-through rate (CTR) and conversion rate (CVR), which are essential for Real-Time Bidding (RTB) and ad billing. Initial scope requirements demand handling 1 billion daily events, with average and peak write loads of 10,000 QPS and 50,000 QPS respectively, while maintaining end-to-end latency within a few minutes. The high-level design utilizes asynchronous processing, employing message queues (like Kafka) to decouple components such as the log watcher and the aggregation service, enhancing robustness and independent scalability. Data storage, due to the write-heavy, time-series nature of the workload, is best suited for NoSQL databases, exemplified by Cassandra, which supports horizontal scaling via virtual nodes. The aggregation service itself is built upon the MapReduce paradigm, using a Directed Acyclic Graph (DAG) model to process data, utilizing tumbling windows for fixed-interval counts and sliding windows for calculating metrics like the top N most clicked ads over the last M minutes. A crucial requirement for billing accuracy is implementing exactly-once processing, achieved through complex distributed transactions that save the Kafka consumption offset only after receiving downstream acknowledgment. Furthermore, data accuracy necessitates using event time for aggregation, which in turn requires employing a watermark technique to account for delayed events. The overall architecture follows the Kappa approach, enabling both real-time aggregation and historical data recalculation (replay) using a single stream engine. To ensure service quality, the system includes sophisticated mechanisms for fault tolerance, such as snapshotting the aggregation state for fast recovery, and a resource manager to dynamically allocate extra aggregation nodes to mitigate hotspot issues caused by highly popular ads.