Chapter 5: Metrics Monitoring & Alerting System Design

The system is designed for massive scale: it supports 100 million daily active users and enforces a one-year retention policy in which high-resolution raw data is kept for 7 days and then progressively downsampled for long-term storage. The architecture consists of five core functional blocks: data collection, data transmission, data storage, alerting, and visualization.

Data modeling is based on time series: each metric is uniquely identified by a name and a set of descriptive labels. This model must sustain a constantly heavy write load and a highly spiky read load, so a specialized time-series database such as InfluxDB or Prometheus is critical; general-purpose relational databases like MySQL are poorly suited to the required volume and to time-series operations such as computing moving averages.

Data collection can follow either a pull model, where a Metrics Collector scrapes data from service endpoints discovered via systems like etcd or ZooKeeper, or a push model, where a collection agent on the source server aggregates and pushes data. The push model is often favored in environments with complicated firewall setups or for short-lived batch jobs.

To ensure high availability and prevent data loss when the database is temporarily unavailable, the system places a distributed queue such as Kafka between the data collection and processing services, decoupling them and enabling partitioning by metric name or tags. Aggregation can happen at three points: early at the collection agent for simple calculations; in the ingestion pipeline using stream processors like Flink, which reduces storage write volume at the cost of raw-data precision; or at query time, which retains full precision but can slow queries.
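The downsampling step described above can be sketched in a few lines: raw samples are grouped into fixed-width time buckets, and each bucket is replaced by a single averaged sample. This is a minimal illustration under assumed names (`downsample`, a 60-second bucket), not the scheme of any particular database.

```python
from statistics import mean

def downsample(points, bucket_seconds):
    """Group (timestamp, value) samples into fixed-width buckets and
    keep one averaged sample per bucket (lower resolution, less storage)."""
    buckets = {}
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return sorted((ts, mean(vals)) for ts, vals in buckets.items())

# Four raw samples collapse into one averaged sample per minute.
raw = [(0, 10.0), (15, 20.0), (30, 30.0), (60, 40.0)]
print(downsample(raw, 60))  # [(0, 20.0), (60, 40.0)]
```

A real pipeline would apply this in stages (e.g. raw → 1-minute → 1-hour resolution) as data ages toward the one-year retention boundary.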
The Query Service sits in front of the time-series database, optionally adding a cache layer for performance, while the visualization layer, exemplified by tools like Grafana, renders the data.

Finally, the Alerting System centers on an Alert Manager that evaluates rules defined in YAML config files, checks alert states stored in a key-value database, and filters and merges alerts that fire within a short window before routing notifications, often via Kafka, to channels such as email and PagerDuty.
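A rule of the kind the Alert Manager evaluates might look like the following Prometheus-style YAML; the metric name, threshold, and durations here are illustrative assumptions, not values from the text:

```yaml
groups:
  - name: instance-health
    rules:
      - alert: HighCpuUsage
        expr: avg_over_time(cpu_usage_percent[5m]) > 90   # rule condition
        for: 10m          # must hold for 10 minutes before firing
        labels:
          severity: page  # used by the Alert Manager for routing
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
```

The `for` duration and routing labels are what let the Alert Manager suppress flapping alerts and merge related ones before notifying a channel such as PagerDuty.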