Chapter 5: Metrics Monitoring & Alerting System Design


Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are tackling something absolutely fundamental for any large-scale operation: designing a scalable metrics monitoring and alerting system.

That's right.

And our mission today is really to kind of reverse engineer the thinking behind systems like Datadog or Prometheus, but with a twist.

We're designing this for internal use at a massive corporation.

So this isn't just a theoretical exercise.

Not at all.

This is about getting the crystal-clear visibility you need to keep things running, to guarantee high availability when, you know, failure just isn't an option.

And the scale you've laid out for us is, well, immense.

We're talking about building for an infrastructure that's serving a hundred million daily active users.

Yeah.

And let's put that in engineering terms.

That means we're watching about 1,000 server pools, each with a hundred machines.

So that's a hundred thousand servers.

And when you think about collecting data every, say, 10 seconds from each one, that's just a constant, unrelenting flood of data.

Exactly.

Millions and millions of operational metrics are being written to our system every single day.

That heavy, continuous write load, that's the central problem we have to solve architecturally.
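
As a rough aside, the fleet size above can be turned into a write rate. This is only a back-of-the-envelope sketch: the pool count, machines per pool, and 10-second interval come from the discussion, while `metrics_per_machine` is a hypothetical parameter, since the number of metrics each machine emits is not stated here.

```python
# Back-of-the-envelope write rate for the fleet described above.
SERVER_POOLS = 1_000
MACHINES_PER_POOL = 100
COLLECTION_INTERVAL_S = 10  # "every, say, 10 seconds"


def writes_per_second(metrics_per_machine):
    """Data points written per second across the whole fleet.

    metrics_per_machine is an ASSUMPTION supplied by the caller;
    the transcript does not give a figure for it.
    """
    servers = SERVER_POOLS * MACHINES_PER_POOL
    return servers * metrics_per_machine // COLLECTION_INTERVAL_S


for m in (1, 10, 100):
    rate = writes_per_second(m)
    print(f"{m:>3} metrics/machine -> {rate:,} writes/s, {rate * 86_400:,} points/day")
```

Even at a single metric per machine the write stream never pauses, which is why the rest of the design is organized around the write path.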

Let's be really clear about what we're collecting, then.

We are laser-focused on the low-level stuff: the operational system metrics, like CPU load, memory usage, request counts, that sort of thing.

Right.

And crucially, we're not talking about logs or business metrics or that whole world of distributed tracing.

That's a whole other deep dive for another day.

It is.

And managing this fire hose of data means our retention policy has to be really smart, really tiered.

This is a huge design challenge.

So how does that work?

We keep the raw, super high-res data for only seven days.

After that, we roll it up to a one-minute resolution and keep that for 30 days.

And for long-term trends, for anything up to a year, we only keep it at a one-hour resolution.

That's the only way to make the storage manageable.
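
The tiered policy just described can be written down as a small lookup. This is a minimal sketch: the tier boundaries are the ones stated above, the 10-second raw resolution reuses the example collection interval, and the name `resolution_for_age` is made up for illustration.

```python
from datetime import timedelta

# (maximum age, stored resolution) pairs from the retention policy above.
RETENTION_TIERS = [
    (timedelta(days=7), timedelta(seconds=10)),   # raw data (10 s is the example interval)
    (timedelta(days=30), timedelta(minutes=1)),   # first roll-up
    (timedelta(days=365), timedelta(hours=1)),    # long-term trends
]


def resolution_for_age(age):
    """Return the resolution kept for data of this age, or None once it has expired."""
    for max_age, resolution in RETENTION_TIERS:
        if age <= max_age:
            return resolution
    return None


print(resolution_for_age(timedelta(days=3)))    # raw tier
print(resolution_for_age(timedelta(days=90)))   # hourly tier
```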

Plus the functional requirements are incredibly strict.

Reliability is paramount.

You can't miss a critical alert ever.

Nope.

And you need low latency for the engineers staring at dashboards.

And the whole thing has to be flexible.

You need to be able to swap components out without, you know, bringing down the entire monitoring world.

Okay.

Let's get into the fundamentals.

Where do we start?

So any good monitoring system really breaks down into five key parts.

First, you have data collection, the agent grabbing the data.

Got it.

Second, data transmission, making sure it gets there reliably.

Third is data storage, which is all about time series.

Fourth is alerting where you spot the problems.

And finally, visualization.

Exactly.

Making it all understandable for a human.

Right.

Let's start with the data model then, because that feels like it dictates everything that follows.

It really does.

At its core, all this metrics data is a time series.

It's a stream of values, each with a timestamp.

And each series is uniquely identified by its name and a set of labels, right?

Precisely.

So a single data point might have a name like cpu.load.

Then you have these key-value labels for context, like host:i631 and env:prod.

And then of course the timestamp and the actual value, maybe 0.29.

And those labels are the superpower here.

They're how you slice and dice the data.

They are.

If you want the average CPU load across your entire US West region, you just query for everything with the label region:us-west.

The system finds them all and does the math.
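
To make the data model concrete, here is a minimal in-memory sketch of a labeled data point and a label-based average. The class and function names, the extra hosts, and the timestamps are all hypothetical; a real time series database indexes labels rather than scanning a list.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DataPoint:
    name: str            # metric name
    labels: frozenset    # set of (key, value) pairs
    timestamp: int       # Unix seconds
    value: float


def point(name, labels, timestamp, value):
    return DataPoint(name, frozenset(labels.items()), timestamp, value)


def average(points, name, **selector):
    """Average every point whose name matches and whose labels include the selector."""
    wanted = set(selector.items())
    matched = [p.value for p in points if p.name == name and wanted <= p.labels]
    return mean(matched) if matched else None


# Illustrative data only.
points = [
    point("cpu.load", {"host": "i631", "env": "prod", "region": "us-west"}, 1613707265, 0.29),
    point("cpu.load", {"host": "i632", "env": "prod", "region": "us-west"}, 1613707265, 0.41),
    point("cpu.load", {"host": "i900", "env": "prod", "region": "us-east"}, 1613707265, 0.90),
]
print(average(points, "cpu.load", region="us-west"))
```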

What's really interesting to me is the data access pattern.

You mentioned this mismatch.

Oh, it's a huge mismatch.

You have this enormous, constant, super heavy write load, all those metrics pouring in second after second.

But the reads.

The reads are totally different.

They're spiky, bursty.

It's someone loading a dashboard or an alert firing.

This immediately tells us what kind of database we can't use.

So that brings up the obvious question.

Why not just a big cluster of something like MySQL?

They're robust.

Everyone knows them.

What's the killer bottleneck?

Well, it's really threefold.

First, a relational database just isn't optimized for time series math.

A simple rolling average becomes a really complex, heavy SQL query.

Okay, that's one.

Second, they just struggle under that relentless high write load.

But the biggest killer, the real bottleneck is indexing.

The labels.

Yes.

Every unique combination of labels (host, environment, service name) creates a brand new, unique time series.

The total number of those is what we call cardinality.

And with high cardinality.

Indexing all of those tags in a standard database just becomes prohibitively expensive.

It completely destroys performance.
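
A quick way to see why cardinality grows so fast: in the worst case, the number of distinct series for one metric is the product of the number of values each label can take. The label names and counts below are hypothetical.

```python
from math import prod


def worst_case_cardinality(label_values):
    """Upper bound on distinct series for one metric name:
    the product of the number of possible values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a single metric.
labels = {
    "host": [f"i{n}" for n in range(1000)],
    "env": ["prod", "staging"],
    "service": ["api", "web", "worker"],
}
print(worst_case_cardinality(labels))  # 1000 * 2 * 3 = 6000

# One extra label with 50 possible values multiplies the bound by 50.
labels["endpoint"] = [f"/route/{n}" for n in range(50)]
print(worst_case_cardinality(labels))  # 300000
```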

So the answer is a specialized tool for the job.

A time series database, or TSDB.

Exactly.

Things like InfluxDB or Prometheus.

They are engineered from the ground up for this specific problem.

And they can handle the load.

The source material points out that a high-end TSDB setup can handle over 250,000 writes per second.

So yes, they can definitely keep up.

The whole game becomes about keeping that time series cardinality as low as possible.

All right, let's sketch out the high level design then.

Verbally.

Okay.

So you start with the metrics source, your applications.

That data flows into a metrics collector.

The collector then feeds the time series database.

Right.

And the TSDB is the heart.

It powers a query service, which is this middle layer that feeds two things.

The alerting system.

Which sends out pages via PagerDuty or emails.

And the visualization system.

Your dashboards.

Perfect.

So let's dive into that first piece.

Metrics collection.

This is where we hit that classic debate, isn't it?

It is the fundamental architectural choice.

Let's break down the pull model first.

How does that work?

In the pull model, you have these dedicated collector servers that periodically reach out and, well, pull the metrics data from your services.

Usually over HTTP, from an endpoint like /metrics.

But how does it know which servers to pull from?

That's where service discovery comes in.

It uses a system like etcd or ZooKeeper as a central registry to know what's running and where it is.

So to scale that for our hundred thousand servers, we can't just have one giant collector.

No way.

You use a whole pool of them and you use a clever technique called consistent hashing.

I love this.

It's brilliant because it ensures each server is only monitored by one collector.

Right.

Exactly.

And if a collector dies or you add a new one, it minimizes the reshuffling.

It prevents you from pulling duplicate data or having massive reconfiguration headaches.

It's really elegant.
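
The consistent hashing idea can be sketched as a hash ring with virtual nodes. This is an illustrative sketch, not how any particular collector implements it: the `HashRing` class, the collector names, and the choice of MD5 and 100 virtual nodes are all assumptions.

```python
import bisect
import hashlib


def _hash(key):
    # Any stable hash works; MD5 is used here only for its even spread.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Maps each server to exactly one collector on a hash ring."""

    def __init__(self, collectors, vnodes=100):
        # Each collector is placed at many points ("virtual nodes") on the ring.
        self._ring = sorted(
            (_hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def collector_for(self, server):
        # Walk clockwise to the first virtual node after the server's hash.
        idx = bisect.bisect(self._keys, _hash(server)) % len(self._keys)
        return self._ring[idx][1]


servers = [f"server-{i}" for i in range(1000)]
before = HashRing(["collector-1", "collector-2", "collector-3", "collector-4"])
after = HashRing(["collector-1", "collector-2", "collector-3"])  # collector-4 died

moved = [s for s in servers if before.collector_for(s) != after.collector_for(s)]
print(f"{len(moved)} of {len(servers)} servers changed collector")
```

Only the servers that were assigned to the removed collector get reassigned; every other server keeps its collector, which is the "minimal reshuffling" property mentioned above.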

Okay.

So what's the alternative?

The push model.

With the push model, you flip it around.

You install a lightweight agent on every single one of your servers, and that agent collects metrics locally and then proactively pushes them to a central metrics collector endpoint.

And of course, to handle all that incoming traffic, the collector itself has to be a big auto-scaling cluster behind a load balancer.

So when you compare them, pull versus push, there's no silver bullet.

Not at all.

It's all about trade-offs.

Pull is great for debugging.

If a collector can't connect to a server to pull metrics, you know instantly that server is down.

Right.

But push is often better for really short-lived jobs, like a batch process that might spin up and die before a scheduled pull ever happens.

And I imagine push also wins in really complex network environments.

Oh, absolutely.

With tricky firewalls and multiple data centers, the push model is simpler.

The central collector just has to accept incoming traffic from anywhere.

It doesn't need firewall rules to let it initiate connections out to thousands of different machines.

Okay.

So we've collected the data.

Now we have to get it to the database reliably.

You mentioned the risk of data loss.

If the TSDB goes down for a minute, is a message queue the only answer? It seems like it adds complexity.

That's a great question.

And it really gets to the heart of reliability.

We introduce queues, specifically something like Kafka, precisely because we cannot lose this data.

So it's a buffer.

It's more than a buffer.

It decouples the services.

The collectors just fire and forget their data to Kafka.

If the database downstream is having a bad day, Kafka just holds onto the data safely until it's back online.

No data loss.

Right.

And then you have consumers pulling from Kafka.

Yeah.

You'd use a stream processing engine like Storm or Flink.

They act as consumers, pull the data from Kafka, and then they're the ones responsible for writing it to the TSDB.

And you can scale Kafka by partitioning the data streams.

You can.

You could even have higher priority partitions for more critical metrics to ensure they get processed faster.

It gives you a lot of control.
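
The buffering and partitioning behavior can be illustrated with an in-memory stand-in. To be clear, this is not a Kafka client and uses no Kafka API: `InMemoryTopic` and `drain` are hypothetical names, and the sketch only shows the two properties discussed above, namely that a metric name maps to a stable partition and that points stay queued while the database is unavailable.

```python
from collections import deque
import zlib


class InMemoryTopic:
    """Toy stand-in for a partitioned topic (NOT a real Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]

    def publish(self, metric_name, point):
        # The same metric name always lands in the same partition.
        idx = zlib.crc32(metric_name.encode()) % len(self.partitions)
        self.partitions[idx].append(point)


def drain(topic, write_to_tsdb):
    """One consumer pass: write buffered points, leaving them queued on failure."""
    written = 0
    for partition in topic.partitions:
        while partition:
            try:
                write_to_tsdb(partition[0])
            except ConnectionError:
                break              # database is down; the point stays in the queue
            partition.popleft()    # remove only after a successful write
            written += 1
    return written


def tsdb_down(point):
    raise ConnectionError("TSDB unavailable")


topic = InMemoryTopic(num_partitions=4)
for i in range(3):
    topic.publish("cpu.load", ("cpu.load", i, 0.29))

print(drain(topic, tsdb_down))       # nothing written, nothing lost
stored = []
print(drain(topic, stored.append))   # database is back; backlog is written
```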

So this pipeline is pretty standard for high-scale designs.

Yeah, it is.

I mean, there are some specialized in-memory TSDBs like Facebook's Gorilla that are designed to handle this differently.

But for most systems using persistent storage, Kafka is the key to decoupling and reliability.

Okay.

Moving on, let's talk about aggregation.

Our retention policy is tiered, so we have to roll data up.

Where in this pipeline does that happen?

You've got three main options.

The first and simplest is right at the source, on the collection agent itself.

So that's client side.

Right.

The agent can do simple roll-ups, like counting something over a minute, which reduces the amount of data you send over the network.

The downside is it's not very flexible.
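
A client-side roll-up of the kind described here might look like the sketch below: the agent counts events per minute and ships one number per metric per minute instead of every event. `MinuteCounter` is a hypothetical name, and timestamps are plain Unix seconds.

```python
from collections import Counter


class MinuteCounter:
    """Agent-side roll-up: count events per (metric, minute), flush finished minutes."""

    def __init__(self):
        self._counts = Counter()

    def record(self, metric, timestamp):
        minute_start = timestamp - timestamp % 60
        self._counts[(metric, minute_start)] += 1

    def flush(self, now):
        """Return and forget the counts for minutes that ended before `now`."""
        current_minute = now - now % 60
        done = {k: v for k, v in self._counts.items() if k[1] < current_minute}
        for key in done:
            del self._counts[key]
        return done


agent = MinuteCounter()
for ts in (0, 30, 59, 61):
    agent.record("requests", ts)
print(agent.flush(now=65))   # {('requests', 0): 3}
```

Four events become a single data point for the first minute, which is the network saving; the inflexibility is that the per-event detail is gone before it ever leaves the host.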

The second option would be in the middle, in that ingestion pipeline with Flink or Storm.

Yep.

This can massively reduce the write load on your final database.

The trade-off is huge, though.

You lose that raw, high-precision data forever, and handling data that arrives late gets really tricky.

And the third option.

Is to do it on the query side.

You just write all the raw, high-fidelity data to storage and only do the aggregation when a user or an alert actually queries for it.

So that gives you maximum flexibility.

Total flexibility and zero data loss.

The obvious downside is that your queries are slower, because you're running that computation over the huge raw data set every single time.

That choice leads us right into the query service and a cache layer.

The query service acts as a kind of middleman.

It's a crucial abstraction layer.

It decouples your visualization tools and your alerting clients from the actual database.

Meaning you could swap out your TSDB for a different one later without breaking everything that depends on it.

Exactly.

And then you add a cache layer, maybe with Redis, to store the results of common queries, the stuff that powers your main dashboards.

It saves the database from getting hammered with the same requests over and over.
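
The caching idea can be sketched as a small time-to-live cache in front of the query function. This is a simplified in-process stand-in: a plain dictionary plays the role the discussion assigns to Redis, and `CachedQueryService`, `run_query`, and the 10-second TTL are hypothetical choices.

```python
import time


class CachedQueryService:
    """Query service with a TTL cache (a dict stands in for an external cache)."""

    def __init__(self, run_query, ttl_seconds=10.0, clock=time.monotonic):
        self._run_query = run_query   # function that actually hits the TSDB
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}              # query string -> (expires_at, result)

    def query(self, q):
        now = self._clock()
        hit = self._cache.get(q)
        if hit is not None and hit[0] > now:
            return hit[1]             # fresh cached result; the TSDB is not touched
        result = self._run_query(q)
        self._cache[q] = (now + self._ttl, result)
        return result


db_calls = []
service = CachedQueryService(lambda q: db_calls.append(q) or 0.29)
for _ in range(5):                    # e.g. five engineers loading the same dashboard
    service.query("avg(cpu.load) by region")
print(len(db_calls))                  # 1
```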

But there's a case against the query service, too.

Why might you not build this custom layer?

Because, honestly, modern tools are really good.

A powerful TSDB and an off-the-shelf visualization system like Grafana have incredibly robust, optimized plugins.

They can often talk to each other so efficiently that building your own custom query service and cache is just reinventing the wheel.

The query language itself is a big factor here, too.

A huge factor.

Trying to compute something like an exponential moving average in standard SQL is a mess.

It's incredibly complex.

Whereas a language like Flux from InfluxDB is designed specifically for these kinds of time series operations.

The syntax is far simpler, more readable, and way more efficient.
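
For reference, this is what an exponential moving average actually computes, written as a plain Python sketch rather than in SQL or Flux. Each output depends on the previous output, which is part of why it is awkward to express in a set-based query language. Seeding with the first value is one common convention, assumed here.

```python
def exponential_moving_average(values, alpha):
    """EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1}, seeded with the first value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ema = []
    for x in values:
        # Each step needs the previous result, so this is inherently sequential.
        ema.append(x if not ema else alpha * x + (1 - alpha) * ema[-1])
    return ema


print(exponential_moving_average([1.0, 2.0, 3.0], alpha=0.5))  # [1.0, 1.5, 2.25]
```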

Let's spend some time with the storage layer optimizations, because that's where the magic really happens to handle this scale.

It really is.

And remember that Facebook research we talked about: 85% of all queries are for data from the last 26 hours.

So the whole database needs to be optimized for lightning-fast access to very recent data.

Fundamentally.

And to manage the sheer size, you rely heavily on encoding and compression.

Let's take double delta encoding for timestamps as an example.

Okay, break that down for me.

Instead of storing the full 32-bit timestamp for every single data point, these huge numbers one after another, you get clever.

If your data points are always 10 seconds apart.

You just store the difference.

Yeah.

The delta.

So you store 10, then 10, then 10.

Right.

But then you go a step further, you store the difference between the deltas, the double delta.

Since the interval is constant, the difference between 10 and 10 is 0.

You can store that 0 in just a few bits.

Wow.

So you go from needing 32 bits to maybe just 4 bits.

It's a transformative level of compression.
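
The double delta idea can be sketched on plain integers. This sketch only shows the arithmetic: it keeps the first timestamp and the first delta in full, then stores differences between consecutive deltas. The variable-length bit packing that turns those small numbers into a few bits each is not shown, and the sample timestamps are made up.

```python
def double_delta_encode(timestamps):
    """Return [first timestamp, first delta, delta-of-delta, delta-of-delta, ...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    delta_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + delta_of_deltas


def double_delta_decode(encoded):
    """Rebuild the original timestamps from the encoded form."""
    if len(encoded) < 2:
        return list(encoded)
    timestamps = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Points 10 s apart, with one arriving a second late.
ts = [1610087371, 1610087381, 1610087391, 1610087402, 1610087412]
encoded = double_delta_encode(ts)
print(encoded)  # [1610087371, 10, 0, 1, -1]
assert double_delta_decode(encoded) == ts
```

After the first two entries, every stored value is tiny (zero when the interval is perfectly regular), which is what makes the few-bits-per-point encoding possible.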

And that leads right into downsampling, which is just putting our retention policy into practice.

Taking that high-res 10-second data and, as it gets older, rolling it up into bigger buckets.

Exactly.

Into 30-second data, then 1-minute data after 7 days, and finally into that 1-hour data for long-term retention.

And we can even move that really old, rarely used data to cheaper cold storage, like an object store, to save even more money.
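
Downsampling itself is a simple bucketing operation. The sketch below averages points into fixed-width buckets; using the average is an assumption made for illustration, since a real system might keep the max, min, or sum instead, and `downsample` is a hypothetical name.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width buckets.

    Returns a list of (bucket start, average value), ordered by time.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


# 10-second points rolled up to 1-minute resolution.
raw = [(0, 1.0), (10, 2.0), (20, 3.0), (60, 4.0), (70, 6.0)]
print(downsample(raw, 60))  # [(0, 2.0), (60, 5.0)]
```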

Okay.

Last major component, the alerting system.

The alert manager is the core.

It's constantly fetching your alerting rules, which are usually defined in a simple config file like YAML.

And it uses the query service to check the metrics against the thresholds in those rules.

And if a rule is violated, say a server is down for more than 5 minutes, it creates an alert event.
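
The "down for more than 5 minutes" part of a rule can be sketched as a check that the condition has held continuously for a given duration. This is an illustrative sketch, not the rule format or evaluation logic of any specific alert manager; `should_fire` and the sample data are hypothetical.

```python
def should_fire(samples, is_violating, for_seconds):
    """True if the condition has been violated continuously for at least
    `for_seconds`, up to and including the most recent sample.

    samples: list of (timestamp, value) in time order.
    """
    violating_since = None
    for ts, value in samples:
        if is_violating(value):
            if violating_since is None:
                violating_since = ts     # start of the current violation streak
        else:
            violating_since = None       # any healthy sample resets the streak
    if violating_since is None:
        return False
    return samples[-1][0] - violating_since >= for_seconds


# "up" is 1 when the server answers its health check, 0 when it does not.
up = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 0)]
print(should_fire(up, lambda v: v == 0, for_seconds=300))  # True
```

Requiring the condition to hold for a duration avoids paging on a single failed check.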

But the alert manager is smarter than just that.

It does things like deduping.

Yes.

Filtering, merging, and deduping are critical.

If three servers in the same cluster all start failing their health checks, you don't want three separate pages.

You want one single notification that says this cluster is in trouble.

Precisely.

It prevents alert fatigue.
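
Merging and deduplication can be sketched as grouping alerts by a shared key. The field names (`alertname`, `cluster`, `host`) and the function `group_alerts` are hypothetical; the point is only that several alert events collapse into one notification per group.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Merge alerts sharing the same group key into one notification each.

    Collecting hosts in a set also drops exact repeats of the same alert.
    """
    grouped = defaultdict(set)
    for alert in alerts:
        key = tuple(alert[k] for k in group_by)
        grouped[key].add(alert["host"])
    return [
        {**dict(zip(group_by, key)), "hosts": sorted(hosts)}
        for key, hosts in sorted(grouped.items())
    ]


alerts = [
    {"alertname": "HealthCheckFailing", "cluster": "c1", "host": "h1"},
    {"alertname": "HealthCheckFailing", "cluster": "c1", "host": "h2"},
    {"alertname": "HealthCheckFailing", "cluster": "c1", "host": "h3"},
    {"alertname": "HealthCheckFailing", "cluster": "c1", "host": "h1"},  # repeat
]
for notification in group_alerts(alerts):
    print(notification)   # one notification listing h1, h2, h3
```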

It also handles things like who gets the alert and retrying if a notification fails to send.

And to make sure you don't send duplicate alerts or lose track of one, you need an alert store.

You do.

It's usually a key-value database, like Cassandra, that just maintains the state of every single alert.

Is it pending?

Is it currently firing?

Or has it been resolved?

The alert manager publishes those alert events to Kafka, and then alert consumers pull from Kafka and send the actual notification.

It could be an email, PagerDuty, whatever you need.

And it's worth saying again: that build-versus-buy decision is huge here.

For visualization and alerting, unless your needs are truly bizarre, off-the-shelf tools are almost always the right call.

So that brings us to the final design.

We've walked through all the critical decisions: pull versus push for collection.

Using Kafka for a reliable data pipeline, choosing a specialized TSDB over a relational one.

And then using compression and downsampling to manage the data growth.

And finally, making smart choices about what to build versus what to buy.

And if we circle back to that key insight, that 85% of queries hit data from the last 26 hours, it brings up a really interesting final thought for you to chew on.

Go for it.

How would you fundamentally change your entire data pipeline if you made a really aggressive policy decision to strictly discard all raw high-resolution data after just 12 hours and rely only on aggregated roll-ups for anything older than that?

Wow.

That would mean your aggregation process itself would have to become the most reliable, most fault-tolerant part of the entire system.

That's the real architectural challenge.

Your focus on reliability shifts entirely up the pipeline.

That's a fascinating thought.

We hope this deep dive has given you the clarity you need to think about high-scale metrics monitoring.

Congratulations on getting this far.

Now, give yourself a pat on the back.

Good job.

Thank you from the Last Minute Lecture team for joining us.

We hope you feel significantly more informed on how to build a scalable and reliable metrics infrastructure.

Join us next time for another deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Designing a scalable metrics monitoring and alerting system requires architecting multiple interconnected components capable of handling massive operational demands across large infrastructure environments. The system collects infrastructure metrics such as CPU utilization, memory consumption, and request throughput while deliberately excluding business metrics, log aggregation, and distributed tracing functionality. At the core, the architecture addresses the challenge of supporting 100 million daily active users with differentiated data storage strategies: raw high-resolution data persists for seven days while older data undergoes systematic downsampling to reduce storage overhead while maintaining a complete one-year retention window. Time-series data modeling, where metrics are identified by name and associated label sets, becomes essential for managing the system's dual performance demands of sustained high-volume writes and unpredictable read spikes. Specialized time-series databases such as InfluxDB or Prometheus outperform traditional relational systems because they natively optimize for time-ordered queries and temporal calculations like moving averages. Data collection strategies diverge into two primary approaches: pull models employ a metrics collector that discovers endpoints through service discovery systems like etcd or ZooKeeper, while push models deploy lightweight agents that aggregate data locally before transmission, particularly valuable in firewalled environments or ephemeral workloads. Distributed message queues like Kafka provide decoupling between collection and ingestion layers, enabling partitioning by metric name or labels to prevent bottlenecks. 
Aggregation decisions represent crucial architectural tradeoffs: early aggregation at collection agents reduces downstream storage load but sacrifices raw data precision, stream processing frameworks like Flink enable intermediate aggregation during ingestion, while query-time aggregation preserves complete information at the cost of computational overhead during retrieval. The query service interfaces with the time-series database through optional caching layers, while visualization platforms like Grafana render metrics for human interpretation. The alerting subsystem operates through an alert manager that evaluates YAML-defined rules, maintains alert state in key-value stores, deduplicates alerts fired within short windows, and routes notifications through multiple channels including email and incident management platforms like PagerDuty, often leveraging message queues to ensure reliable delivery.
