Metrics That Matter: Monitoring Application Health

Metrics that matter include the four golden signals (latency, traffic, errors, saturation), RED metrics (rate, errors, duration), and USE method (utilization, saturation, errors). These help you monitor application health and reliability.

In modern software systems, collecting metrics is easy. Collecting the right metrics is hard. Many teams fall into the trap of monitoring everything, leading to dashboard overload and alert fatigue. The key is to focus on metrics that actually tell you whether your system is healthy and whether users are having a good experience. This guide covers the most important metrics frameworks: the four golden signals, RED metrics, and the USE method.

These frameworks help you answer the fundamental questions of observability: Is my system working? How well is it working? Are users affected? When something goes wrong, these metrics point you toward the problem. To understand metrics properly, it is helpful to be familiar with observability concepts, logging best practices, and distributed tracing.

What you will learn in this tutorial:
✓ The four golden signals: latency, traffic, errors, saturation
✓ RED metrics: rate, errors, duration for service-level monitoring
✓ USE method: utilization, saturation, errors for resource monitoring
✓ Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
✓ Choosing the right metrics for your application
✓ Common metric mistakes and anti-patterns
✓ Alerting on metrics that matter

Why Choosing the Right Metrics Matters

The wrong metrics lead to false confidence, missed outages, and alert fatigue. Teams that monitor everything often miss critical issues because they are buried in noise. Teams that monitor the right metrics can detect problems before users notice and quickly identify root causes.

  • Reduce Alert Fatigue: Fewer, meaningful alerts mean engineers actually respond.
  • Faster Incident Detection: The right metrics tell you something is wrong immediately.
  • Better Root Cause Analysis: Good metrics point toward the problem area.
  • Improved Reliability: Measuring what matters helps you improve user experience.
  • Cost Efficiency: Collecting and storing fewer metrics reduces costs.

The Four Golden Signals

The four golden signals were defined by Google in the Site Reliability Engineering book. They provide a complete picture of service health from a user's perspective.

Signal     | What It Measures             | Key Questions
Latency    | Time to process a request    | Are requests taking too long? Is the user waiting?
Traffic    | Volume of requests           | How much load is the system handling? Is traffic spiking?
Errors     | Rate of failed requests      | Are requests failing? What percentage of users are affected?
Saturation | How full the system is       | Is the system near capacity? Do we need more resources?

1. Latency

Latency measures the time it takes to process a request, and it is the most user-visible metric: users care about how long they wait. It is important to distinguish the latency of successful requests from that of failed requests, since a failed request can return quickly yet still be a bad experience. Track percentiles (p50, p95, p99) rather than averages.

For example, p99 latency of 500ms means 99 percent of requests complete within 500ms, while 1 percent take longer. High latency indicates performance problems, database slowness, or resource contention.
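To make this concrete, here is a minimal, self-contained sketch (the latency samples are illustrative) showing how a single slow outlier distorts the average while percentiles expose it:

```python
# Nearest-rank percentile over raw latency samples; data is illustrative.
def percentile(samples, p):
    """Return the value at percentile p (0-100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # smallest rank covering p percent
    return ordered[rank - 1]

latencies_ms = [40, 45, 50, 52, 55, 60, 70, 90, 120, 900]  # one slow outlier

avg = sum(latencies_ms) / len(latencies_ms)  # pulled up to ~148 ms by the outlier
p50 = percentile(latencies_ms, 50)           # 55 ms: the typical experience
p99 = percentile(latencies_ms, 99)           # 900 ms: the tail users actually feel
```

Real monitoring systems usually compute percentiles from histograms rather than raw samples, but the lesson is the same: the average looks unhealthy for the wrong reason, while p50 and p99 separate the typical case from the tail.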

2. Traffic

Traffic measures the volume of requests hitting your system. This includes requests per second, concurrent users, requests per minute, network I/O, and active sessions. Traffic helps you understand load patterns and capacity needs.

Sudden traffic spikes can indicate a successful marketing campaign, a DDoS attack, or a bug causing infinite loops. Traffic drops can indicate routing problems or service failures.

3. Errors

Errors measure the rate of failed requests. Track error rate as a percentage of total requests. Distinguish between different types of errors such as HTTP 4xx which are client errors, HTTP 5xx which are server errors, exceptions, and timeouts.

The goal is to keep error rates near zero. Sudden error spikes indicate problems that need immediate attention. Track errors by endpoint to identify specific problem areas.
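A minimal sketch of per-endpoint error tracking, assuming counters fed by your request handler (the endpoint names and status codes are illustrative):

```python
from collections import defaultdict

totals = defaultdict(int)   # requests per endpoint
errors = defaultdict(int)   # 5xx responses per endpoint

def record(endpoint, status_code):
    """Record one request outcome; count 5xx responses as service errors."""
    totals[endpoint] += 1
    if status_code >= 500:
        errors[endpoint] += 1

def error_rate(endpoint):
    """Errors as a fraction of total requests for one endpoint."""
    return errors[endpoint] / totals[endpoint] if totals[endpoint] else 0.0

# Illustrative traffic: two of five checkout requests fail with 5xx.
for status in (200, 200, 503, 200, 500):
    record("/checkout", status)
record("/search", 200)
```

Tracking the rate per endpoint, as here, immediately localizes a spike to the problem area instead of averaging it away across the whole service.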

4. Saturation

Saturation measures how full your system is. It answers the question: How much capacity is left? Saturation metrics include CPU utilization, memory usage, disk usage, network bandwidth, connection pool usage, thread pool usage, and queue length.

High saturation indicates that the system is near its limits and may soon become slow or unresponsive. Saturation helps with capacity planning and predicting when you need to scale.
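Utilization alone does not tell you whether work is waiting; pairing it with a queue measurement does. A small sketch with illustrative connection-pool numbers:

```python
def saturation_check(in_use, capacity, queued):
    """Return (utilization 0-1, saturated flag). Saturated means work is waiting."""
    utilization = in_use / capacity
    return utilization, queued > 0

# 45 of 50 pool connections busy AND 12 requests waiting for one:
util, saturated = saturation_check(in_use=45, capacity=50, queued=12)
# High utilization plus a non-empty queue means the pool is a bottleneck
# right now, not merely busy.
```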

RED Metrics for Services

RED metrics, popularized by Tom Wilkie, are designed specifically for monitoring microservices. RED stands for Rate, Errors, and Duration. It is a simplified version of the four golden signals, focusing on the user experience.

Metric   | What It Measures               | Example Question
Rate     | Requests per second            | How many requests is my service handling?
Errors   | Percentage of failed requests  | What percentage of requests are failing?
Duration | Time to process requests       | How long are requests taking?

RED metrics are best for service-level monitoring. Every service should have a dashboard showing its rate, errors, and duration. When an alert fires, you can quickly see which service is having problems and whether it is a latency issue, error spike, or traffic anomaly.
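The three RED numbers can be derived from a window of request events. A self-contained sketch with made-up events, each recorded as (timestamp in seconds, duration in ms, failed flag):

```python
# Illustrative request events observed over a 2.5-second window.
events = [
    (0.0, 30, False), (0.5, 45, False), (1.0, 900, True),
    (1.5, 50, False), (2.0, 60, False),
]
window_s = 2.5

rate = len(events) / window_s                       # Rate: requests per second
error_pct = 100 * sum(failed for _, _, failed in events) / len(events)  # Errors
durations = sorted(d for _, d, _ in events)
p95_duration = durations[max(0, round(0.95 * len(durations)) - 1)]      # Duration
```

In practice a metrics backend does this aggregation for you, but the three derived values are exactly what a per-service RED dashboard shows.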

USE Method for Resources

The USE method, developed by Brendan Gregg, focuses on resource utilization. It is designed for monitoring infrastructure components like servers, databases, and caches. USE stands for Utilization, Saturation, and Errors.

Metric      | What It Measures                  | Examples
Utilization | Average time the resource was busy | CPU usage, memory usage, disk I/O, network bandwidth
Saturation  | Amount of work queued or waiting   | Run queue length, disk wait time, connection queue
Errors      | Number of error events             | Disk errors, network errors, out-of-memory kills

The USE method is best for capacity planning and identifying resource bottlenecks. For example, high CPU utilization indicates the server is busy. High saturation, such as run queue length, indicates the server is overloaded and tasks are waiting.
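The USE checklist can be walked mechanically across resources. A sketch with illustrative snapshot values (in practice these come from OS counters or a node exporter):

```python
resources = {
    "cpu":  {"utilization": 0.95, "saturation": 8, "errors": 0},  # saturation = run queue length
    "disk": {"utilization": 0.40, "saturation": 0, "errors": 3},  # errors = I/O error count
}

def use_findings(name, r):
    """Apply the three USE checks to one resource; thresholds are illustrative."""
    findings = []
    if r["utilization"] > 0.8:
        findings.append(f"{name}: high utilization")
    if r["saturation"] > 0:
        findings.append(f"{name}: work is queued (saturated)")
    if r["errors"] > 0:
        findings.append(f"{name}: error events present")
    return findings

report = [f for name, r in resources.items() for f in use_findings(name, r)]
```

Here the CPU is both busy and saturated (tasks are waiting), while the disk is lightly used but logging errors, which is exactly the kind of distinction the three separate checks surface.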

Service Level Indicators (SLIs) and Service Level Objectives (SLOs)

SLIs are quantitative measures of a service's reliability. SLOs are target values for those measures. Together, they define what "good" looks like for your service.

Common SLIs include:

  • Request latency: the proportion of requests completing under a threshold, such as 95 percent under 200ms.
  • Error rate: the proportion of requests that succeed, such as a 99.9 percent success rate.
  • Availability: the percentage of time the service is up, such as 99.95 percent.
  • Throughput and saturation, where relevant.

SLOs should be achievable but ambitious. A typical SLO might be 99.9 percent of requests succeed within 200ms over a 30-day rolling window. Error budgets are the amount of unreliability allowed within an SLO. For a 99.9 percent SLO, the error budget is 0.1 percent of requests. When the error budget is exhausted, slow down changes and focus on reliability.
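The error-budget arithmetic for that example SLO works out as follows (a worked calculation, not a prescription):

```python
slo = 0.999                              # 99.9 percent over a 30-day window
window_minutes = 30 * 24 * 60            # 43,200 minutes in 30 days

budget_fraction = 1 - slo                # 0.1 percent allowed unreliability
budget_minutes = window_minutes * budget_fraction  # about 43.2 minutes of full outage

# Burn tracking: if 0.07 percent of requests have failed so far this window,
# roughly 30 percent of the budget remains.
failed_fraction = 0.0007
budget_remaining = 1 - failed_fraction / budget_fraction
```

Framing reliability as about 43 minutes of outage per month makes the trade-off tangible: each risky deploy spends from a finite, measurable budget.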

Metrics by System Component

Component          | Key Metrics                                                   | What They Tell You
Application        | Request rate, error rate, latency (p95, p99)                  | User experience and service health
Database           | Query count, slow query count, connection count, cache hit ratio | Database performance and capacity
Cache (e.g. Redis) | Hit ratio, memory usage, eviction rate, command rate          | Cache effectiveness and capacity
Message Queue      | Queue length, message age, consumer lag                       | Backlog and processing delays
Load Balancer      | Request rate, backend health, latency                         | Traffic distribution and backend health
External API       | Call count, error rate, latency                               | Dependency health

What NOT to Measure

Just as important as knowing what to measure is knowing what not to measure. Avoid:

  • Vanity metrics that look good but provide no actionable information, such as total requests without context, average latency without percentiles, or uptime without user impact.
  • Metrics you never look at: if you collect it, you should have a dashboard or alert for it.
  • Overly granular, high-cardinality metrics, such as per-user or per-request metrics.
  • Metrics with no clear action: if you cannot act on a metric, it is noise.

Metrics Cardinality and Cost

Metrics with high cardinality have many unique value combinations, such as user ID, request ID, or session ID. High-cardinality metrics can become extremely expensive to store and query.

For example, a metric tagged with user ID for millions of users creates millions of unique time series. Each time series has storage and query costs. Use high-cardinality metrics sparingly and consider using logs or traces instead.

To manage cardinality, use labels with limited possible values, such as HTTP method like GET, POST, PUT, DELETE or status code class like 2xx, 4xx, 5xx. Avoid unbounded labels like user ID or email.
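The difference between bounded and unbounded labels is easy to quantify. A sketch with illustrative counts:

```python
methods = ["GET", "POST", "PUT", "DELETE"]     # bounded label: 4 values
status_classes = ["2xx", "4xx", "5xx"]         # bounded label: 3 values
endpoints = 25                                 # illustrative route count

# Bounded labels multiply to a predictable time-series count.
bounded_series = len(methods) * len(status_classes) * endpoints  # 300 series

# Adding an unbounded label such as user ID multiplies by the user count.
users = 1_000_000
unbounded_series = bounded_series * users      # 300 million series
```

Three hundred series is trivial for any metrics backend; three hundred million is a storage and query bill, which is why per-user detail belongs in logs or traces.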

Alerting on Metrics

Metrics are only useful if you act on them. Alerting turns metrics into action. Every alert should be actionable, meaning someone can do something about it. Alerts should be urgent, meaning they require immediate attention. They should be accurate, minimizing false positives. And they should be clear, explaining what is wrong and what to do.

Example of a good alert: "Payment service error rate is 5 percent, exceeding the 1 percent SLO for the last 5 minutes. Possible cause: database connection pool exhaustion. Suggested action: check database connections and restart payment service pods."

Example of a bad alert: "CPU usage is 85 percent." This alert may not require immediate action: it is not user-facing, may be normal for the workload, and does not by itself indicate a user-impacting problem.
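A sketch of a symptom-based alert rule in this spirit: fire only when the error rate has breached the SLO threshold for a sustained window, to avoid flapping on transient spikes (the window length and threshold are illustrative):

```python
def should_alert(per_minute_error_rates, threshold=0.01, window=5):
    """Fire only if every sample in the last `window` minutes breaches the threshold."""
    recent = per_minute_error_rates[-window:]
    return len(per_minute_error_rates) >= window and all(r > threshold for r in recent)

# Sustained 5 percent error rate over five minutes: alert fires.
firing = should_alert([0.002, 0.05, 0.06, 0.05, 0.052, 0.05])
# A flapping rate that dips back under the threshold: no alert.
quiet = should_alert([0.002, 0.0, 0.05, 0.0, 0.05, 0.0])
```

Requiring a sustained breach is one simple way to keep the false-positive rate down; production systems typically express the same idea as a "for" duration on an alerting rule.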

Metrics That Matter by Role

Role             | Primary Metrics                                | Secondary Metrics
Developer        | Error rate, latency (p99), request rate        | Database query time, cache hit ratio
Operations / SRE | Saturation (CPU, memory, disk), error budget   | Alert volume, incident duration
Product Manager  | User engagement, feature usage, conversion rates | Daily active users, session duration
Business         | Revenue, orders, customer satisfaction         | Page load time for business impact

Common Metrics Mistakes to Avoid

  • Using Averages: Averages hide outliers. A service with p99 latency of 10 seconds is failing 1 percent of users even if the average is 50ms. Use percentiles such as p50, p95, p99.
  • Alerting on Symptoms Instead of Causes: Alert on user-facing metrics like error rate and latency, not on CPU usage. CPU may be high without user impact.
  • No Error Budget: Without an error budget, you cannot decide when to prioritize features versus reliability.
  • Too Many Alerts: Alert fatigue causes engineers to ignore critical alerts. Keep alerts few and meaningful.
  • Ignoring Saturation: High utilization without saturation monitoring leads to unexpected capacity exhaustion.
  • No Baselines: Without historical baselines, you cannot detect anomalies. Track metrics over time.

Metrics Best Practices Summary

  • Start with RED metrics for every service: Rate, Errors, Duration
  • Add USE method for infrastructure: Utilization, Saturation, Errors
  • Track the four golden signals: Latency, Traffic, Errors, Saturation
  • Define SLIs and SLOs for user-facing services
  • Use percentiles such as p50, p95, p99, not averages
  • Alert on symptoms like error rate and latency, not causes like CPU
  • Monitor saturation to predict capacity issues before they happen
  • Keep alerts actionable, urgent, and accurate
  • Track error budgets to balance features and reliability
  • Regularly review and remove unused metrics

Frequently Asked Questions

  1. What is the difference between latency and duration?
    Latency and duration are often used interchangeably. Duration typically refers to the time a request spends inside your service. Latency includes the full time from client request to response, including network time. For user-facing metrics, use latency. For service-internal metrics, use duration.
  2. What percentile should I use for latency?
    p99 is common for user-facing services. p95 is acceptable for internal services. p99.9 is for critical services with high SLOs. Do not rely on p50 (the median) alone, as it hides the long tail.
  3. What is a good error rate SLO?
    A 99.9 percent success rate (a 0.1 percent error rate) is common for many services. Use 99.99 percent for critical services and 99 percent for internal or less critical services.
  4. What is the difference between SLI and SLO?
    An SLI is the actual measured value, such as a 99.95 percent success rate. An SLO is the target value, such as a 99.9 percent success rate. The error budget is the unreliability the SLO allows: 1 minus the SLO, or 0.1 percent in this example. With an SLI of 99.95 percent, half of that budget remains unspent.
  5. How many metrics should I collect?
    Collect as many as you need to answer key questions, but no more. Every metric has storage and query costs. Start with RED metrics and USE method, then add specific metrics for known problem areas.
  6. What should I learn next after metrics?
    After mastering metrics, explore observability, distributed tracing, logging best practices, and SLO framework for comprehensive system visibility.

Conclusion

Collecting metrics is easy. Collecting the right metrics is hard. The four golden signals, RED metrics, and USE method provide frameworks for focusing on what matters: user experience, service health, and resource capacity. Start with RED metrics for every service. Add USE method for infrastructure. Define SLIs and SLOs to measure reliability. Track percentiles, not averages. Alert on symptoms, not causes. And regularly review your metrics to remove what is not useful.

The goal is not to collect as many metrics as possible. The goal is to answer the fundamental questions: Is my system working? How well is it working? Are users affected? When something goes wrong, the right metrics point you toward the problem. When nothing is wrong, they give you confidence that your system is healthy.

To deepen your understanding, explore related topics like observability, distributed tracing, logging best practices, and SLO framework for comprehensive system visibility.