Monitoring: Observing System Health and Performance

Monitoring is the practice of collecting, analyzing, and acting on data about a system's health, performance, and availability. It answers questions like: Is the system up? How fast is it responding? Are there errors? Is there enough capacity? Monitoring uses metrics (quantitative data), logs (discrete events), and health checks to provide visibility into system state. Unlike observability, which enables exploration of unknown issues, monitoring focuses on known failure modes via alerts and dashboards. Effective monitoring reduces downtime, improves performance, and provides data for capacity planning.

To understand monitoring properly, it helps to be familiar with observability concepts, metrics, and alerting fundamentals.

Monitoring overview:
┌─────────────────────────────────────────────────────────────────────────┐
│                           Monitoring Types                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   Infrastructure                    Application                         │
│   Monitoring                        Performance Monitoring (APM)        │
│   ┌─────────────────────┐           ┌─────────────────────────────┐     │
│   │ CPU, Memory, Disk   │           │ Request rate (RPS)          │     │
│   │ Network I/O         │           │ Error rate (4xx, 5xx)       │     │
│   │ Database metrics    │           │ Response time (p50, p95)    │     │
│   │ (connections, QPS)  │           │ Transaction tracing         │     │
│   │ Server uptime       │           │ Business metrics (orders)   │     │
│   └─────────────────────┘           └─────────────────────────────┘     │
│                                                                          │
│   Log Monitoring                     Security Monitoring                │
│   ┌─────────────────────┐           ┌─────────────────────────────┐     │
│   │ Error logs          │           │ Failed logins               │     │
│   │ Access logs         │           │ Suspicious requests         │     │
│   │ Structured logs     │           │ Vulnerability scans         │     │
│   │ Log aggregation     │           │ Compliance checks           │     │
│   │ Alert on ERROR      │           │ Intrusion detection         │     │
│   └─────────────────────┘           └─────────────────────────────┘     │
│                                                                          │
│   Golden Signals (Google SRE) – Four Key Metrics:                       │
│   • Latency – Time to serve request (p99)                               │
│   • Traffic – Demand on system (RPS, concurrent users)                 │
│   • Errors – Rate of failed requests                                   │
│   • Saturation – How full is the system? (CPU, memory, disk)           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

What Is Monitoring?

Monitoring is the process of collecting, aggregating, and analyzing data from systems to understand their behavior and detect problems. It involves setting up metrics, dashboards, and alerts that notify operators when something goes wrong. Monitoring answers known questions: Is the database up? Are response times increasing? Is disk space running out? It is a reactive practice: it alerts on predefined conditions. Monitoring is essential for maintaining service level objectives (SLOs) and for reducing mean time to detection (MTTD) and mean time to recovery (MTTR). Its core building blocks are listed below, followed by a minimal instrumentation sketch.

  • Metrics: Numerical measurements over time (CPU usage, request rate, error count). Optimized for storage and query performance.
  • Logs: Structured or unstructured text records of discrete events. Used for debugging and auditing.
  • Health Checks: Endpoints that return status (200 OK or 500 error). Used by load balancers and orchestration systems.
  • Alerts: Notifications triggered when metrics exceed thresholds (e.g., CPU > 80 percent). Sent to on-call engineers, Slack, PagerDuty.
  • Dashboards: Visual representations of metrics (graphs, heatmaps, tables). Used for at-a-glance system health.
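
A minimal sketch tying the first three building blocks together in code, assuming
the prometheus_client Python library is installed (the port, paths, and metric
names are illustrative, not a production design):

import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import (Counter, Histogram, generate_latest,
                               CONTENT_TYPE_LATEST)

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        start = time.perf_counter()
        if self.path == "/healthz":
            # Health check: load balancers and orchestrators poll this endpoint.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
            REQUESTS.labels(status="200").inc()
        elif self.path == "/metrics":
            # Prometheus scrapes this endpoint on its own schedule (pull model).
            self.send_response(200)
            self.send_header("Content-Type", CONTENT_TYPE_LATEST)
            self.end_headers()
            self.wfile.write(generate_latest())
        else:
            self.send_response(404)
            self.end_headers()
            REQUESTS.labels(status="404").inc()
        LATENCY.observe(time.perf_counter() - start)  # metric: request latency

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()

Dashboards and alerts are then built on top of whatever this endpoint exposes.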

Why Monitoring Matters

Without monitoring, you are flying blind. You cannot know if your system is healthy, if users are experiencing errors, or if you need to scale.

  • Detect Problems Early (Reduce MTTD): Automated alerts notify you before users notice issues. Proactive detection reduces downtime, customer impact, and revenue loss.
  • Understand User Experience: Response time metrics, error rate monitoring, and business metrics (conversion rate, checkout completion).
  • Capacity Planning: Trend analysis shows resource usage growth (CPU, memory, disk). Predict when you need to scale (weeks or months in advance). Avoid unexpected outages.
  • Post-Incident Analysis (Root Cause): Logs, metrics, and traces help find root cause. Prevent recurrence.
  • Compliance and Auditing: Access logs, audit trails, and uptime reports for SOC2, HIPAA, PCI DSS.

Golden Signals (Google SRE):
Signal          Description                         Example Metric
─────────────────────────────────────────────────────────────────────────────
Latency         Time to serve request               p99 latency (API)
Traffic         Demand on system                    Requests per second (RPS)
Errors          Rate of failed requests             5xx responses / total requests
Saturation      How "full" is the system?           CPU utilization, memory usage

Why these four?
  • Together they provide a complete picture of system health
  • Alert on any signal deviation
  • Troubleshooting: if latency high → check saturation (CPU) → maybe scale

Example: Web service (checked in the sketch below):
  • Latency: p99 < 200ms
  • Traffic: 1000 RPS peak
  • Errors: < 0.1% 5xx
  • Saturation: CPU < 70%
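
A worked check of those targets against hypothetical sample data (the numbers
are invented for illustration; statistics.quantiles estimates the percentile):

import statistics

latencies_ms = [12, 35, 48, 90, 110, 150, 170, 180, 190, 195]  # sampled latencies
total_requests, errors_5xx = 100_000, 70
cpu_percent = 62.0

# 99th percentile; "inclusive" interpolates within the sample.
p99 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]
error_rate = errors_5xx / total_requests

print(f"latency    p99={p99:.0f}ms       -> {'OK' if p99 < 200 else 'ALERT'}")
print(f"errors     rate={error_rate:.3%} -> {'OK' if error_rate < 0.001 else 'ALERT'}")
print(f"saturation cpu={cpu_percent:.0f}%       -> {'OK' if cpu_percent < 70 else 'ALERT'}")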

Types of Monitoring

Type                                        What It Monitors                               Example Metrics
──────────────────────────────────────────────────────────────────────────────────────────────────────────
Infrastructure Monitoring                   Servers, networks, storage, databases          CPU, memory, disk I/O, network throughput
Application Performance Monitoring (APM)    Application code, transactions, dependencies   Response time, error rate, throughput, traces
Log Monitoring                              Application and system logs                    Error counts, warning patterns, audit events
Security Monitoring                         Security events, vulnerabilities, compliance   Failed logins, suspicious IPs, CVE scans
Business Monitoring                         Business KPIs (outside of technical metrics)   Orders per minute, conversion rate, active users

Monitoring Architecture

Prometheus-based monitoring stack (open source):
Application/Server
       │
       ▼ (pull metrics from /metrics)
   ┌──────────┐
   │Prometheus│ ←── Stores time-series data, evaluates alert rules
   └────┬─────┘
        │
    ┌───┴────────────────┐
    ▼                    ▼
┌──────────┐      ┌──────────────┐
│ Grafana  │      │ Alertmanager │
└──────────┘      └──────────────┘
 Dashboards        Routes alerts to
 (visualization)   Slack, PagerDuty, Email

Metrics collection:
  • Node Exporter (system metrics: CPU, memory, disk)
  • cAdvisor (container metrics)
  • Blackbox Exporter (HTTP probes, ping)
  • Application exposes /metrics (custom business metrics)
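
Because Prometheus pulls over plain HTTP, any /metrics endpoint can be inspected
by hand. A short sketch, assuming the requests library and an app listening on
localhost:8000 (both assumptions):

import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
for line in resp.text.splitlines():
    # Exposition format: "# HELP"/"# TYPE" comments, then name{labels} value
    if line.startswith("http_requests_total"):
        print(line)  # e.g. http_requests_total{status="200"} 1027.0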

Alerting rules:
  • CPU > 80% for 5 minutes
  • Service down (probe failed)
  • Error rate > 1% for 2 minutes

Grafana dashboard (example queries):
# Request rate (RPS)
rate(http_requests_total[5m])

# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes
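
The same PromQL runs outside Grafana too: Prometheus exposes an HTTP query API
at /api/v1/query. A sketch, assuming Prometheus on localhost:9090 and the
requests library:

import requests

QUERY = ('sum(rate(http_requests_total{status=~"5.."}[5m])) '
         '/ sum(rate(http_requests_total[5m]))')

resp = requests.get("http://localhost:9090/api/v1/query",
                    params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
if result:
    _timestamp, value = result[0]["value"]  # instant vector: [unix_ts, "value"]
    print(f"current error rate: {float(value):.4%}")
else:
    print("no data (query returned an empty result)")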

Setting Up Alerts

Alerts notify operators when something goes wrong. They should be actionable, not noisy (avoid false positives). Alert severity levels: critical (page immediately), warning (email/Slack), informational (dashboard only). Define runbooks for each alert (how to respond).

Alerting best practices (example rules):
# Critical alert (page)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[2m])) 
    / sum(rate(http_requests_total[2m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 5% for 2 minutes"

# Warning alert (Slack only)
- alert: HighCPU
  expr: (100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "CPU > 80% for 10 minutes"

# Informational alert (dashboard only, no page)
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Disk space below 10%"

Monitoring Anti-Patterns

  • Alert Fatigue (Too Many False Positives): Operators learn to ignore alerts and miss critical issues. Tune thresholds and require sustained conditions (e.g., CPU > 80% for 10 minutes via the for: clause). Consider anomaly detection (machine learning) for noisy signals.
  • Not Monitoring Golden Signals (Only CPU/Memory): High CPU doesn't always impact users (batch jobs). Monitor user-facing metrics (latency, errors).
  • No Runbooks for Alerts: On-call engineer receives alert, doesn't know what to do. Document runbooks (step-by-step troubleshooting).
  • Monitoring Everything (Storage Cost): High-cardinality metrics (user_id, request_id) blow up storage costs. Use metrics for aggregates and logs/traces for high-cardinality data; see the sketch after this list.
  • Black-Box Monitoring Only (No Internal Metrics): Checking external endpoint (HTTP probe) is insufficient (misses internal issues). Also monitor internal metrics (database connection pool, queue depth).
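
To make the cardinality point concrete: every distinct label combination becomes
its own time series, so one unbounded label multiplies storage. A sketch with
illustrative names, assuming prometheus_client:

from prometheus_client import Counter

# BAD: one series per user -> millions of series for a large user base.
# requests_by_user = Counter("requests_total", "Requests", ["user_id"])

# GOOD: bounded label values -> a few dozen series at most.
requests = Counter("app_requests_total", "Requests", ["method", "status"])
requests.labels(method="GET", status="200").inc()

# Rough series count = product of distinct values per label:
methods, statuses = 5, 10
print(f"bounded labels: ~{methods * statuses} series")
print("user_id label:  ~2,000,000 series for 2M users (explosion)")
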
Monitoring maturity model:
Level 1: Reactive (Manual)
  • Check logs when something breaks
  • No dashboards
  • Alerts via email (ignored)

Level 2: Basic (Dashboards)
  • Basic dashboards (CPU, memory)
  • Page on-call for critical alerts
  • Some monitoring coverage

Level 3: Proactive (SLOs)
  • Golden signals (latency, errors, traffic, saturation)
  • Service level objectives (SLOs)
  • Error budget tracking

Level 4: Predictive
  • Anomaly detection (ML)
  • Capacity forecasting
  • Automated remediation (auto-scaling)

Monitoring Best Practices

  • Monitor Golden Signals (Latency, Traffic, Errors, Saturation): Provide complete system health view. Alert on deviations from baseline. Use for capacity planning and performance optimization.
  • Use Service Level Objectives (SLOs) for Alerting: Alert when error budget is depleting, not on every error. Example: "p99 latency > 500ms for 5 minutes (error budget 50 percent consumed)".
  • Set Up On-Call Rotation (PagerDuty, Opsgenie): Ensure alerts reach someone 24/7. Escalate if not acknowledged. Use secondary on-call for backup.
  • Write Runbooks for Common Alerts: Document steps to diagnose and resolve. Runbook includes: dashboard link, log query, common fixes, and escalation path. Test runbooks periodically.
  • Use Structured Logging for Easy Querying: JSON logs with consistent fields (timestamp, level, service, trace_id). Query logs via Loki or Elasticsearch. Alert on error patterns (e.g., "ERROR" count > 10 per minute). See the logging sketch after this list.
  • Monitor Dependencies (Third-Party APIs, Databases): Database connection pool, replication lag; external API error rates, latency, and rate limiting; message queue depth (Kafka lag).
  • Regularly Review and Tune Alerts: Delete noisy alerts, adjust thresholds based on observed data, and add new alerts for newly discovered failure modes.
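
A structured-logging sketch using only the Python standard library; the field
names mirror the list above and are illustrative:

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # One JSON object per line; consistent fields make logs queryable.
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "checkout",  # assumption: set per service
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# extra= attaches fields to the record; Loki/Elasticsearch can filter on them.
logger.error("payment gateway timeout", extra={"trace_id": "abc123"})
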
Monitoring tools (open source vs commercial):
Category                Open Source                     Commercial
─────────────────────────────────────────────────────────────────────────────
Metrics                 Prometheus, VictoriaMetrics     Datadog, New Relic
Logs                    Loki, ELK Stack                 Datadog, Splunk
Traces                  Jaeger, Tempo                   Datadog, Honeycomb
All-in-One              Grafana LGTM stack              Datadog, Dynatrace
                        (Loki, Grafana, Tempo, Mimir)

Picking a tool:
  • Small team, low budget → Prometheus + Grafana + Loki
  • Enterprise, multi-cloud → Datadog (expensive but comprehensive)
  • Kubernetes-native → Prometheus + Grafana + Tempo

Frequently Asked Questions

  1. What is the difference between monitoring and observability?
    Monitoring checks predefined conditions (known failure modes). Observability enables asking arbitrary questions about system state (unknown unknowns). Monitoring is reactive; observability is exploratory. Good monitoring is a subset of observability.
  2. How often should I check metrics?
    Critical metrics (error rate, latency): every 10-30 seconds. Capacity metrics (CPU, memory): every 1-5 minutes. Logs can be streamed in real time. Choose the frequency based on how quickly an issue needs to be detected.
  3. What is the difference between a metric and a log?
    A metric is a numerical measurement over time (aggregated). A log is a discrete event with a timestamp (high cardinality). Metrics are for trend analysis and alerting; logs are for debugging root causes and auditing.
  4. How do I monitor a microservices architecture?
    Use distributed tracing (Jaeger, Tempo) for request flows. Monitor each service's golden signals. Aggregate metrics and logs centrally (Prometheus, Loki).
  5. What are the four golden signals?
    Latency (response time), Traffic (requests per second), Errors (error rate), Saturation (resource utilization). Together these four give a comprehensive view of system health.
  6. What should I learn next after monitoring?
    After mastering monitoring, explore the observability pillars (metrics, logs, traces), Service Level Objectives (SLOs) and error budgets, distributed tracing for microservices, a Prometheus deep dive, and Grafana dashboards and alerting.