Monitoring: Observing System Health and Performance
Monitoring is the practice of collecting, analyzing, and acting on data about a system's health, performance, and availability. It answers questions like: Is the system up? How fast is it responding? Are there errors? Is there enough capacity? Monitoring uses metrics (quantitative data), logs (discrete events), and health checks to provide visibility into system state. Unlike observability, which enables exploration of unknown issues, monitoring focuses on known failure modes (alerts, dashboards). It spans infrastructure monitoring, application performance monitoring (APM), log monitoring, and security monitoring. Effective monitoring reduces downtime, improves performance, and provides data for capacity planning.
To understand monitoring properly, it helps to be familiar with observability concepts, metrics, and alerting fundamentals.
┌─────────────────────────────────────────────────────────────────────┐
│ Monitoring Types                                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Infrastructure              Application                            │
│  Monitoring                  Performance Monitoring (APM)           │
│  ┌─────────────────────┐     ┌─────────────────────────────┐        │
│  │ CPU, Memory, Disk   │     │ Request rate (RPS)          │        │
│  │ Network I/O         │     │ Error rate (4xx, 5xx)       │        │
│  │ Database metrics    │     │ Response time (p50, p95)    │        │
│  │ (connections, QPS)  │     │ Transaction tracing         │        │
│  │ Server uptime       │     │ Business metrics (orders)   │        │
│  └─────────────────────┘     └─────────────────────────────┘        │
│                                                                     │
│  Log Monitoring              Security Monitoring                    │
│  ┌─────────────────────┐     ┌─────────────────────────────┐        │
│  │ Error logs          │     │ Failed logins               │        │
│  │ Access logs         │     │ Suspicious requests         │        │
│  │ Structured logs     │     │ Vulnerability scans         │        │
│  │ Log aggregation     │     │ Compliance checks           │        │
│  │ Alert on ERROR      │     │ Intrusion detection         │        │
│  └─────────────────────┘     └─────────────────────────────┘        │
│                                                                     │
│  Golden Signals (Google SRE) – Four Key Metrics:                    │
│  • Latency – Time to serve request (p99)                            │
│  • Traffic – Demand on system (RPS, concurrent users)               │
│  • Errors – Rate of failed requests                                 │
│  • Saturation – How full is the system? (CPU, memory, disk)         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
What Is Monitoring?
Monitoring is the process of collecting, aggregating, and analyzing data from systems to understand their behavior and detect problems. It involves setting up metrics, dashboards, and alerts to notify operators when something goes wrong. Monitoring answers known questions: Is the database up? Are response times increasing? Is disk space running out? It is a reactive practice (alert on predefined conditions). Monitoring is essential for maintaining service level objectives (SLOs) and reducing mean time to detection (MTTD) and mean time to recovery (MTTR).
- Metrics: Numerical measurements over time (CPU usage, request rate, error count). Optimized for storage and query performance.
- Logs: Structured or unstructured text records of discrete events. Used for debugging and auditing.
- Health Checks: Endpoints that return status (200 OK or 500 error). Used by load balancers and orchestration systems.
- Alerts: Notifications triggered when metrics exceed thresholds (e.g., CPU > 80 percent). Sent to on-call engineers, Slack, PagerDuty.
- Dashboards: Visual representations of metrics (graphs, heatmaps, tables). Used for at-a-glance system health.
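A health check endpoint can be sketched with nothing more than the standard library. The path (/healthz) and check_database() below are illustrative placeholders, not a standard; swap in whatever probes your service actually needs:

```python
# Minimal health-check endpoint sketch using only the standard library.
# check_database() is a hypothetical placeholder; replace with a real probe.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    # Placeholder: replace with a real connectivity check, e.g. SELECT 1.
    return True

def health_payload(healthy: bool):
    """Map a check result to (HTTP status, response body)."""
    return (200 if healthy else 500,
            {"status": "ok" if healthy else "degraded"})

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        status, body = health_payload(check_database())
        payload = json.dumps(body).encode()
        self.send_response(status)  # load balancers key off this status code
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

A load balancer or orchestrator polls the endpoint and takes the instance out of rotation on any non-200 response.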
Why Monitoring Matters
Without monitoring, you are flying blind. You cannot know if your system is healthy, if users are experiencing errors, or if you need to scale.
- Detect Problems Early (Reduce MTTD): Automated alerts notify you before users notice issues. Proactive detection reduces downtime, customer impact, and revenue loss.
- Understand User Experience: Track what users actually experience via response time metrics, error rate monitoring, and business metrics (conversion rate, checkout completion).
- Capacity Planning: Trend analysis shows resource usage growth (CPU, memory, disk). Predict when you need to scale (weeks or months in advance). Avoid unexpected outages.
- Post-Incident Analysis (Root Cause): Logs, metrics, and traces help find root cause. Prevent recurrence.
- Compliance and Auditing: Access logs, audit trails, and uptime reports for SOC2, HIPAA, PCI DSS.
The Four Golden Signals
Signal        Description                   Example Metric
─────────────────────────────────────────────────────────────────────────────
Latency       Time to serve a request       p99 latency (API)
Traffic       Demand on the system          Requests per second (RPS)
Errors        Rate of failed requests       5xx responses / total requests
Saturation    How "full" the system is      CPU utilization, memory usage
Why these four?
• Together they provide a complete picture of system health
• Alert on deviations in any signal
• Troubleshooting: if latency is high → check saturation (CPU) → maybe scale
Example: Web service
• Latency: p99 < 200ms
• Traffic: 1000 RPS peak
• Errors: < 0.1% 5xx
• Saturation: CPU < 70%
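The example targets above can be turned into a simple automated check. The threshold values and metric names below are just the illustrative numbers from this section, not a standard:

```python
# Sketch: evaluate golden-signal readings against the example targets above.
# Threshold values and metric names are illustrative assumptions.
THRESHOLDS = {
    "latency_p99_ms": 200,    # latency: p99 < 200 ms
    "error_rate": 0.001,      # errors: < 0.1% 5xx
    "cpu_utilization": 0.70,  # saturation: CPU < 70%
}

def violations(readings: dict) -> list:
    """Return the names of signals that exceed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if readings.get(name, 0) > limit]
```

A monitoring loop would feed current readings into violations() each interval and alert on any non-empty result.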
Types of Monitoring
| Type | What It Monitors | Example Metrics |
|---|---|---|
| Infrastructure Monitoring | Servers, networks, storage, databases | CPU, memory, disk I/O, network throughput |
| Application Performance Monitoring (APM) | Application code, transactions, dependencies | Response time, error rate, throughput, traces |
| Log Monitoring | Application and system logs | Error counts, warning patterns, audit events |
| Security Monitoring | Security events, vulnerabilities, compliance | Failed logins, suspicious IPs, CVE scans |
| Business Monitoring | Business KPIs (outside of technical metrics) | Orders per minute, conversion rate, active users |
Monitoring Architecture
Application/Server
      │
      ▼ (pull metrics: /metrics)
┌────────────┐
│ Prometheus │ ←── Stores time-series data, evaluates alert rules
└─────┬──────┘
      │
      ├─── (queries) ──► ┌─────────┐
      │                  │ Grafana │ ←── Dashboards (visualization)
      │                  └─────────┘
      ▼ (fires alerts)
┌──────────────┐
│ Alertmanager │ ←── Routes alerts (Slack, PagerDuty, Email)
└──────────────┘
Metrics collection:
• Node Exporter (system metrics: CPU, memory, disk)
• cAdvisor (container metrics)
• Blackbox Exporter (HTTP probes, ping)
• Application exposes /metrics (custom business metrics)
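For reference, the /metrics endpoint mentioned above serves plain text in the Prometheus exposition format. A real application would use an official client library; this hand-rolled sketch only shows the shape of the output:

```python
# Sketch: rendering a counter in the Prometheus text exposition format.
# In practice, use an official client library; this shows the wire format only.

def render_metrics(requests: dict) -> str:
    """requests maps (method, status) -> cumulative count."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(requests.items()):
        # One line per label combination (time series).
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"
```

Prometheus scrapes this text on each pull interval and turns every unique label combination into its own time series.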
Alerting rules:
• CPU > 80% for 5 minutes
• Service down (probe failed)
• Error rate > 1% for 2 minutes
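The "for 5 minutes" / "for 2 minutes" part of these rules is what keeps brief spikes from paging anyone. Its effect can be sketched as a streak counter over evaluation intervals:

```python
# Sketch of "for:" semantics in alerting rules: an alert fires only after its
# condition has held continuously for a given number of evaluation intervals,
# which suppresses transient spikes.

class PendingAlert:
    def __init__(self, required_intervals: int):
        self.required = required_intervals
        self.streak = 0  # consecutive intervals the condition has been true

    def evaluate(self, condition_true: bool) -> bool:
        """Call once per evaluation interval; returns True when firing."""
        self.streak = self.streak + 1 if condition_true else 0
        return self.streak >= self.required
```

With a 1-minute evaluation interval, PendingAlert(5) approximates "CPU > 80% for 5 minutes": a single noisy sample resets the streak and no page goes out.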
Common PromQL queries:
# Request rate (RPS)
rate(http_requests_total[5m])
# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes
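For intuition, rate() over a counter is roughly the per-second increase between samples. A simplified sketch (real rate() also extrapolates to the window boundaries and uses more than two samples):

```python
# Sketch: what PromQL's rate() computes for a counter, reduced to two samples.
# Real rate() extrapolates and averages over the whole window; this is only
# the core idea.

def simple_rate(v1: float, t1: float, v2: float, t2: float) -> float:
    """Per-second increase of a counter between samples (v1, t1) and (v2, t2)."""
    if v2 < v1:
        # Counter reset (e.g. process restart): assume it restarted from zero.
        v1 = 0.0
    return (v2 - v1) / (t2 - t1)
```

So a counter that goes from 100 to 400 over a 60-second window yields 5 requests per second, and resets are treated as a restart from zero rather than a negative rate.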
Setting Up Alerts
Alerts notify operators when something goes wrong. They should be actionable, not noisy (avoid false positives). Alert severity levels: critical (page immediately), warning (email/Slack), informational (dashboard only). Define runbooks for each alert (how to respond).
# Critical alert (page)
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[2m]))
/ sum(rate(http_requests_total[2m])) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "Error rate > 5% for 2 minutes"
# Warning alert (Slack only)
- alert: HighCPU
expr: (100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "CPU > 80% for 10 minutes"
# Informational alert (dashboard only, no page)
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
for: 5m
labels:
severity: info
annotations:
summary: "Disk space below 10%"
Monitoring Anti-Patterns
- Alert Fatigue (Too Many False Positives): Operators learn to ignore alerts and miss critical issues. Tune thresholds and require conditions to persist (e.g., CPU > 80% for 10 min via a `for` duration). Consider anomaly detection (machine learning).
- Not Monitoring Golden Signals (Only CPU/Memory): High CPU doesn't always impact users (batch jobs). Monitor user-facing metrics (latency, errors).
- No Runbooks for Alerts: On-call engineer receives alert, doesn't know what to do. Document runbooks (step-by-step troubleshooting).
- Monitoring Everything (Storage Cost): High cardinality metrics (user_id, request_id) blow up storage cost. Use metrics for aggregates, logs/traces for high cardinality.
- Black-Box Monitoring Only (No Internal Metrics): Checking external endpoint (HTTP probe) is insufficient (misses internal issues). Also monitor internal metrics (database connection pool, queue depth).
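The high-cardinality point above is easy to quantify: the number of time series for a metric is roughly the product of its label cardinalities, so one unbounded label dominates everything else:

```python
# Sketch: why high-cardinality labels blow up storage. Series count for a
# metric is roughly the product of its label cardinalities.
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Approximate number of time series for one metric."""
    return prod(label_cardinalities.values())

# Bounded labels stay manageable.
ok = series_count({"method": 5, "status": 6, "instance": 20})       # 600 series

# An unbounded user_id label multiplies everything by the user count.
bad = series_count({"method": 5, "status": 6, "user_id": 100_000})  # 3,000,000 series
```

This is why the guidance is to keep labels bounded (method, status, instance) and push identifiers like user_id into logs and traces instead.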
Monitoring Maturity Levels
Level 1: Reactive (Manual)
• Check logs when something breaks
• No dashboards
• Alerts via email (ignored)
Level 2: Basic (Dashboards)
• Basic dashboards (CPU, memory)
• Page on-call for critical alerts
• Some monitoring coverage
Level 3: Proactive (SLOs)
• Golden signals (latency, errors, traffic, saturation)
• Service level objectives (SLOs)
• Error budget tracking
Level 4: Predictive
• Anomaly detection (ML)
• Capacity forecasting
• Automated remediation (auto-scaling)
Monitoring Best Practices
- Monitor Golden Signals (Latency, Traffic, Errors, Saturation): Provide complete system health view. Alert on deviations from baseline. Use for capacity planning and performance optimization.
- Use Service Level Objectives (SLOs) for Alerting: Alert when error budget is depleting, not on every error. Example: "p99 latency > 500ms for 5 minutes (error budget 50 percent consumed)".
- Set Up On-Call Rotation (PagerDuty, Opsgenie): Ensure alerts reach someone 24/7. Escalate if not acknowledged. Use secondary on-call for backup.
- Write Runbooks for Common Alerts: Document steps to diagnose and resolve. Runbook includes: dashboard link, log query, common fixes, and escalation path. Test runbooks periodically.
- Use Structured Logging for Easy Querying: JSON logs with consistent fields (timestamp, level, service, trace_id). Query logs via Loki or Elasticsearch. Alert on error patterns ("ERROR" count > 10 per minute).
- Monitor Dependencies (Third-Party APIs, Databases): Database connection pool, replication lag; external API error rates, latency, and rate limiting; message queue depth (Kafka lag).
- Regularly Review and Tune Alerts: Delete noisy alerts, adjust thresholds based on observed data, and add new alerts for newly discovered failure modes.
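The structured-logging practice above can be sketched with the standard logging module. The field names (service, trace_id) are an example convention from this section, not a standard:

```python
# Sketch: JSON formatter for structured logs with consistent fields.
# Field names (service, trace_id) are an example convention.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields become top-level JSON keys via the formatter above.
logger.info("order placed", extra={"service": "checkout", "trace_id": "abc123"})
```

Because every line is a JSON object with the same keys, a log backend such as Loki or Elasticsearch can index and alert on fields like level and service directly.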
Monitoring Tools
Category     Open Source                       Commercial
─────────────────────────────────────────────────────────────────────────────
Metrics      Prometheus, VictoriaMetrics       Datadog, New Relic
Logs         Loki, ELK Stack                   Datadog, Splunk
Traces       Jaeger, Tempo                     Datadog, Honeycomb
All-in-One   Grafana Stack (LGTM: Loki,        Datadog, Dynatrace
             Grafana, Tempo, Mimir)
Picking a tool:
• Small team, low budget → Prometheus + Grafana + Loki
• Enterprise, multi-cloud → Datadog (expensive but comprehensive)
• Kubernetes-native → Prometheus + Grafana + Tempo
Frequently Asked Questions
- What is the difference between monitoring and observability?
Monitoring checks predefined conditions (known failure modes). Observability enables asking arbitrary questions about system state (unknown unknowns). Monitoring is reactive; observability is exploratory. Good monitoring is a subset of observability.
- How often should I check metrics?
Critical metrics (error rate, latency) every 10-30 seconds. Capacity metrics (CPU, memory) every 1-5 minutes. Logs can be streamed in real time. Choose a frequency based on how quickly an issue needs to be detected.
- What is the difference between a metric and a log?
A metric is a numerical measurement over time (aggregated). A log is a discrete event with a timestamp (high cardinality). Metrics are for trend analysis and alerting; logs are for debugging root causes and auditing.
- How do I monitor a microservices architecture?
Use distributed tracing (Jaeger, Tempo) for request flows. Monitor each service's golden signals. Aggregate metrics and logs centrally (Prometheus, Loki).
- What are the four golden signals?
Latency (response time), Traffic (requests per second), Errors (error rate), and Saturation (resource utilization). Together they give a comprehensive view of system health.
- What should I learn next after monitoring?
After mastering monitoring, explore the observability pillars (metrics, logs, traces), Service Level Objectives (SLOs) and error budgets, distributed tracing for microservices, a Prometheus deep dive, and Grafana dashboards and alerting.
