Observability: Understanding System Behavior from External Data

Observability is the ability to understand a system's internal state from its external outputs. It relies on three pillars: metrics (quantitative data), logs (discrete events), and traces (request paths) to enable debugging and optimization.

Observability is the ability to understand a system's internal state by analyzing its external outputs, such as metrics, logs, and traces. In modern distributed systems, failures are inevitable, and traditional monitoring (which checks known failure modes) is insufficient. Observability enables engineers to ask arbitrary questions about system behavior, debug unknown issues, and understand performance bottlenecks. It is a core practice in DevOps and Site Reliability Engineering (SRE), built on three pillars: metrics (quantitative data), logs (discrete events), and traces (request paths across services).

To understand observability properly, it helps to be familiar with distributed systems, monitoring concepts, and performance testing.

Observability overview:
┌─────────────────────────────────────────────────────────────────────────┐
│                         The Three Pillars of Observability               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   Metrics (Quantitative)           Logs (Discrete Events)               │
│   ┌───────────────────────┐        ┌───────────────────────────────┐    │
│   │ • CPU usage           │        │ • Error messages              │    │
│   │ • Request rate (RPS)  │        │ • Debug output                │    │
│   │ • Error rate          │        │ • Access logs                 │    │
│   │ • Response time (p99) │        │ • Audit events                │    │
│   │ • Queue length        │        │ • Structured (JSON) preferred │    │
│   └───────────────────────┘        └───────────────────────────────┘    │
│                                                                          │
│   Traces (Request Paths)            Correlation                         │
│   ┌───────────────────────┐        ┌───────────────────────────────┐    │
│   │ • Request ID          │        │ • Trace ID in logs            │    │
│   │ • Service A → B → C   │        │ • Metrics tagged with trace   │    │
│   │ • Span durations      │        │ • End-to-end visibility       │    │
│   │ • Error propagation   │        │ • Root cause analysis         │    │
│   └───────────────────────┘        └───────────────────────────────┘    │
│                                                                          │
│   Monitoring vs Observability:                                           │
│   • Monitoring: Are there errors? (known unknowns)                      │
│   • Observability: Why are there errors? (unknown unknowns)             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

What Is Observability?

Observability is the property of a system that allows engineers to understand its internal state by observing its external outputs. In control theory, a system is observable if its internal state can be inferred from outputs. In software engineering, observability means having the data (metrics, logs, traces) and tooling to ask arbitrary questions about system behavior. Unlike monitoring, which checks pre-defined conditions, observability enables exploration and debugging of unknown issues.

  • Metrics: Quantitative measurements over time (counters, gauges, histograms). Examples: request rate, error rate, latency percentiles, CPU usage. Optimized for storage and query performance (aggregated).
  • Logs: Discrete events with timestamps and metadata (structured text). Examples: error stack traces, user actions, system events. Best for debugging individual events (high cardinality).
  • Traces: End-to-end request paths across distributed services. Show causality and latency breakdown. Enable root cause analysis for failures and performance issues.
  • Correlation: Cross-referencing between pillars using identifiers (trace IDs, correlation IDs). Joins metrics, logs, and traces to provide complete picture.

Why Observability Matters

Distributed systems have too many moving parts to predict all failure modes. Observability enables debugging unknown unknowns.

  • Debugging Distributed Systems: Single request may span 10-20 services. Traditional logs per service cannot trace request end-to-end. Traces provide complete path and timing.
  • Root Cause Analysis: Without observability, finding root cause takes hours or days. With good instrumentation, you can pinpoint exact service and line of code.
  • Performance Optimization: Identify slowest component (database, API, cache). Traces show latency breakdown per service, per span. Metrics track trends over time.
  • Capacity Planning: Metrics show resource usage trends, request rate growth, seasonal patterns. Predict when to scale.
  • Incident Response: Dashboards for real-time visibility, automated alerts for anomalies, and correlation between metrics (e.g., error spike + CPU spike).
  • Service Level Objectives (SLOs): Measure if system meets SLOs (e.g., 99.9 percent availability, p99 latency < 200ms). Error budget tracking for release decisions.

Monitoring vs Observability:
Aspect                Monitoring                              Observability
─────────────────────────────────────────────────────────────────────────────────────────────────────────
Question Type         Known unknowns ("Is the database up?")  Unknown unknowns ("Why are requests slow?")
Data Collection       Metrics (aggregated)                    Metrics + Logs + Traces
Exploration           Fixed dashboards                        Ad-hoc querying (e.g., PromQL)
Debugging             Alerts for known issues                 Root cause for novel issues
Proactive/Reactive    Reactive (alert-based)                  Proactive (exploration)
Maturity              Entry-level                             Advanced
Tools                 Nagios, Zabbix                          Prometheus, Grafana, Jaeger

The Three Pillars in Depth

Metrics

Metrics are numerical measurements collected over time. They are optimized for storage efficiency and query performance using aggregation. Common metric types: Counter (only increases, e.g., request count), Gauge (goes up and down, e.g., CPU usage), Histogram (distribution of values, e.g., latency), and Summary (similar to histogram, client-side percentiles). Query examples: "Rate of HTTP 500 errors per minute", "p99 latency over last hour", "CPU usage by pod".

Prometheus metrics example:
# Counter (total requests)
http_requests_total{method="GET", status="200"} 12345

# Gauge (current CPU)
node_cpu_usage{core="0"} 0.45

# Histogram (latency buckets)
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 2500
http_request_duration_seconds_bucket{le="1.0"} 2900

# Query examples
rate(http_requests_total[5m])           # RPS (last 5 minutes)
histogram_quantile(0.99, ...)           # p99 latency
sum by(service) (rate(errors[5m]))      # Errors by service
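
To make the histogram_quantile query less opaque, here is a minimal Python sketch of how a quantile can be estimated from cumulative buckets by linear interpolation. It follows the general idea behind Prometheus's estimate and is not its actual implementation. The helper name estimate_quantile, the lower bound of 0 for the first bucket, and the +Inf bucket count of 3000 are assumptions added for illustration; the three finite buckets come from the example above.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket (at least two entries).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assumed lower bound of the first bucket
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The value lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Finite buckets from the example, plus an assumed +Inf bucket (3000 total).
BUCKETS = [(0.1, 1000), (0.5, 2500), (1.0, 2900), (math.inf, 3000)]

p50 = estimate_quantile(0.50, BUCKETS)  # falls inside the 0.1-0.5s bucket
p99 = estimate_quantile(0.99, BUCKETS)  # falls in +Inf, capped at 1.0s
```

With these numbers the p99 estimate is capped at 1.0 seconds, because more than 1 percent of observations fall outside the largest finite bucket. This is one reason bucket boundaries should bracket the latencies you care about.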

Logs

Logs are timestamped records of discrete events. Structured logging (JSON) is preferred over unstructured text for machine parsing. Include correlation IDs, trace IDs, and consistent fields (severity, timestamp, service name). Log levels: DEBUG (development), INFO (normal operation), WARN (unexpected but handled), ERROR (failure, needs attention). Avoid logging sensitive data (passwords, PII).

Structured log example (JSON):
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123-def456",
  "user_id": "user-789",
  "message": "Payment gateway timeout",
  "error": "connection refused",
  "duration_ms": 5000,
  "retry_count": 3
}
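
One way to produce log lines in this shape is a small formatter on top of Python's standard logging module. The sketch below is illustrative only: the class name JsonFormatter and the list of pass-through fields are assumptions, and a production setup would more likely use an existing structured-logging library.

```python
import json
import logging
from datetime import datetime, timezone

# Extra fields copied into the JSON output when present (assumed list).
CONTEXT_FIELDS = ("trace_id", "user_id", "error", "duration_ms", "retry_count")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payment-service"))
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment gateway timeout",
    extra={"trace_id": "abc123-def456", "duration_ms": 5000, "retry_count": 3},
)
```

Keeping the field names in one place makes it easier to hold them consistent across services, which is what makes the logs queryable later.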

Traces

Traces represent a request's journey through distributed services. A trace is composed of spans (individual operations). Each span has name, start time, duration, parent span ID (for hierarchy), and tags or attributes. Distributed tracing instruments libraries (HTTP clients, database drivers, message queues) to propagate context automatically.

Trace example (OpenTelemetry):
Trace ID: abc123-def456

Span 1: API Gateway (duration: 250ms)
  ├── Span 2: Auth Service (duration: 50ms)
  │     └── Span 3: Database Query (duration: 45ms)
  ├── Span 4: Order Service (duration: 150ms)
  │     ├── Span 5: Inventory Check (duration: 80ms)
  │     └── Span 6: Payment Service (duration: 60ms)
  └── Span 7: Notification Service (duration: 20ms)

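Total: 250ms (API Gateway root span). The direct children account for 220ms (50 + 150 + 20) if they run sequentially; the remaining 30ms is time spent in the gateway itself. Child spans that run in parallel can overlap, so their durations may sum to more than the parent's.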
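
The latency breakdown in this trace can be worked out mechanically. The following Python sketch models the span tree with a hypothetical Span dataclass (not the OpenTelemetry data model) and computes each span's self time, meaning its duration minus the time spent in its direct children. It assumes the children run sequentially, which the example trace does not actually state.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time spent in this span itself, assuming sequential children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span, depth=0):
    """Yield (depth, span) for the span and all of its descendants."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


trace = Span("API Gateway", 250, [
    Span("Auth Service", 50, [Span("Database Query", 45)]),
    Span("Order Service", 150, [
        Span("Inventory Check", 80),
        Span("Payment Service", 60),
    ]),
    Span("Notification Service", 20),
])

self_times = {span.name: span.self_time_ms() for _, span in walk(trace)}
slowest = max(self_times, key=self_times.get)

for depth, span in walk(trace):
    print(f"{'  ' * depth}{span.name}: {span.duration_ms}ms "
          f"(self {span.self_time_ms()}ms)")
```

Sorting by self time rather than total duration points at the operation that is actually slow, instead of at a parent that is merely waiting on its children.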

Correlation IDs and Context Propagation

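Correlation IDs link logs, metrics, and traces across services. A correlation ID is generated at the entry point (API gateway) and passed through all downstream services via HTTP headers (X-Request-ID, X-Correlation-ID). Every log line includes the correlation ID, which makes it possible to filter logs by request. Tracing systems propagate their own trace ID in the same way, and many setups simply log the trace ID so that logs and traces share one identifier. Metrics can also be tagged with a correlation ID, but a per-request label creates very high cardinality, so this is best limited to short-lived debugging.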

Context propagation example (HTTP headers):
Client Request:
  X-Request-ID: 550e8400-e29b-41d4-a716-446655440000

Service A (receives request):
  Log: {"request_id": "550e...", "message": "Received request"}
  Traces: Start span (name: "ServiceA.handleRequest")
  Calls Service B with header: X-Request-ID: 550e...

Service B (receives request):
  Log: {"request_id": "550e...", "message": "Processing"}
  Traces: Child span under parent from Service A

Result: End-to-end visibility
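
The flow above can be sketched in a few lines of Python. This is a framework-free illustration, not a real middleware: the function names accept_request and outgoing_headers are hypothetical, and headers are modeled as a plain dict with an exact-case key, whereas real HTTP headers are case-insensitive.

```python
import contextvars
import uuid

REQUEST_ID_HEADER = "X-Request-ID"

# Holds the request ID for the request currently being handled.
_request_id = contextvars.ContextVar("request_id", default=None)


def accept_request(headers):
    """Reuse the caller's request ID, or mint one at the entry point."""
    request_id = headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
    _request_id.set(request_id)
    return request_id


def outgoing_headers(extra=None):
    """Build headers for a downstream call, carrying the request ID along."""
    headers = dict(extra or {})
    request_id = _request_id.get()
    if request_id is not None:
        headers[REQUEST_ID_HEADER] = request_id
    return headers


# Service A receives a request and calls Service B with the same ID.
incoming = {REQUEST_ID_HEADER: "550e8400-e29b-41d4-a716-446655440000"}
request_id = accept_request(incoming)
downstream = outgoing_headers({"Accept": "application/json"})
```

A context variable is used here so that the ID follows the request through async code without being passed as an argument to every function.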

OpenTelemetry (OTel)

OpenTelemetry is a vendor-neutral standard for observability instrumentation, formed by merging OpenTracing and OpenCensus. It provides APIs, SDKs, and a Collector for generating, processing, and exporting telemetry data. It supports multiple languages (Java, Python, Go, JavaScript, .NET, Ruby, PHP) and can export to many backends (Prometheus, Jaeger, Datadog, New Relic, etc.). Auto-instrumentation is available for many libraries (HTTP, gRPC, database drivers, message queues).

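OpenTelemetry example (Python):
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure the SDK; without a TracerProvider, spans are no-ops.
# ConsoleSpanExporter prints spans to stdout. Swap in an OTLP exporter
# to send them to a backend such as Jaeger or Tempo.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Initialize tracer
tracer = trace.get_tracer(__name__)

# Instrument the requests library so outgoing HTTP calls create spans
# and propagate trace context automatically
RequestsInstrumentor().instrument()


def process_payment(amount):
    # Placeholder for the real business logic
    return "success"


# Create custom span
with tracer.start_as_current_span("process-payment") as span:
    # Business logic
    result = process_payment(amount=100)

    # Add attributes
    span.set_attribute("payment.amount", 100)
    span.set_attribute("payment.status", result)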

Observability Anti-Patterns

  • Logging Everything (High Volume): Too many logs (debug level in production, large payloads). High storage costs, slow queries, and inability to find relevant events. Use structured logs, sample debug logs, and rotate logs aggressively.
  • No Structured Logging: Unstructured text logs cannot be queried efficiently. Hard to extract fields (requires regex). Use JSON logs with consistent field names (timestamp, level, message, trace_id).
  • Not Propagating Trace IDs: Logs and traces cannot be correlated. Debugging requires manually stitching across services. Always propagate trace ID via headers.
  • Over-reliance on Metrics (No Logs/Traces): Metrics tell what happened, but not why. Need logs for error details and traces for latency breakdown. Use all three pillars.
  • No Sampling Strategy for Traces: Tracing every request creates high overhead. Use head-based sampling (random 1-10 percent) or tail-based sampling (sample slow/error traces). Adjust based on traffic.
  • Instrumenting Only Happy Path: Errors not instrumented (missing error logs, no spans on failure). Instrument all code paths, including error handling, retries, and fallbacks.
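
The two sampling strategies mentioned in the list above can be sketched as follows. This is an illustration under assumptions, not how any particular tracer implements sampling: the function names and the 500ms threshold are hypothetical, and the head-based decision hashes the trace ID instead of drawing a random number so that every service reaches the same decision for the same trace.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide when the trace starts, using only its ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a position in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate


def tail_sample(duration_ms: float, had_error: bool,
                slow_threshold_ms: float = 500.0) -> bool:
    """Tail-based: decide after the trace completes, keeping slow or failed ones."""
    return had_error or duration_ms > slow_threshold_ms


def keep_trace(trace_id: str, duration_ms: float, had_error: bool,
               rate: float = 0.05) -> bool:
    """Keep every interesting trace plus a small share of ordinary ones."""
    return tail_sample(duration_ms, had_error) or head_sample(trace_id, rate)
```

Tail-based sampling needs the whole trace to be buffered until it finishes, so it is usually done in a collector, not inside each service.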

Observability implementation checklist:
Metrics:
□ Instrument key metrics (RPS, error rate, latency)
□ Use histograms for latency (p50, p95, p99)
□ Add labels/tags for grouping (service, endpoint, status)
□ Set up dashboards (Grafana) and alerts

Logs:
□ Use structured logging (JSON)
□ Include correlation ID (request_id, trace_id)
□ Log at appropriate levels (info, warn, error)
□ Avoid logging sensitive data (redact PII)

Traces:
□ Instrument HTTP clients and servers
□ Propagate trace context (W3C Trace Context)
□ Sample appropriate percentage (1-10%)
□ Keep slow (> threshold) and failed requests regardless of the sampling rate (requires tail-based sampling)

Infrastructure:
□ Deploy OpenTelemetry Collector
□ Send aggregated metrics to Prometheus (or cloud)
□ Ship centralized logs to Loki (or cloud)
□ Export traces to Jaeger (or cloud)

Observability Tools

Pillar      Open Source                                         Commercial
─────────────────────────────────────────────────────────────────────────────────────────────
Metrics     Prometheus, VictoriaMetrics, Thanos                 Datadog, New Relic, Dynatrace
Logs        Loki, Elasticsearch (ELK), OpenSearch               Datadog, Splunk, Logz.io
Traces      Jaeger, Tempo, Zipkin                               Datadog, Honeycomb, Lightstep
All-in-One  OpenTelemetry + Grafana Stack (Mimir, Loki, Tempo)  Datadog, New Relic, Dynatrace

Grafana stack (LGTM - Loki, Grafana, Tempo, Mimir):
Mimir (Metrics)     ← Prometheus remote write
Loki (Logs)         ← Promtail or OpenTelemetry
Tempo (Traces)      ← OpenTelemetry OTLP
Grafana (UI)        ← Unified querying (Metrics, Logs, Traces)

Advantages:
  • Single UI for all three pillars
  • Correlate metrics → logs → traces
  • Open source, cloud-native
  • Scalable (object storage backend)

Observability Best Practices

  • Instrument Early (Shift Left): Add observability from day one (not after incidents). Instrument during feature development, not as afterthought. Test observability code (e.g., trace spans created).
  • Use OpenTelemetry (Vendor-Neutral): Avoid vendor lock-in; write instrumentation once, export to any backend; widely supported across languages. Future-proof your observability investment.
  • Define Service Level Indicators (SLIs): SLIs: what to measure (latency, error rate, throughput). SLOs: target values (p99 < 200ms). Error Budget: how much failure allowed.
  • Alert on Symptoms, Not Causes: Alert on user-facing symptoms such as a high error rate or rising latency. Avoid paging on potential causes such as a CPU spike, which may not affect users at all. Investigate the underlying causes once a symptom alert fires.
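  • High Cardinality for Debugging: Capture fields such as user_id, request_id, device type, and region. High-cardinality data is expensive but invaluable for debugging. In label-based metric stores such as Prometheus, each distinct label value creates a new time series, so keep unbounded values like user_id and request_id on logs and trace attributes and out of metric labels. Use traces for the highest cardinality.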
  • Store Telemetry Data Long Enough: Metrics: 30-90 days (trending). Logs: 15-30 days (debugging). Traces: 7-14 days (debugging recent incidents). Balance cost vs. value.

SLI/SLO definition example:
Service: Payment API

SLIs (Service Level Indicators):
  • Latency: 99th percentile response time
  • Availability: successful responses / total responses
  • Throughput: requests per second
  • Error rate: HTTP 5xx responses / total responses

SLOs (Service Level Objectives):
  • Latency SLO: 99% of requests complete in < 300ms, measured per month
  • Availability SLO: 99.9% uptime per month
  • Error rate SLO: HTTP 5xx responses < 0.1% per month

Error Budget:
  • 1 - SLO = available error tolerance
  • Example: 0.1% errors allowed per month (~43 minutes downtime/month)
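
The error budget arithmetic can be checked with a short Python sketch. The function names and the request counts in the usage lines are made up for illustration; the 99.9% SLO and the 30-day window match the example above.

```python
def error_budget_fraction(slo: float) -> float:
    """Share of requests (or time) that may fail without breaching the SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window, in minutes."""
    return error_budget_fraction(slo) * window_days * 24 * 60


def budget_remaining(total_requests: int, failed_requests: int,
                     slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * error_budget_fraction(slo)
    return 1.0 - failed_requests / allowed_failures


downtime = allowed_downtime_minutes(0.999)           # about 43.2 minutes
remaining = budget_remaining(1_000_000, 400, 0.999)  # about 0.6 (60% left)
```

A team might slow down releases as the remaining budget approaches zero, which is the release decision that error budget tracking is meant to support.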

Frequently Asked Questions

  1. What is the difference between monitoring and observability?
    Monitoring checks known failure modes (pre-defined dashboards, alerts). Observability enables asking arbitrary questions about system state (exploration). Monitoring tells you something is wrong; observability tells you why.
  2. Do I need observability for small applications?
    Basics (metrics + logs) are sufficient for simple apps. Full observability (including traces) becomes valuable as complexity grows, for example once a request crosses more than about three services. Start with metrics and structured logs; add tracing when debugging becomes painful.
  3. What is the difference between tracing and logging?
    Logs are discrete events (timestamp + message). Traces represent end-to-end request paths across services (parent-child relationships). Logs are good for debugging individual events; traces are good for understanding distributed request flow.
  4. How much overhead does observability add?
    Metrics: usually minimal (often a few percent of CPU, since values are aggregated in-process). Logs: depends on volume (structured logging adds serialization overhead). Traces: overhead scales with the sampling rate, and 1-10 percent is a common starting point. Test in staging and tune sampling rates.
  5. What is the difference between OpenTelemetry and Prometheus?
    OpenTelemetry is a standard for generating and collecting telemetry data (metrics, logs, traces). Prometheus is a metrics storage and querying system. OpenTelemetry can export to Prometheus. They are complementary, not competing.
  6. What should I learn next after observability?
    After mastering observability, explore distributed tracing deep dive (Jaeger, Tempo), Prometheus metrics and alerting, OpenTelemetry instrumentation, Service Level Objectives (SLOs) and error budgets, and Grafana dashboards and visualization.