Distributed Tracing: Track Requests Across Microservices

Distributed tracing is a method used to track the journey of requests as they flow through distributed systems or microservices architectures. In modern applications, a single user action can trigger a cascade of requests across numerous independent services, databases, and external APIs. Distributed tracing provides visibility into this complex web of interactions, helping developers identify performance bottlenecks, troubleshoot errors, and optimize system performance.

As applications evolve from monolithic architectures to distributed microservices, traditional debugging methods become insufficient. A single request might traverse dozens of services, making it nearly impossible to understand where delays or failures occur without proper tracing. To understand distributed tracing properly, it is helpful to be familiar with microservices architecture, REST APIs, and observability concepts.

What you will learn in this tutorial:
✓ What distributed tracing is and why it matters
✓ Core concepts: traces, spans, and context propagation
✓ How tracing works across microservices
✓ W3C Trace Context standard and headers
✓ OpenTelemetry and tracing tools (Jaeger, Datadog)
✓ Sampling strategies and best practices
✓ Implementing tracing in your applications

What Is Distributed Tracing

Distributed tracing is a methodology that follows, analyzes, and debugs a transaction across multiple software components. It tracks the complete end-to-end path of a request as it flows through a distributed system, representing the journey of a specific operation as it traverses the various components and services of a distributed architecture.

Unlike traditional monitoring that looks at individual services in isolation, distributed tracing connects the dots between services. It shows you exactly how a request moves from one service to another, how long each operation takes, and where errors or delays occur. This end-to-end visibility is essential for understanding system behavior in microservices environments.

Distributed tracing in simple terms:
User Request → API Gateway → Auth Service → User Service → Database → Response
                    │              │              │            │
                    ▼              ▼              ▼            ▼
                Span 1          Span 2          Span 3       Span 4
                    │              │              │            │
                    └──────────────┴──────────────┴────────────┘
                                   │
                                   ▼
                          Complete Trace (Trace ID)
                          Shows timing and dependencies

Why Distributed Tracing Matters

In monolithic applications, debugging was straightforward. You could look at logs, find the error, and fix it. In microservices, a single request can involve dozens of services running on different servers, in different data centers, written by different teams. Distributed tracing provides the visibility needed to understand these complex interactions.

  • Faster Issue Detection: Identify exactly which service is causing slow responses or errors without guessing.
  • Improved Application Performance: Pinpoint performance bottlenecks and optimize the slowest parts of the request flow.
  • Enhanced Visibility: Understand how services interact and depend on each other in complex systems.
  • Better Collaboration: Different teams can see how their services contribute to the overall request flow.
  • SLA Compliance: Monitor latency, error rates, and throughput across services to ensure service level agreements are met.
  • Root Cause Analysis: Quickly find the source of errors or unexpected behavior in distributed systems.

Core Concepts of Distributed Tracing

Traces

A trace is the complete end-to-end path of a request as it flows through a distributed system. It represents the journey of a specific operation as it traverses various components and services. Each trace has a unique identifier (Trace ID) that stays consistent across all services involved.

Trace visualization example:
Trace ID: 0af7651916cd43dd8448eb211c80319c

Timeline:
──|───────|───────|───────|───────|───────|───────|───────|──→ time

[API Gateway··············································]
   [Auth Service·····································]
      [User Service······························]
         [Database Query··]
      [Response Processing·]
   [Response·············]

Spans

A span represents a single operation or unit of work within a trace. Each span has a start time, end time, duration, operation name, and metadata. Spans can have parent-child relationships, creating a hierarchy that shows how operations are nested or called sequentially.

  • Root Span: The first span in a trace, representing the initial request.
  • Child Span: A span created as a result of a parent span, representing a sub-operation.
  • Span Context: Contains identifiers (Trace ID, Span ID) and other metadata passed between services.
  • Span Tags/Attributes: Key-value pairs that add context (HTTP method, status code, error details).
  • Span Events/Logs: Timestamped records within a span for specific occurrences.
Span structure example:
Span:
  Operation Name: HTTP GET /api/users
  Start Time: 10:30:00.000
  End Time: 10:30:00.150
  Duration: 150ms
  Tags:
    - http.method: GET
    - http.status_code: 200
    - service.name: user-service
  Child Spans:
    - SQL SELECT (duration: 50ms)
    - Redis GET (duration: 20ms)

Trace Context Propagation

For distributed tracing to work, trace information must be passed from one service to another. This is called context propagation. The W3C Trace Context standard defines a universally agreed-upon format for exchanging trace context data, ensuring interoperability across different tracing vendors and platforms.

W3C traceparent header example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
              │  │                                 │                 │
              │  │                                 │                 └── Flags
              │  │                                 └── Span ID (8 bytes / 16 hex chars)
              │  └── Trace ID (16 bytes / 32 hex chars)
              └── Version (00)

tracestate: congo=t61rcWkgMzE
(vendor-specific data; the separate baggage header carries arbitrary key-value pairs)
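
The traceparent layout above can be unpacked with a few lines of plain Python. This is a sketch for illustration only; real services should rely on an OpenTelemetry propagator rather than parsing headers by hand:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,                     # currently always "00"
        "trace_id": trace_id,                   # 16 bytes as 32 hex chars
        "parent_span_id": span_id,              # 8 bytes as 16 hex chars
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 of the flags byte
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["trace_id"])  # 0af7651916cd43dd8448eb211c80319c
print(ctx["sampled"])   # True
```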

How Distributed Tracing Works

Distributed tracing follows a consistent process that spans from request initiation to data visualization. Here is how it works step by step:

  1. Trace ID Assignment: When a request enters the system, a unique trace ID is assigned. This ID remains consistent throughout the request's journey.
  2. Span Creation: Each service that processes the request creates a span, recording start time, end time, operation name, and metadata.
  3. Context Propagation: Trace and span identifiers are passed between services via HTTP headers or gRPC metadata.
  4. Data Collection: All spans are sent to a tracing backend (collector) for storage and analysis.
  5. Trace Reconstruction: Spans are assembled into complete traces using their parent-child relationships.
  6. Visualization: Traces are displayed in a UI for analysis, showing timing, dependencies, and errors.
Request flow with tracing:
Client Request
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│ API Gateway                                                  │
│ • Creates root span                                          │
│ • Injects traceparent header into request                    │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│ Auth Service                                                 │
│ • Extracts traceparent header                                │
│ • Creates child span                                         │
│ • Propagates context to next service                         │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│ User Service                                                 │
│ • Extracts traceparent header                                │
│ • Creates child span                                         │
│ • Records database query spans                               │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
All spans sent to tracing backend (Jaeger, Datadog, etc.)
    │
    ▼
Trace reconstructed and visualized
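
Step 5, trace reconstruction, amounts to grouping spans by trace ID and linking them through their parent span IDs. A minimal sketch with hypothetical span records (real backends do this at scale with persistent storage):

```python
def reconstruct_trace(spans):
    """Group spans by parent span ID and attach children to build the tree."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    def attach(span):
        span["children"] = [attach(c) for c in children.get(span["span_id"], [])]
        return span

    root = children[None][0]  # the root span is the one with no parent
    return attach(root)

# Hypothetical span records, as a backend might receive them out of order
spans = [
    {"span_id": "a1", "parent_id": None, "name": "API Gateway"},
    {"span_id": "b2", "parent_id": "a1", "name": "Auth Service"},
    {"span_id": "c3", "parent_id": "b2", "name": "User Service"},
]
tree = reconstruct_trace(spans)
print(tree["name"], "->", tree["children"][0]["name"])  # API Gateway -> Auth Service
```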

W3C Trace Context Standard

The W3C Trace Context specification defines a standardized format for propagating trace context across services. Before this standard, each tracing vendor had its own format, causing interoperability problems. Today, W3C Trace Context is the industry standard, supported by all major tracing tools and cloud platforms.

Header       Purpose                                                    Format
traceparent  Carries trace ID, span ID, and flags                       00-{trace-id}-{span-id}-{flags}
tracestate   Carries vendor-specific trace information                  vendor_key=value (comma separated)
baggage      Carries arbitrary key-value pairs for context propagation  key1=value1,key2=value2
HTTP header propagation example:
GET /api/users HTTP/1.1
Host: user-service.example.com
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
baggage: userId=cassie,serverNode=DF%2028
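
In practice you rarely build these headers by hand; OpenTelemetry's propagation API injects and extracts them for you. A sketch of what one hand-rolled hop would look like, purely for illustration (forwarding the parent's sampled flag unchanged is a simplification):

```python
import secrets

def new_child_headers(incoming: dict) -> dict:
    """Propagate the trace ID from an incoming request, minting a fresh
    span ID for the outgoing call, per the W3C traceparent format."""
    version, trace_id, _parent_span, flags = incoming["traceparent"].split("-")
    child_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    headers = {"traceparent": f"{version}-{trace_id}-{child_span_id}-{flags}"}
    if "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]  # forwarded unchanged
    return headers

out = new_child_headers({
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "tracestate": "congo=t61rcWkgMzE",
})
# The trace ID stays constant across the hop; only the span ID changes
```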

OpenTelemetry: The Unified Standard

OpenTelemetry (OTel) is an open-source observability framework that provides a single set of APIs, libraries, and SDKs for collecting telemetry data (traces, metrics, logs). It was formed by the merger of OpenTracing and OpenCensus and is now the industry standard for distributed tracing.

  • Vendor-Agnostic: Write instrumentation once, send to any backend (Jaeger, Datadog, Zipkin, etc.).
  • Standardized APIs: Consistent APIs across different programming languages.
  • Automatic Instrumentation: Libraries for common frameworks (HTTP, gRPC, databases, messaging).
  • Context Propagation: Built-in support for W3C Trace Context and other propagators.
  • Active Community: Supported by major cloud providers and observability vendors.
OpenTelemetry architecture:
┌─────────────────────────────────────────────────────────────┐
│                    Instrumented Application                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │OpenTelemetry│  │OpenTelemetry│  │OpenTelemetry│         │
│  │    SDK      │  │    SDK      │  │    SDK      │         │
│  │   (Trace)   │  │   (Metric)  │  │   (Log)     │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                 │
│         └────────────────┼────────────────┘                 │
│                          │                                  │
│                    ┌─────▼─────┐                            │
│                    │  OTLP     │                            │
│                    │ Exporter  │                            │
│                    └─────┬─────┘                            │
└──────────────────────────┼──────────────────────────────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │  Collector      │
                  │  (Optional)     │
                  └────────┬────────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │  Jaeger  │    │ Datadog  │    │  Zipkin  │
    └──────────┘    └──────────┘    └──────────┘

Sampling Strategies

Distributed tracing can generate massive amounts of data. In high-traffic systems, tracing every request can be prohibitively expensive. Sampling is the practice of selectively capturing a subset of traces to balance visibility with cost and performance.

Head-Based Sampling

In head-based sampling, the decision to sample is made at the beginning of the trace (the root span). This decision is then propagated downstream through the trace context. All spans in the trace share the same sampling decision, ensuring complete traces.

  • Pros: Efficient, simple to implement, guarantees complete traces.
  • Cons: May miss errors or high-latency traces that aren't known at the start.
  • Use Case: General purpose, high-volume systems.
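
Head-based ratio sampling is typically implemented as a deterministic function of the trace ID, so every service independently reaches the same keep-or-drop decision without coordination. A rough sketch of the idea (modeled loosely on OpenTelemetry's TraceIdRatioBased sampler; the exact threshold arithmetic in real SDKs differs):

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces by comparing the low 64 bits of the
    trace ID against a threshold; deterministic for a given trace ID."""
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    threshold = int(ratio * (1 << 64))
    return low_bits < threshold

# Every service computes the same answer for the same trace ID,
# so the decision made at the root holds across the whole trace.
tid = "0af7651916cd43dd8448eb211c80319c"
print(should_sample(tid, 1.0))  # True: ratio 1.0 keeps every trace
print(should_sample(tid, 0.0))  # False: ratio 0.0 drops every trace
```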

Tail-Based Sampling

In tail-based sampling, the decision is made after the entire trace is complete. This allows sampling decisions to be based on the trace's outcome (error, latency, specific attributes). However, it requires buffering spans until the trace is complete.

  • Pros: Captures all error traces and high-latency requests.
  • Cons: Requires additional infrastructure, more complex to implement.
  • Use Case: Critical systems where error traces must never be missed.
Sampling comparison:
Head-Based Sampling:
Decision at root span → propagated downstream → complete or not at all
Pros: Simple, complete traces
Cons: May miss errors

Tail-Based Sampling:
All spans collected → decision after trace complete → keep only important ones
Pros: Captures all errors and slow requests
Cons: Complex, requires buffering

Recommended approach: Head-based sampling + retention filters for errors

RED Metrics

RED metrics are key indicators of service health derived from trace data. They provide high-level insights into application performance without examining individual traces.

Metric    Description                                    Example
Rate      Number of requests per second                  100 requests/second
Errors    Number of failed requests per second           2 errors/second (2% error rate)
Duration  Time requests take to complete (distribution)  p95: 150ms, p99: 300ms

Unlike trace data that may be sampled, RED metrics are typically calculated from 100% of traffic, providing accurate, reliable insights for dashboards, alerts, and service level objectives (SLOs).
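
Deriving RED metrics from a window of request records is a small computation. A sketch over an in-memory list (the field names and the nearest-rank percentile choice are illustrative):

```python
def red_metrics(requests, window_seconds):
    """Derive Rate, Errors, and Duration percentiles from request records.
    `requests` is a list of dicts with `duration_ms` and `status` keys."""
    durations = sorted(r["duration_ms"] for r in requests)

    def pct(p):
        # nearest-rank percentile using integer arithmetic
        idx = (p * len(durations) + 99) // 100 - 1
        return durations[idx]

    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate_rps": errors / window_seconds,
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }

# 100 requests in a 1-second window, every 50th one a server error
reqs = [{"duration_ms": 10 * i, "status": 500 if i % 50 == 0 else 200}
        for i in range(1, 101)]
print(red_metrics(reqs, window_seconds=1.0))
# {'rate_rps': 100.0, 'error_rate_rps': 2.0, 'p95_ms': 950, 'p99_ms': 990}
```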

Popular Distributed Tracing Tools

Tool         Type         Key Features
Jaeger       Open source  CNCF project originated at Uber; supports OpenTelemetry
Datadog APM  Commercial   Head-based sampling, trace metrics, full-stack observability
Dynatrace    Commercial   AI-powered root cause analysis, automatic instrumentation
New Relic    Commercial   Distributed tracing, service maps, integration with logs and metrics
Honeycomb    Commercial   High-cardinality data, real-time analysis, anomaly detection
Zipkin       Open source  Originated at Twitter, inspired by Google's Dapper paper; simple and lightweight

Implementing Distributed Tracing

Implementing distributed tracing involves instrumenting your applications to generate and propagate trace data. Here is a high-level approach:

  1. Choose a Tracing Backend: Select Jaeger (open source), Datadog, or another tool.
  2. Instrument Services: Use OpenTelemetry SDKs to add tracing to your applications.
  3. Configure Propagation: Ensure trace context is propagated between services via HTTP headers.
  4. Add Custom Spans: Instrument important operations (database queries, API calls, business logic).
  5. Set Up Sampling: Configure sampling rates based on your traffic volume and budget.
  6. Visualize and Alert: Use the tracing UI to analyze traces and set up alerts on RED metrics.
OpenTelemetry instrumentation example (Python):
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure an OTLP exporter pointed at Jaeger (4317 is the OTLP/gRPC port)
otlp_exporter = OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask: inbound requests get a server span automatically
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Create custom spans for operations worth timing individually
@app.route('/api/users')
def get_users():
    with tracer.start_as_current_span("get-users-from-db") as span:
        span.set_attribute("db.system", "postgresql")
        users = db.query("SELECT * FROM users")  # `db` is your app's database client
        return users

Distributed Tracing vs Logging vs Metrics

Aspect       Tracing                                      Logging                                    Metrics
Purpose      Track request flow across services           Record discrete events                     Measure aggregated data
Granularity  Request-level                                Event-level                                Aggregated (counts, rates)
Data Volume  Medium to High                               High                                       Low to Medium
Use Case     Find bottlenecks, debug distributed systems  Detailed debugging, error investigation    Monitoring, alerting, capacity planning
Example      "User request took 2 seconds, database       "User 123 failed to authenticate           "99th percentile latency is 150ms"
             query was slow"                              at 10:30:05"

Common Distributed Tracing Mistakes to Avoid

  • Not Propagating Context: The most common mistake. Trace context must be passed between services via headers; otherwise, traces are broken.
  • Over-Sampling: Tracing every request in high-traffic systems creates massive data volumes and costs.
  • Under-Sampling: Sampling too aggressively may miss critical error traces.
  • No Custom Spans: Auto-instrumentation provides basic spans, but you need custom spans for business logic.
  • Incomplete Instrumentation: All services in the request path must be instrumented for complete traces.
  • Missing Span Tags: Without useful tags, traces are hard to search and analyze.
  • No Sampling Strategy: Without a strategy, you either pay too much or miss important data.

Distributed Tracing Best Practices

  • Use OpenTelemetry: Standardize on OpenTelemetry for vendor-agnostic instrumentation.
  • Propagate Context Everywhere: Ensure all services propagate trace context, including proxies, message queues, and databases.
  • Add Business Context as Tags: Include user IDs, order IDs, or tenant IDs as span tags for better filtering.
  • Use Consistent Naming: Follow naming conventions for spans (e.g., `HTTP GET /users`, `DB SELECT users`).
  • Sample Strategically: Use head-based sampling for general visibility and tail-based sampling for error capture.
  • Integrate with Logs: Include trace IDs in your logs to correlate log entries with traces.
  • Monitor Tracing Overhead: Tracing adds CPU and memory overhead; monitor and optimize sampling rates.
Trace ID injection into logs (Python):
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        # get_current_span() returns a non-recording span with an invalid
        # context when no trace is active, so check the context, not the span
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            record.trace_id = format(ctx.trace_id, '032x')
            record.span_id = format(ctx.span_id, '016x')
        else:
            record.trace_id = 'no-trace'
            record.span_id = 'no-span'
        return True

logging.basicConfig(
    format='%(asctime)s [%(trace_id)s] %(levelname)s: %(message)s'
)
logging.getLogger().addFilter(TraceIdFilter())

# Now every log entry includes the trace ID for correlation

Frequently Asked Questions

  1. What is the difference between distributed tracing and logging?
    Distributed tracing tracks the flow of a request across services, showing timing and dependencies. Logging records discrete events at specific points. Tracing provides the "map" of the journey; logs provide the "details" at each stop.
  2. What is a trace ID?
    A trace ID is a unique identifier assigned to a request when it enters the system. It remains consistent across all services the request traverses, allowing spans to be grouped into a complete trace.
  3. What is a span?
    A span is a single operation within a trace. It represents a unit of work, such as an HTTP request, database query, or function call, and includes start time, end time, duration, and metadata.
  4. What is context propagation?
    Context propagation is the mechanism of passing trace and span identifiers between services, typically via HTTP headers or gRPC metadata. It ensures that traces remain continuous across service boundaries.
  5. What is the difference between head-based and tail-based sampling?
    Head-based sampling decides to keep or drop a trace at the beginning. Tail-based sampling waits until the trace is complete to decide. Head-based is simpler; tail-based ensures errors are always captured.
  6. What should I learn next after distributed tracing?
    After mastering distributed tracing, explore observability, OpenTelemetry, metrics and monitoring, and microservices observability for comprehensive system visibility.