Distributed Tracing — Deep Dive

Level: Intermediate
Pre-reading: 07 · Observability

What is Distributed Tracing?

Distributed tracing tracks a request as it flows through multiple services, showing where time is spent.

graph LR
    subgraph Trace abc123
        S1[Gateway - 10ms] --> S2[Order - 45ms]
        S2 --> S3[Payment - 200ms]
        S2 --> S4[Inventory - 30ms]
    end

Tracing Concepts

Concept	Description
Trace	End-to-end journey of a request
Span	Single operation within a trace
Trace ID	Unique identifier shared by every span in the same trace
Span ID	Unique identifier for each span
Parent Span ID	Span ID of the parent span; absent on the root span
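
How these identifiers fit together can be sketched in plain Java. The `Span` record and `SpanTree` class below are hypothetical illustrations, not types from any tracing library; the span IDs (`s1` to `s4`) are made up, and the names and timings are the ones from the diagram above. The sketch rebuilds the call tree of a trace purely from parent span IDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical span model for illustration; real SDK spans carry many more fields.
record Span(String traceId, String spanId, String parentSpanId, String name, long durationMs) {}

final class SpanTree {

    // Groups spans by their parent span ID. The root span has no parent (null).
    static Map<String, List<Span>> childrenByParent(List<Span> spans) {
        Map<String, List<Span>> children = new HashMap<>();
        for (Span s : spans) {
            if (s.parentSpanId() != null) {
                children.computeIfAbsent(s.parentSpanId(), k -> new ArrayList<>()).add(s);
            }
        }
        return children;
    }

    // Renders the trace as indented lines, one per span, starting from the root.
    static String render(List<Span> spans) {
        Map<String, List<Span>> children = childrenByParent(spans);
        StringBuilder out = new StringBuilder();
        for (Span s : spans) {
            if (s.parentSpanId() == null) {
                append(out, s, children, 0);
            }
        }
        return out.toString();
    }

    private static void append(StringBuilder out, Span span,
                               Map<String, List<Span>> children, int depth) {
        out.append("  ".repeat(depth))
           .append(span.name()).append(" (").append(span.durationMs()).append("ms)\n");
        for (Span child : children.getOrDefault(span.spanId(), List.of())) {
            append(out, child, children, depth + 1);
        }
    }
}
```

Calling `SpanTree.render` on the four spans of trace `abc123` yields Gateway at the top, Order indented beneath it, and Payment and Inventory indented beneath Order. Every span carries the same trace ID, and only the parent span ID determines where it sits in the tree.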

Context Propagation

Trace context crosses service boundaries in request metadata. For HTTP, it travels in headers, and two formats are common:

W3C Trace Context (Standard)

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor=value
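
The `traceparent` value has four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and trace flags (the lowest bit means "sampled"). The following is a minimal sketch of parsing it, assuming version `00` only; `TraceParent` is a hypothetical type written for this example, not an OpenTelemetry class, and in practice the SDK's propagator does this work.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value type for illustration; handles only version 00 of the header.
record TraceParent(String version, String traceId, String parentId, boolean sampled) {

    // version-traceid-parentid-flags, all lowercase hex.
    private static final Pattern FORMAT =
            Pattern.compile("^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        String traceId = m.group(2);
        String parentId = m.group(3);
        // All-zero trace or parent IDs are invalid.
        if (isAllZero(traceId) || isAllZero(parentId)) {
            throw new IllegalArgumentException("All-zero ID in traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(4), 16);
        return new TraceParent(m.group(1), traceId, parentId, (flags & 0x01) != 0);
    }

    // Context for an outgoing call: same trace ID, with the caller's span ID as the new parent.
    TraceParent withParent(String callerSpanId) {
        return new TraceParent(version, traceId, callerSpanId, sampled);
    }

    // Serializes back to header form. Only the sampled bit is preserved in this sketch.
    String toHeader() {
        return version + "-" + traceId + "-" + parentId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

Parsing the example header above gives trace ID `0af7651916cd43dd8448eb211c80319c`, parent ID `b7ad6b7169203331`, and `sampled = true`. When a service makes a downstream call, it keeps the trace ID and substitutes its own span ID as the parent, which is what `withParent` models.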

B3 (Zipkin)

X-B3-TraceId: 463ac35c9f6413ad48485a3953bb6124
X-B3-SpanId: a2fb4a1d1a96d312
X-B3-ParentSpanId: 0020000000000001
X-B3-Sampled: 1
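
The two formats carry largely the same information, so one can be mapped onto the other. The sketch below converts B3 multi-header values into a `traceparent` string. `B3ToTraceParent` is a hypothetical helper for this example; it assumes the header map uses exactly the casing shown above (real HTTP header names are case-insensitive) and treats a missing `X-B3-Sampled` as not sampled. Note that `X-B3-ParentSpanId` has no counterpart in `traceparent`, because the parent field there holds the caller's own span ID.

```java
import java.util.Map;

// Hypothetical converter for illustration; propagators in tracing SDKs normally do this.
final class B3ToTraceParent {

    static String convert(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        String spanId = headers.get("X-B3-SpanId");
        if (traceId == null || spanId == null) {
            throw new IllegalArgumentException("X-B3-TraceId and X-B3-SpanId are required");
        }
        // B3 permits 64-bit (16 hex) trace IDs; traceparent needs 128-bit, so left-pad with zeros.
        if (traceId.length() == 16) {
            traceId = "0".repeat(16) + traceId;
        }
        // Assumption: an absent sampling header is treated as "not sampled".
        boolean sampled = "1".equals(headers.get("X-B3-Sampled"));
        return "00-" + traceId.toLowerCase() + "-" + spanId.toLowerCase()
                + "-" + (sampled ? "01" : "00");
    }
}
```

Applied to the B3 headers above, this produces `00-463ac35c9f6413ad48485a3953bb6124-a2fb4a1d1a96d312-01`.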

OpenTelemetry

OpenTelemetry (OTel) is the CNCF project that standardizes how traces, metrics, and logs are generated and exported: vendor-neutral APIs, SDKs, and the OTLP wire protocol.

graph LR
    A[Application + OTel SDK] --> C[OTel Collector]
    C --> J[Jaeger]
    C --> P[Prometheus]
    C --> L[Loki]

Spring Boot Integration

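<!-- Bridges Micrometer Tracing to OpenTelemetry; pulls in opentelemetry-api transitively -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<!-- Exports spans over OTLP; needed for the management.otlp.tracing.endpoint setting -->
<!-- spring-boot-starter-actuator must also be on the classpath for management.* tracing properties -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>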

management:
  tracing:
    sampling:
      probability: 1.0  # 100% sampling (reduce in prod)
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

Sampling Strategies

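Strategy	Description	Trade-off
Head-based	Decide at request entry, before the outcome is known	Simple and cheap; may drop traces that later contain errors
Tail-based	Decide after seeing the full trace	Can keep every error or slow trace; needs buffering and is more complex to run
Rate limiting	Sample at most N traces per second	Caps cost regardless of traffic spikes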
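
The head-based and rate-limiting rows can be sketched as two small classes. Both are hypothetical illustrations, not library code. `RatioSampler` derives its decision from the trace ID, similar in spirit to OpenTelemetry's trace-ID-ratio sampler but not its implementation, so every service reaches the same verdict for the same trace. `RateLimitingSampler` uses a simple fixed one-second window rather than a token bucket, and takes the clock value as a parameter to keep it testable.

```java
// Hypothetical head-based sampler: decision is a pure function of the trace ID.
final class RatioSampler {

    static boolean shouldSample(String traceId, double probability) {
        if (probability >= 1.0) return true;
        if (probability <= 0.0) return false;
        // Use the low 64 bits of the 32-hex-character trace ID as a pseudo-random value.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        // Drop one bit so the value is non-negative, in [0, 2^63).
        long bucket = low >>> 1;
        return bucket < (long) (probability * Long.MAX_VALUE);
    }
}

// Hypothetical rate limiter: at most maxPerSecond traces per fixed one-second window.
final class RateLimitingSampler {

    private final int maxPerSecond;
    private long currentWindow = -1;
    private int countInWindow = 0;

    RateLimitingSampler(int maxPerSecond) {
        this.maxPerSecond = maxPerSecond;
    }

    // nowMillis is passed in (for example System.currentTimeMillis()).
    synchronized boolean shouldSample(long nowMillis) {
        long window = nowMillis / 1000;
        if (window != currentWindow) {
            // A new second has started: reset the counter.
            currentWindow = window;
            countInWindow = 0;
        }
        if (countInWindow < maxPerSecond) {
            countInWindow++;
            return true;
        }
        return false;
    }
}
```

Because `RatioSampler` depends only on the trace ID, a trace is either kept by every service or dropped by every service, which avoids partial traces. The fixed window in `RateLimitingSampler` can admit a burst at a window boundary; a token bucket smooths that out at the cost of a little more code.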

Tracing Tools

Tool	Type	Notes
Jaeger	OSS	CNCF; Kubernetes-native
Zipkin	OSS	Twitter origin; simple
Tempo	OSS	Grafana; cost-effective storage
AWS X-Ray	Managed	AWS native
Honeycomb	SaaS	High-cardinality analysis

What's the difference between tracing and logging?

Tracing shows the flow and timing of a single request across services. Logging captures discrete events with context. Traces answer "where did time go?"; logs answer "what happened?". Use both together, and link them by including the trace ID in every log line.
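
Linking logs to traces usually means stamping each log line with the current trace ID. The sketch below shows the idea with a thread-local value; `TraceLog` is a hypothetical stand-in for a logging framework's mapped diagnostic context (MDC), and the log format is an assumption chosen for this example.

```java
// Hypothetical helper for illustration; logging frameworks provide this via an MDC.
final class TraceLog {

    // Trace ID of the request currently handled on this thread, if any.
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Call when request handling starts, once the trace context has been extracted.
    static void bind(String traceId) {
        TRACE_ID.set(traceId);
    }

    // Call when request handling ends, so pooled threads do not leak a stale ID.
    static void clear() {
        TRACE_ID.remove();
    }

    // Formats a log line that carries the trace ID, or "-" outside a traced request.
    static String format(String level, String message) {
        String id = TRACE_ID.get();
        return level + " [traceId=" + (id == null ? "-" : id) + "] " + message;
    }
}
```

With the ID in every line, searching the log store for one trace ID returns all events for that request across services, and a log line leads straight back to its trace.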

How do you decide on sampling rate?

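Start with 100% in development. In production, weigh three factors: (1) cost, since storing and indexing traces is expensive; (2) traffic volume, since high-traffic services need a lower rate to stay within budget; (3) error capture, where tail-based sampling can keep error traces regardless of the base rate. Common starting points are 1-10% for high-traffic services and 100% for critical paths.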

Head-based vs tail-based sampling?

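Head-based sampling decides when the trace starts, before the outcome is known; it is simple and cheap but may drop traces that later hit an error. Tail-based sampling decides after the complete trace has been seen, so it can keep every error or slow trace, at the cost of buffering spans until the trace finishes. Use head-based for most traffic; add tail-based for critical paths to ensure error capture.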
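
A tail-based decision can be sketched as a policy applied to a finished trace. `FinishedSpan` and `TailSampler` below are hypothetical types for this example, and the policy (keep on any error or any span at or above a latency threshold, otherwise fall back to the head-based decision) is one assumed choice among many. Real deployments typically make this decision in a collector that buffers spans per trace.

```java
import java.util.List;

// Hypothetical finished-span summary; only the fields the policy needs.
record FinishedSpan(String name, long durationMs, boolean error) {}

// Hypothetical tail-based policy evaluated once all spans of a trace have arrived.
final class TailSampler {

    private final long slowThresholdMs;

    TailSampler(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    // Keeps the trace if any span failed or was slow; otherwise defers to the head-based decision.
    boolean keep(List<FinishedSpan> trace, boolean headSampled) {
        for (FinishedSpan span : trace) {
            if (span.error() || span.durationMs() >= slowThresholdMs) {
                return true;
            }
        }
        return headSampled;
    }
}
```

With a 150 ms threshold, a trace containing a 200 ms payment span is kept even when the head-based sampler would have dropped it, while a fast, error-free trace is kept only if it was already sampled.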
