Distributed Tracing System¶
Interview Time: 60-90 min | Difficulty: Hard
Key Focus: Span collection, trace correlation, sampling, performance analysis
Step 1: Functional & Non-Functional Requirements¶
Functional Requirements¶
- Generate trace IDs at request entry (auto-generated or passed in)
- Propagate trace ID through request lifecycle
- Collect spans from all services touched by a request
- Store span data with timing, errors, and causality
- Query traces by trace ID, service, latency, error status
- Visualize request flow (service A → B → C timeline)
- Identify latency bottlenecks (which service is slow)
- Debug errors (with stack traces and logs attached to spans)
- Correlate with logs (logs carry the trace ID for cross-reference)
Non-Functional Requirements¶
| Requirement | Target | Notes |
|---|---|---|
| Span Ingestion | 1M spans/sec (~86B spans/day) | Same order of magnitude as a metrics pipeline |
| Latency | Span collected within 1 second | Near real-time debugging |
| Retention | 7 days hot storage, 30 days queryable | Don't keep forever (cost) |
| Query Speed | Retrieve trace <500ms, visualize <1s | Interactive debugging |
| Sampling | 0.1-100% sample rate (configurable) | Prevent storage explosion |
| Storage | ~500 GB/day (with 1% sampling + compression) | Compression + smart retention |
| Accuracy | Accurate timing within 10ms | Must trust latency data |
Step 2: API Design, Data Model & High-Level Design¶
Core API Endpoints¶
POST /traces/spans
[{trace_id, span_id, parent_span_id, service, operation, duration_ms,
start_time_ms, tags: {endpoint, status}, error: false}]
→ {success: true}
GET /traces/{trace_id}
→ {trace_id, spans: [{service, operation, start, duration, tags, error}], total_duration_ms}
GET /traces/search
?service=api-gateway&operation=GetUser&latency_min=100&latency_max=5000&error=false&limit=10
→ {traces: [{trace_id, duration, services_touched}], total_count}
GET /traces/{trace_id}/analyze
→ {critical_path: [services in bottleneck order], latency_breakdown: {service: ms}}
GET /services/{service_name}/latency
?percentile=p99&operation=GetUser&time_range=1h
→ {percentile_value_ms: 250, samples: 10000}
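For concreteness, a minimal client-side sketch of posting a span batch to the ingestion endpoint. The collector address is hypothetical and the payload simply mirrors the batch format above (assumes the `requests` library):

import time
import requests  # assumed HTTP client

COLLECTOR_URL = "http://trace-collector:4318/traces/spans"  # hypothetical address

span = {
    "trace_id": "abc123", "span_id": "span-2", "parent_span_id": "span-1",
    "service": "auth-service", "operation": "ValidateToken",
    "start_time_ms": int(time.time() * 1000), "duration_ms": 12,
    "tags": {"endpoint": "/v1/token", "status": 200}, "error": False,
}

# Spans are sent in batches (a list) to amortize HTTP overhead
resp = requests.post(COLLECTOR_URL, json=[span], timeout=2)
resp.raise_for_status()  # expect {"success": true}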
Entity Data Model¶
SPANS
├─ trace_id (PK, for grouping)
├─ span_id (PK, unique within trace)
├─ parent_span_id (null for root)
├─ service_name
├─ operation_name (function or RPC name)
├─ start_time_ms, duration_ms
├─ timestamp (server receive time)
├─ tags (JSON: {http.method, http.status_code, db.query_type})
├─ logs (array: {timestamp, message, severity})
├─ error: bool, error_message (if true)
├─ baggage (trace-wide context: user_id, request_id)
TRACE_INDEX
├─ trace_id (PK)
├─ service_names (array of all services touched)
├─ root_service, root_operation
├─ start_time_ms, duration_ms
├─ error: bool, error_count
├─ root_cause_service (inferred from trace)
├─ created_at
LATENCY_HISTOGRAM (per-service)
├─ service_name, operation_name
├─ histogram_buckets [{bucket_size, count}]
├─ percentiles: {p50, p95, p99}
├─ timestamp_ms (1 min interval, pre-aggregated)
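As a working reference for the code sketches later in this section, one possible in-memory representation of these entities in Python (field names follow the model above; illustrative only, not a storage schema):

from dataclasses import dataclass, field
from typing import Optional, List, Dict

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service_name: str
    operation_name: str
    start_time_ms: int
    duration_ms: int
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[dict] = field(default_factory=list)       # {timestamp, message, severity}
    error: bool = False
    error_message: Optional[str] = None
    baggage: Dict[str, str] = field(default_factory=dict) # trace-wide context

@dataclass
class Trace:
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        # The root span's duration approximates end-to-end latency
        roots = [s for s in self.spans if s.parent_span_id is None]
        return roots[0].duration_ms if roots else 0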
High-Level Architecture¶
graph TB
Service1["Service 1<br/>(generates spans)"]
Service2["Service 2"]
ServiceN["Service N"]
Instrumentation["Instrumentation Library<br/>(OpenTelemetry)"]
SpanCollector["Span Collector<br/>(agent process)"]
Kafka["Message Queue<br/>(Kafka)"]
Sampler["Sampler<br/>(decide to keep span)"]
Processor["Processor<br/>(enrich, aggregate)"]
Storage["Trace Storage<br/>(Cassandra/HBase)"]
Index["Index<br/>(Elasticsearch)"]
Cache["Cache<br/>(Redis)"]
QueryEngine["Query Engine<br/>(span lookup)"]
Visualizer["Visualizer<br/>(timeline render)"]
Analyzer["Analyzer<br/>(find bottlenecks)"]
Monitoring["Monitoring<br/>(trace health)"]
Service1 -->|emit spans| Instrumentation
Service2 -->|emit spans| Instrumentation
ServiceN -->|emit spans| Instrumentation
    Instrumentation -->|in-process export| SpanCollector
SpanCollector --> Sampler
Sampler -->|sample| Kafka
Kafka --> Processor
Processor --> Storage
Processor --> Index
Storage --> Cache
Cache --> QueryEngine
QueryEngine --> Visualizer
Storage --> Analyzer
Analyzer --> Monitoring
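A sketch of the collector-to-queue hop, assuming the kafka-python client and a hypothetical `spans` topic; keying messages by trace_id keeps all spans of a trace in one partition so the downstream processor can assemble complete traces:

import json
from kafka import KafkaProducer  # assumed: kafka-python client

# Hypothetical brokers; real deployments tune batching and compression
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    compression_type="gzip",
    linger_ms=50,   # small batching window to absorb bursts
)

def forward_sampled_span(span: dict) -> None:
    """Collector-side hop: sampled spans go onto the durable queue,
    keyed by trace_id so one trace maps to one partition."""
    producer.send("spans", key=span["trace_id"].encode("utf-8"), value=span)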
Step 3: Concurrency, Consistency & Scalability¶
🔴 Problem: Capturing Trace Context Across Services¶
Scenario: A request flows Client → API Gateway → Auth Service → Database. How do we link all of its spans into one trace?
Solution: Trace context propagation via headers
Request Header (sent by client or generated by gateway):
  X-Trace-ID: abc123 (globally unique, 128-bit UUID)
  X-Span-ID: span-1 (unique within trace)
  X-Parent-Span-ID: null (no parent for root)
API Gateway:
  - Receives X-Trace-ID: abc123
  - Creates span: {trace_id: abc123, span_id: span-1, parent: null}
  - Calls Auth Service with the same X-Trace-ID, plus X-Span-ID: span-2, X-Parent-Span-ID: span-1
Auth Service:
  - Receives X-Trace-ID: abc123, X-Span-ID: span-2, X-Parent-Span-ID: span-1
  - Creates span: {trace_id: abc123, span_id: span-2, parent: span-1}
  - Makes the database call with the updated trace context
Database:
  - Receives the trace context
  - Creates span: {trace_id: abc123, span_id: span-3, parent: span-2}
  - Returns data
All spans are grouped by trace_id=abc123 and form a tree via parent_span_id.
import uuid
import contextvars
from dataclasses import dataclass
from typing import Optional

# Current trace context for the in-flight request (set by inbound middleware)
current_trace_context = contextvars.ContextVar('current_trace_context', default=None)

@dataclass
class TraceContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None

def generate_trace_id() -> str:
    return uuid.uuid4().hex        # 128-bit, hex-encoded

def generate_span_id() -> str:
    return uuid.uuid4().hex[:16]   # 64-bit span ID

class TraceContextPropagator:
    def extract_trace_context(self, headers: dict) -> TraceContext:
        """Extract trace context from HTTP headers, or start a new trace."""
        trace_id = headers.get('X-Trace-ID')
        span_id = headers.get('X-Span-ID')
        parent_span_id = headers.get('X-Parent-Span-ID')
        # If not present, this service is the entry point: create a new trace
        if not trace_id:
            trace_id = generate_trace_id()
            span_id = generate_span_id()
            parent_span_id = None
        return TraceContext(trace_id=trace_id, span_id=span_id,
                            parent_span_id=parent_span_id)

    def inject_trace_context(self, context: TraceContext) -> dict:
        """Inject trace context into outgoing request headers."""
        # Create a child span ID for the outgoing call
        new_span_id = generate_span_id()
        return {
            'X-Trace-ID': context.trace_id,
            'X-Span-ID': new_span_id,
            'X-Parent-Span-ID': context.span_id,  # the current span is the callee's parent
            'X-Trace-Flags': '01'                 # sampled
        }

    def instrument_outgoing_request(self, client, method, url, headers, **kwargs):
        """Middleware that automatically propagates trace context on outbound calls."""
        # Get the current trace from context-local storage
        current_context = current_trace_context.get()
        if current_context:
            # Inject trace context into the outgoing request headers
            headers.update(self.inject_trace_context(current_context))
        # Make the actual request
        return client.request(method, url, headers=headers, **kwargs)
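On the inbound side, a service installs the propagator in its request middleware. A minimal sketch assuming a Flask service; creating and finishing the server-side span itself is omitted:

from flask import Flask, request  # assumed: Flask-based service

app = Flask(__name__)
propagator = TraceContextPropagator()

@app.before_request
def begin_trace():
    # Extract (or create) the trace context and stash it in context-local storage;
    # request.headers supports case-insensitive lookup of X-Trace-ID etc.
    ctx = propagator.extract_trace_context(request.headers)
    current_trace_context.set(ctx)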
🟡 Problem: Storage Explosion with 100B Spans/Day¶
Scenario: 1M spans/sec × 86,400 sec/day ≈ 86B spans/day. At ~1 KB per span, that is ~86 TB/day, which is unaffordable.
Solution: Sampling + smart retention
import random

class SamplingStrategy:
    def should_sample_trace(self, trace_context, service: str) -> bool:
        """Decide whether to keep this trace or discard it.
        The error and latency rules assume a tail-based decision: the trace
        is complete, so duration and error status are already known."""
        # Strategy 1: baseline percentage sampling (keep 1% of traces)
        base_sample_rate = 0.01
        # Strategy 2: always sample traces with errors
        if trace_context.has_error:
            return True
        # Strategy 3: always sample slow traces (above ~p99 latency)
        if trace_context.duration_ms > 1000:
            return True
        # Strategy 4: adaptive per-service rates
        service_sample_rates = {
            'api_gateway': 0.1,    # high-traffic entry point, moderate rate
            'auth_service': 0.05,  # less critical
            'payment': 0.5,        # critical, sample more
        }
        sample_rate = service_sample_rates.get(service, base_sample_rate)
        # Strategy 5: user-forced sampling via baggage (debug mode)
        if trace_context.baggage.get('sample') == 'true':
            return True
        # Final probabilistic decision; record it for downstream consumers
        decision = random.random() < sample_rate
        trace_context.sampled = decision
        trace_context.sample_rate = sample_rate
        return decision
def calculate_storage_with_sampling():
    """Back-of-envelope storage impact of sampling strategies."""
    spans_per_day = 86e9        # 86 billion spans/day
    bytes_per_span = 1000       # ~1 KB average
    # No sampling
    storage_no_sampling = spans_per_day * bytes_per_span / 1e12  # TB/day
    print(f"No sampling: {storage_no_sampling:.1f} TB/day (too much!)")
    # 1% sampling
    storage_1pct = storage_no_sampling * 0.01
    print(f"1% sampling: {storage_1pct:.2f} TB/day (~0.86 TB, affordable)")
    # Adaptive: 1% baseline plus ~5% extra volume from always-sampled errors/slow traces
    storage_adaptive = storage_1pct * 1.05
    print(f"Adaptive (1% + errors): {storage_adaptive:.2f} TB/day (still manageable)")
Sampling Trade-offs:
- Low (0.1%): ~86 GB/day (cheap, but misses issues)
- Moderate (1%): ~0.86 TB/day (good balance, affordable)
- Adaptive (1% + errors): ~0.9 TB/day (catches issues, still cheap)
- High (10%): ~8.6 TB/day (expensive, comprehensive)
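Note that the error and slow-trace rules can only fire once the whole trace has been seen, i.e. tail-based sampling. A minimal sketch of the buffering this implies in the processor (the 30 s completion window and the `should_keep` hook are assumptions):

import time
from collections import defaultdict

class TraceBuffer:
    """Holds spans per trace until the trace is complete (or the window expires),
    then hands the whole trace to the sampler for a tail-based keep/drop decision."""
    def __init__(self, sampler, flush_after_sec: float = 30.0):
        self.sampler = sampler
        self.flush_after_sec = flush_after_sec
        self.buffers = defaultdict(list)   # trace_id -> [span, ...]
        self.first_seen = {}               # trace_id -> arrival time

    def add_span(self, span: dict) -> None:
        tid = span["trace_id"]
        self.buffers[tid].append(span)
        self.first_seen.setdefault(tid, time.time())

    def flush_expired(self, write_batch) -> None:
        """Called periodically; decides keep/drop for traces older than the window."""
        now = time.time()
        expired = [t for t, ts in self.first_seen.items()
                   if now - ts > self.flush_after_sec]
        for tid in expired:
            spans = self.buffers.pop(tid)
            del self.first_seen[tid]
            if self.sampler.should_keep(tid, spans):  # hypothetical sampler hook
                write_batch(spans)                    # persist to storage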
💾 Problem: Fast Trace Reconstruction (Find Critical Path)¶
Scenario: A trace has 50 spans across 10 services. The user wants to know which step was slowest (critical path analysis).
Solution: Pre-compute critical path during ingestion
from collections import defaultdict
from typing import List

class CriticalPathAnalyzer:
    def compute_critical_path(self, trace) -> List:
        """Find the longest path through the span tree (the critical path)."""
        # Build the span tree
        spans_by_id = {s.span_id: s for s in trace.spans}
        children_of = defaultdict(list)
        for span in trace.spans:
            if span.parent_span_id:
                children_of[span.parent_span_id].append(span)

        memo = {}

        def longest_path_from(span_id):
            """DFS returning the path of maximal total duration rooted at span_id."""
            if span_id in memo:
                return memo[span_id]
            span = spans_by_id[span_id]
            children = children_of[span_id]
            if not children:
                path = [span]  # leaf node
            else:
                # Follow the child subtree with the largest total duration
                longest_child_path = max(
                    (longest_path_from(c.span_id) for c in children),
                    key=lambda p: sum(s.duration_ms for s in p)
                )
                path = [span] + longest_child_path
            memo[span_id] = path
            return path

        # Start from the root span (no parent); assume one root per trace
        root_spans = [s for s in trace.spans if s.parent_span_id is None]
        return longest_path_from(root_spans[0].span_id) if root_spans else []

    def analyze_trace_latency(self, trace):
        """Report where time is spent in the trace and which services dominate."""
        critical_path = self.compute_critical_path(trace)
        # Group spans by service
        by_service = defaultdict(lambda: {'count': 0, 'duration_ms': 0})
        for span in trace.spans:
            by_service[span.service_name]['count'] += 1
            by_service[span.service_name]['duration_ms'] += span.duration_ms
        # Rank services by time spent (identify the bottleneck)
        service_times = sorted(by_service.items(),
                               key=lambda x: x[1]['duration_ms'], reverse=True)
        print(f"Trace {trace.trace_id}: Total {trace.duration_ms}ms")
        print(f"Critical path: {' → '.join(s.service_name for s in critical_path)}")
        print("Bottleneck services:")
        for service, metrics in service_times[:5]:
            pct = (metrics['duration_ms'] / trace.duration_ms) * 100
            print(f"  {service}: {metrics['duration_ms']}ms ({pct:.1f}%)")
Step 4: Persistence Layer, Caching & Monitoring¶
Database Design¶
-- Spans table (time-series; illustrative CQL-style DDL)
CREATE TABLE spans (
    trace_id TEXT,
    span_id TEXT,
    parent_span_id TEXT,
    service_name TEXT,
    operation_name TEXT,
    start_time_ms BIGINT,
    duration_ms INT,
    tags JSONB,
    logs ARRAY<STRUCT<timestamp, message, severity>>,
    error BOOLEAN,
    error_message TEXT,
    PRIMARY KEY (trace_id, span_id)
) WITH COMPRESSION = 'LZ4'   -- block compression; Gorilla-style encoding only fits numeric series
  AND TTL = 604800;          -- 7 days
-- Trace index (for searching)
CREATE TABLE trace_index (
    trace_id TEXT PRIMARY KEY,
    service_names SET<TEXT>,
    duration_ms INT,
    error BOOLEAN,
    root_operation TEXT,
    timestamp_ms BIGINT,
    sample_rate FLOAT
);
-- Latency histograms (per-service, pre-aggregated per minute)
CREATE TABLE latency_histogram (
    service TEXT,
    operation TEXT,
    timestamp_ms BIGINT,
    percentile_p50 INT,
    percentile_p95 INT,
    percentile_p99 INT,
    samples_count INT,
    PRIMARY KEY (service, operation, timestamp_ms DESC)
);
CREATE INDEX idx_spans_service_time
    ON spans (service_name, start_time_ms DESC);
CREATE INDEX idx_trace_index_duration
    ON trace_index (duration_ms DESC);
CREATE INDEX idx_trace_index_timestamp
    ON trace_index (timestamp_ms DESC);
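A sketch of the read path behind GET /traces/search: the trace index answers the filter query, then spans are fetched by trace_id. Index name, keyspace, and hosts are illustrative; assumes the elasticsearch-py (8.x-style) and cassandra-driver clients:

from elasticsearch import Elasticsearch  # assumed: elasticsearch-py client
from cassandra.cluster import Cluster    # assumed: cassandra-driver

es = Elasticsearch("http://elasticsearch:9200")
session = Cluster(["cassandra-1"]).connect("tracing")

def search_slow_traces(service: str, latency_min_ms: int, limit: int = 10):
    """Step 1: find candidate trace IDs in the search index."""
    resp = es.search(index="trace_index", size=limit, query={
        "bool": {"filter": [
            {"term":  {"service_names": service}},
            {"range": {"duration_ms": {"gte": latency_min_ms}}},
        ]}
    })
    return [hit["_source"]["trace_id"] for hit in resp["hits"]["hits"]]

def load_trace(trace_id: str):
    """Step 2: fetch all spans of one trace from its partition in Cassandra."""
    rows = session.execute(
        "SELECT * FROM spans WHERE trace_id = %s", (trace_id,))
    return list(rows)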
Caching Strategy¶
TIER 1: Hot Trace Cache (Redis)
├─ Recently accessed traces (TTL: 24 hours)
├─ Pre-computed critical paths
└─ Frequently searched traces
TIER 2: Latency Histogram Cache
├─ Service latency stats (p50, p95, p99)
├─ Updated every minute
└─ TTL: 1 hour (safe for stats)
TIER 3: Persistent Storage (Cassandra/HBase)
├─ All spans (7 days), then archived to S3
├─ Trace index for fast lookup
└─ Pre-computed critical paths for recent traces
Invalidation:
- New span arrived → invalidate cache for that trace
- Trace completed → pre-compute critical path
- Histogram update → refresh service statistics
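A cache-aside sketch for Tier 1, assuming the redis-py client; key names and the `load_from_storage` callback are illustrative:

import json
import redis  # assumed: redis-py client

r = redis.Redis(host="redis", port=6379)
TRACE_TTL_SEC = 24 * 3600   # Tier 1: recently accessed traces kept for 24h

def get_trace_cached(trace_id: str, load_from_storage):
    """Cache-aside read: Redis first, Cassandra/HBase on a miss."""
    cached = r.get(f"trace:{trace_id}")
    if cached is not None:
        return json.loads(cached)
    trace = load_from_storage(trace_id)   # expensive storage read
    r.setex(f"trace:{trace_id}", TRACE_TTL_SEC, json.dumps(trace))
    return trace

def invalidate_trace(trace_id: str):
    """Called when a late span arrives for an already-cached trace."""
    r.delete(f"trace:{trace_id}")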
Monitoring & Alerts¶
Key Metrics:
- Span Ingestion Rate — Spans/sec (target: 1M/sec)
- P99 Latency — Distribution of latencies (track by service)
- Error Rate by Service — % of traces with errors (target: <1%)
- Query Performance — Trace retrieval time (target: <500ms)
- Sampling Rate — % of traces sampled (monitor for changes)
- alert: LowSpanIngestion
  expr: rate(spans_ingested_total[5m]) < 500000
  annotations: "Span ingestion rate low: {{$value}}/sec"
- alert: TraceQueryLatencySlow
  expr: histogram_quantile(0.99, rate(trace_query_duration_seconds_bucket[5m])) > 0.5
  annotations: "Trace query p99: {{$value}}s (target: <500ms)"
- alert: ServiceLatencySpiking
  expr: histogram_quantile(0.99, rate(service_latency_seconds_bucket{service="payment"}[5m])) > 1.0
  for: 5m
  annotations: "Payment service p99 latency: {{$value}}s (spike!)"
- alert: ErrorRateHigh
  expr: sum(rate(spans_error_total[5m])) / sum(rate(spans_ingested_total[5m])) > 0.05
  for: 5m
  annotations: "Error rate >5% ({{$value | humanizePercentage}})"
- alert: TracesBacklogged
  expr: trace_processing_queue_size > 1000000
  annotations: "Span queue depth: {{$value}} messages (inspect sampler/processor)"
- alert: StorageGrowthRapid
  expr: rate(trace_storage_bytes[1h]) > 50 * 1024 * 1024 * 1024
  annotations: "Trace storage growth: {{$value | humanize}}B/hour (check sampling)"
⚡ Quick Reference Cheat Sheet¶
When to Use What¶
| Need | Technology | Why |
|---|---|---|
| Trace context propagation | OpenTelemetry libraries | Standard format, language-agnostic |
| Span ingestion | Kafka or Redis Streams | Durable queue, handles spikes |
| Storage | Cassandra or HBase | Time-series optimized, distributed |
| Indexing | Elasticsearch | Full-text search by operation, service |
| Sampling | Percentage + adaptive + error-aware | Control storage cost, catch issues |
| Critical path | Pre-compute during ingestion | Fast analysis queries |
Critical Design Decisions¶
- Trace ID propagation: Use HTTP headers (X-Trace-ID, X-Span-ID) for automatic tracking
- Sampling strategy: 1% baseline + always sample errors/slow traces (p99 latency)
- Storage: 7d raw (hot), 30d queryable index, archive beyond
- Critical path: Pre-compute at ingestion time for instant analysis
- Baggage context: Embed user_id, endpoint, status in trace for filtering
- Causality tracking: Use parent_span_id to reconstruct request flow
Tech Stack Summary¶
Instrumentation: OpenTelemetry (Java, Python, Node, Go, etc.)
Trace Collection: Custom agent or Jaeger Agent
Queue: Kafka (durable queue)
Storage: Cassandra or HBase (time-series)
Index: Elasticsearch (full-text search)
Visualization: Jaeger UI or custom
Analytics: Pre-aggregated histograms
🎯 Interview Summary (5 Minutes)¶
- Trace context propagation: HTTP headers (X-Trace-ID, X-Span-ID) automatically link spans across services
- Sampling strategy: 1% baseline + always sample errors + slow requests (p99) to control storage cost
- Span storage: Cassandra/HBase with 7d hot, 30d queryable, then archive
- Critical path analysis: Pre-compute longest path through span DAG at ingestion time
- Service latency rankings: Group spans by service, find bottleneck (most time spent where)
- Error debugging: Attach logs, stack traces, tags to spans; correlate with application logs
- Query optimization: Index by service/operation, cache recent traces, use aggregated histograms