Service Mesh — Deep Dive

Level: Advanced
Pre-reading: 05 · API & Communication, 09 · Deployment & Infrastructure


What is a Service Mesh?

A service mesh is a dedicated infrastructure layer for service-to-service communication. A sidecar proxy is deployed alongside each service instance; the proxies intercept all traffic and apply routing, security, and observability policies without application code changes.

```mermaid
graph TD
    subgraph Control Plane
        CP[Istiod]
    end
    subgraph Data Plane
        subgraph Pod A
            A[App A]
            EA[Envoy]
        end
        subgraph Pod B
            B[App B]
            EB[Envoy]
        end
    end
    CP -->|Config| EA
    CP -->|Config| EB
    A --> EA
    EA -->|mTLS| EB
    EB --> B
```
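In Istio, for example, the sidecar is added by a mutating admission webhook once a namespace is labeled for injection. A minimal sketch (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio-injection: enabled  # webhook injects the Envoy sidecar into new pods
```

Existing pods must be restarted to pick up the sidecar; only pods created after labeling are injected.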

Why Service Mesh?

| Without Mesh | With Mesh |
|---|---|
| Each service implements retries | Mesh handles retries |
| Each language needs its own circuit breaker library | Consistent across languages |
| mTLS implemented per service | Automatic mTLS |
| Distributed tracing requires code changes | Automatic trace propagation |
| Traffic management requires code | Declarative traffic rules |

Service Mesh Components

Data Plane

The data plane is the collection of sidecar proxies that intercept all network traffic.

| Aspect | Description |
|---|---|
| Sidecar proxy | Envoy, Linkerd proxy |
| Traffic interception | iptables rules redirect traffic |
| Protocol handling | HTTP/1.1, HTTP/2, gRPC, TCP |
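The interception itself is ordinary Linux networking. A simplified sketch of the kind of NAT rules Istio's init container installs inside the pod's network namespace (actual chains and options vary by version; 15001 and 15006 are Envoy's outbound and inbound listener ports in recent Istio releases):

```shell
# Redirect the pod's outbound TCP traffic to Envoy's outbound listener
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 15001
# Redirect inbound TCP traffic to Envoy's inbound listener
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15006
```

Because the redirection happens at the kernel level, the application is unaware of the proxy and needs no configuration changes.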

Control Plane

The control plane manages and configures the proxies.

| Component | Description |
|---|---|
| Configuration API | VirtualService, DestinationRule |
| Service discovery | Kubernetes API, Consul |
| Certificate authority | Issues mTLS certificates |
| Telemetry collection | Aggregates metrics, traces |

Istio Architecture

```mermaid
graph TD
    subgraph Control Plane
        I[Istiod]
        I --> CA[Certificate Authority]
        I --> C[Config Management]
        I --> D[Discovery]
    end
    subgraph Data Plane
        E1[Envoy]
        E2[Envoy]
        E3[Envoy]
    end
    I -->|xDS| E1
    I -->|xDS| E2
    I -->|xDS| E3
    E1 <-->|mTLS| E2
    E2 <-->|mTLS| E3
```

| Istio Component | Purpose |
|---|---|
| Istiod | Unified control plane (Pilot, Citadel, Galley) |
| Envoy | Sidecar proxy |
| Gateway | Ingress/egress traffic |
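Istiod is typically installed declaratively. A minimal sketch using the IstioOperator API (profile names such as default and demo are built into istioctl):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  profile: default  # applied with: istioctl install -f <file>
```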

Traffic Management

VirtualService

A VirtualService defines routing rules for a host. The example below sends requests carrying the header x-canary: true to subset v2, and splits the remaining traffic 90/10 between v1 and v2.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: v2
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
```
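A VirtualService can also inject faults to test downstream resilience. A sketch using Istio's HTTPFaultInjection fields that delays 10% of requests by 5 seconds (the resource name is hypothetical):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-fault
spec:
  hosts:
    - order-service
  http:
    - fault:
        delay:
          percentage:
            value: 10       # percent of requests affected
          fixedDelay: 5s    # injected latency per affected request
      route:
        - destination:
            host: order-service
            subset: v1
```

Fault injection lets you verify that callers' timeouts and circuit breakers actually fire, without touching the target service.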

DestinationRule

A DestinationRule defines policies applied to traffic after routing (connection pooling, load balancing, outlier detection) and declares the named subsets that VirtualServices route to.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
    loadBalancer:
      simple: LEAST_CONN
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```

Resilience Features

Circuit Breaking

Istio implements circuit breaking with outlier detection: hosts that return consecutive errors are ejected from the load-balancing pool for a period of time.

```yaml
trafficPolicy:
  outlierDetection:
    consecutive5xxErrors: 5    # eject after 5 consecutive 5xx responses
    interval: 10s              # how often hosts are evaluated
    baseEjectionTime: 30s      # minimum ejection duration
    maxEjectionPercent: 50     # never eject more than half the pool
```

Retries

```yaml
http:
  - route:
      - destination:
          host: order-service
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure
```

Timeouts

```yaml
http:
  - route:
      - destination:
          host: order-service
    timeout: 10s
```
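The route timeout caps the entire request, including all retry attempts, so it should exceed attempts × perTryTimeout. A combined sketch:

```yaml
http:
  - route:
      - destination:
          host: order-service
    timeout: 10s          # total budget, spanning all retries
    retries:
      attempts: 3
      perTryTimeout: 2s   # 3 × 2s fits within the 10s budget
```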

Security: mTLS

The mesh upgrades traffic between sidecars to mutual TLS automatically; applications speak plaintext only to their local sidecar.

```mermaid
sequenceDiagram
    participant A as Service A
    participant EA as Envoy A
    participant EB as Envoy B
    participant B as Service B

    A->>EA: HTTP request
    EA->>EB: mTLS encrypted
    EB->>B: HTTP request
    B->>EB: Response
    EB->>EA: mTLS encrypted
    EA->>A: Response
```

PeerAuthentication

A PeerAuthentication policy sets the required mTLS mode for a namespace or workload:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT  # Require mTLS for all services
```
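STRICT mode breaks callers that are not yet in the mesh. A common migration step is a workload-level override (a sketch; the names and selector label are hypothetical) that keeps one legacy workload in PERMISSIVE mode while the namespace default stays STRICT:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-override
  namespace: production
spec:
  selector:
    matchLabels:
      app: legacy-service
  mtls:
    mode: PERMISSIVE  # accept both plaintext and mTLS during migration
```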

AuthorizationPolicy

An AuthorizationPolicy controls which identities may call a workload. This example allows only the payment-service service account to call the order-service API:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: order-service-policy
spec:
  selector:
    matchLabels:
      app: order-service
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production/sa/payment-service"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/api/orders/*"]
```

Observability

Service mesh provides observability without code changes.

Metrics (Prometheus)

```
# Istio exports these automatically
istio_requests_total{...}
istio_request_duration_milliseconds{...}
istio_tcp_connections_opened_total{...}
```
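These metrics can be queried directly. A sketch of a 5xx error-rate query for order-service (label values depend on your cluster's service naming):

```promql
sum(rate(istio_requests_total{destination_service_name="order-service", response_code=~"5.."}[5m]))
  /
sum(rate(istio_requests_total{destination_service_name="order-service"}[5m]))
```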

Distributed Tracing

```mermaid
graph LR
    A[Service A] -->|span| B[Service B]
    B -->|span| C[Service C]
    A -->|trace| J[Jaeger]
    B -->|trace| J
    C -->|trace| J
```

Envoy understands B3 and W3C TraceContext headers and creates spans automatically, but the application must forward these headers from each inbound request to its outbound calls so the spans join the same trace.
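A minimal, framework-agnostic sketch of that header forwarding (the function name is illustrative; the header set is the B3 and W3C TraceContext names Istio supports):

```python
# Headers the application should copy from an inbound request onto
# every outbound call it makes while handling that request.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",  # W3C TraceContext
    "tracestate",
)

def propagation_headers(incoming: dict) -> dict:
    """Return the subset of inbound headers to copy onto outbound calls."""
    lowered = {k.lower(): v for k, v in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

Without this forwarding, each hop starts a new trace and the call graph in Jaeger fragments into disconnected spans.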

Access Logs

```json
{
  "protocol": "HTTP/2",
  "upstream_service": "order-service",
  "response_code": 200,
  "response_flags": "-",
  "duration": 45,
  "request_id": "abc-123"
}
```

Service Mesh Options

| Mesh | Proxy | Key Features |
|---|---|---|
| Istio | Envoy | Full-featured; complex |
| Linkerd | Linkerd2-proxy | Lightweight; simpler |
| Consul Connect | Envoy | HashiCorp ecosystem |
| AWS App Mesh | Envoy | AWS-native |
| Kuma | Envoy | CNCF; multi-cluster |

Selection Criteria

| Factor | Recommendation |
|---|---|
| Simplicity | Linkerd |
| Features | Istio |
| AWS native | App Mesh |
| Multi-cloud | Istio, Kuma |
| Existing HashiCorp stack | Consul Connect |

When to Use Service Mesh

Good Fit

| Scenario | Why Mesh Helps |
|---|---|
| 10+ microservices | Consistent policies at scale |
| Polyglot services | Language-agnostic features |
| Zero-trust security | Automatic mTLS |
| Complex traffic management | Canary, A/B, fault injection |
| Observability gaps | Automatic metrics and traces |

Poor Fit

| Scenario | Why Not |
|---|---|
| < 5 services | Overhead not justified |
| Simple routing | K8s Services suffice |
| Resource constrained | Sidecar overhead |
| Team unfamiliar | Learning curve |

Mesh Overhead

| Resource | Per Pod | Notes |
|---|---|---|
| Memory | 50-100 MB | Envoy sidecar |
| CPU | 0.1-0.2 cores | Processing traffic |
| Latency | 1-3 ms | Additional hop |
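The per-pod figures compound across the fleet, which is why small clusters often cannot justify a mesh. A rough estimator using the ranges above (0.1-0.2 cores expressed as 100-200 millicores; the 200-pod cluster is hypothetical):

```python
def sidecar_overhead(pods: int, mem_mb=(50, 100), cpu_millicores=(100, 200)):
    """Estimate cluster-wide sidecar cost from per-pod (low, high) ranges."""
    return {
        "memory_mb": (pods * mem_mb[0], pods * mem_mb[1]),
        "cpu_millicores": (pods * cpu_millicores[0], pods * cpu_millicores[1]),
    }

# For a hypothetical 200-pod cluster: 10-20 GB of memory and
# 20-40 cores spent purely on sidecars.
```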

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Mesh for everything | Overhead on simple apps | Use mesh where needed |
| Ignoring sidecar health | App works, mesh doesn't | Include sidecar in health checks |
| Complex routing logic | Hard to understand | Keep routing simple |
| No gradual rollout | Breaking changes | Canary the mesh itself |

When should you use a service mesh vs implementing resilience in the application?

Use a service mesh when: (1) you have many services in different languages; (2) you need consistent mTLS; (3) you want observability without code changes. Use application libraries (e.g., Resilience4j) when: (1) you have a homogeneous stack; (2) you need fine-grained control; (3) sidecar overhead is unacceptable.

What's the difference between Istio and Linkerd?

Istio is feature-rich (traffic management, security, observability) but complex; it uses Envoy and is comparatively resource-heavy. Linkerd is lightweight and simpler, focusing on reliability; its custom Rust proxy has lower overhead. Choose Istio for features, Linkerd for simplicity.

How does mTLS work in a service mesh?

(1) The control plane acts as a certificate authority and issues short-lived certificates to each workload. (2) Certificates are rotated automatically (typically every 24 hours). (3) Sidecars intercept traffic and establish mTLS connections on the application's behalf. (4) A PeerAuthentication policy enforces the mTLS mode (STRICT, PERMISSIVE). (5) AuthorizationPolicy controls which services may communicate.