# Observability
You can't fix what you can't see. Observability is the foundation of reliable operations.
## The Three Pillars
```mermaid
graph TD
    O[Observability] --> L[Logs - What happened]
    O --> M[Metrics - How much and how fast]
    O --> T[Traces - Where did time go]
```
| Pillar | Question Answered | Tool Stack |
|---|---|---|
| Logs | What happened? What errors occurred? | ELK Stack, Grafana Loki, Splunk |
| Metrics | How many requests? CPU? Error rate? | Prometheus + Grafana, Datadog |
| Traces | Where did this request spend its time across services? | Jaeger, Zipkin, AWS X-Ray, Honeycomb |
## Observability vs Monitoring
Monitoring answers questions you knew to ask in advance (is CPU high?). Observability lets you ask questions you did not anticipate, by inferring a system's internal state from its external outputs: logs, metrics, and traces.
## Logging
| Best Practice | Description |
|---|---|
| Structured logging | Emit JSON; machine-parseable; fields queryable by log aggregator |
| Correlation ID | Unique per-request ID propagated through all services via headers |
| Log levels | ERROR (needs action), WARN (degraded), INFO (business events), DEBUG (dev only — never in prod) |
| Never log secrets | No PII, tokens, passwords, or card numbers in logs — ever |
| Centralize | Ship to central store (not pod local); pods are ephemeral |
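The first two practices in the table can be sketched in plain Java. This is illustrative only; a real service would use SLF4J/Logback with a JSON encoder and read the correlation ID from an incoming header (e.g. `X-Correlation-Id`). The class and field names here are hypothetical, and values are assumed not to need JSON escaping:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class StructuredLog {

    // One correlation ID per request, normally read from an incoming
    // header at the edge or generated there if absent.
    public static String newCorrelationId() {
        return UUID.randomUUID().toString();
    }

    // Emit one machine-parseable JSON line; every field becomes
    // queryable in the log aggregator.
    public static String logLine(String level, String correlationId,
                                 String service, String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("level", level);
        fields.put("correlationId", correlationId);
        fields.put("service", service);
        fields.put("message", message);
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 1) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        String cid = newCorrelationId();
        // The same cid is logged by every service the request touches,
        // so the aggregator can stitch the full request story together.
        System.out.println(logLine("INFO", cid, "order-service", "order created"));
        System.out.println(logLine("INFO", cid, "payment-service", "payment captured"));
    }
}
```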
ELK Stack flow: app logs → shipper (e.g. Filebeat) → Logstash (parse, enrich) → Elasticsearch (store, index) → Kibana (search, visualize).
→ Deep Dive: Logging — Structured logging, correlation IDs, ELK stack, log levels
## Metrics
| Metric Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing; resets on restart | Total HTTP requests, total errors |
| Gauge | Current snapshot; can go up or down | Active connections, memory used |
| Histogram | Distribution of values in configurable buckets | Request latency (50ms, 100ms, 500ms buckets) |
| Summary | Pre-calculated quantiles client-side | P50, P95, P99 latency |
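A minimal sketch of how the first three types behave (a hypothetical class, not a real client library; use Micrometer or the Prometheus Java client in practice). Note the Prometheus convention that histogram buckets are cumulative:

```java
import java.util.Map;
import java.util.TreeMap;

public class Metrics {
    // Counter: monotonically increasing, resets only on process restart.
    long requestsTotal = 0;
    // Gauge: current snapshot, moves up and down.
    long activeConnections = 0;
    // Histogram: cumulative buckets keyed by latency upper bound (ms).
    final TreeMap<Double, Long> latencyBuckets = new TreeMap<>();

    public Metrics() {
        for (double bound : new double[]{50, 100, 500, Double.POSITIVE_INFINITY}) {
            latencyBuckets.put(bound, 0L);
        }
    }

    public void connectionOpened() { activeConnections++; }
    public void connectionClosed() { activeConnections--; }

    public void observeRequest(double latencyMs) {
        requestsTotal++;
        // Cumulative buckets: the observation counts in every bucket
        // whose upper bound is >= the observed value.
        for (Map.Entry<Double, Long> b : latencyBuckets.entrySet()) {
            if (latencyMs <= b.getKey()) {
                b.setValue(b.getValue() + 1);
            }
        }
    }

    public long bucketCount(double upperBound) {
        return latencyBuckets.get(upperBound);
    }
}
```

A Summary would instead keep a sliding window of observations client-side and report pre-computed quantiles, at the cost of not being aggregatable across instances.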
### RED Method — For Service Health
| Letter | Metric |
|---|---|
| R — Rate | Requests per second |
| E — Errors | Error rate (%) |
| D — Duration | Latency distribution (P50, P95, P99) |
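The three RED numbers can be derived directly from a window of raw request samples, sketched below. In production you would query them from Prometheus (`rate()`, `histogram_quantile()`); this uses a simple nearest-rank percentile for illustration:

```java
import java.util.Arrays;

public class RedMetrics {
    // R: requests per second over the observation window.
    public static double rate(int requests, double windowSeconds) {
        return requests / windowSeconds;
    }

    // E: percentage of requests that failed.
    public static double errorRatePercent(int errors, int requests) {
        return 100.0 * errors / requests;
    }

    // D: latency at quantile q (0 < q <= 1), nearest-rank method.
    public static double durationQuantile(double[] latenciesMs, double q) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }
}
```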
### USE Method — For Resource Health
| Letter | Metric |
|---|---|
| U — Utilization | How busy is the resource (%) |
| S — Saturation | How much work is queued / waiting |
| E — Errors | Errors on the resource |
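To make U and S concrete, here is a sketch that treats a fixed-size worker pool as the resource. Utilization is the fraction of workers busy; saturation is the backlog that cannot be served yet (the numbers and class are hypothetical):

```java
public class UseMetrics {
    // U: how busy the resource is, as a percentage of capacity.
    public static double utilizationPercent(int busyWorkers, int totalWorkers) {
        return 100.0 * busyWorkers / totalWorkers;
    }

    // S: work queued beyond capacity. Zero saturation means no waiting,
    // even at 100% utilization; sustained saturation means growing latency.
    public static int saturation(int submittedTasks, int totalWorkers) {
        return Math.max(submittedTasks - totalWorkers, 0);
    }
}
```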
→ Deep Dive: Metrics — Counter/Gauge/Histogram, RED and USE methods, Prometheus
## Distributed Tracing
```mermaid
sequenceDiagram
    participant Client
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    Client->>GW: Request TraceId=abc SpanId=1
    GW->>OS: TraceId=abc SpanId=2 ParentId=1
    OS->>PS: TraceId=abc SpanId=3 ParentId=2
    PS-->>OS: 200 OK
    OS-->>GW: 200 OK
    GW-->>Client: 200 OK
```
| Concept | Description |
|---|---|
| Trace | Full end-to-end journey of one request across all services |
| Span | One operation within a trace; has start time, duration, and metadata |
| Context Propagation | TraceId + SpanId passed via HTTP headers (B3 format or W3C TraceContext) |
| OpenTelemetry (OTel) | CNCF vendor-neutral standard; collects logs, metrics, and traces |
| Head-based sampling | Decide to trace at request entry; consistent but may miss errors |
| Tail-based sampling | Decide after seeing full trace; captures errors; more complex |
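Context propagation via the W3C Trace Context `traceparent` header can be sketched as below. The header layout is `version-traceid-spanid-flags` (version `00`, 32-hex trace ID, 16-hex span ID, `01` = sampled). In practice OpenTelemetry propagators handle this automatically; the class here is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class TraceContext {
    // Trace IDs are 32 hex chars, span IDs 16.
    public static String randomHex(int chars) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars; i++) {
            sb.append(Integer.toHexString(ThreadLocalRandom.current().nextInt(16)));
        }
        return sb.toString();
    }

    // Outgoing call: keep the trace ID, mint a fresh span ID so the
    // downstream span's parent is this service's span.
    public static Map<String, String> outgoingHeaders(String traceId, String newSpanId) {
        Map<String, String> headers = new HashMap<>();
        headers.put("traceparent", "00-" + traceId + "-" + newSpanId + "-01");
        return headers;
    }

    // Incoming call: recover the trace ID to join the existing trace.
    public static String traceIdOf(String traceparent) {
        return traceparent.split("-")[1];
    }
}
```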
→ Deep Dive: Distributed Tracing — Spans, context propagation, sampling strategies
## SLI · SLO · SLA · Error Budget
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | The metric being measured | successful_requests / total_requests |
| SLO (Objective) | Target value for the SLI | 99.9% availability |
| SLA (Agreement) | Contractual commitment with penalties | 99.5% or 10% refund |
| Error Budget | 100% − SLO target | 0.1% ≈ 43.8 min/month allowed downtime |
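The budget arithmetic is simple enough to verify directly. The ≈43.8 min figure assumes an average month of 30.44 days (365.25 / 12); a flat 30-day month gives 43.2 min instead:

```java
public class ErrorBudget {
    // Allowed downtime per month for a given SLO, e.g. 99.9 -> ~43.8 min.
    public static double allowedDowntimeMinutesPerMonth(double sloPercent) {
        double monthMinutes = 30.44 * 24 * 60;               // average month
        double budgetFraction = (100.0 - sloPercent) / 100.0; // 0.1% for 99.9%
        return budgetFraction * monthMinutes;
    }
}
```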
Error budgets drive decisions: if the budget is burning fast, freeze feature deployments and focus on reliability.
→ Deep Dive: SLI, SLO, SLA — Error budgets, burn rate alerting, SRE approach
## Health Checks & Kubernetes Probes
| Probe | Question | Pod Action if Fails |
|---|---|---|
| livenessProbe | Is the app alive (not stuck/deadlocked)? | Restart the pod |
| readinessProbe | Is the app ready to receive traffic? | Remove from Service endpoints |
| startupProbe | Has a slow-starting app finished initializing? | Restart the pod; while it runs, liveness/readiness checks are held off |
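An illustrative pod spec fragment wiring up all three probes. The Actuator paths assume Spring Boot's liveness/readiness health groups are enabled; the service name and numbers are hypothetical:

```yaml
containers:
  - name: order-service
    livenessProbe:
      httpGet: { path: /actuator/health/liveness, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3        # 3 misses -> kubelet restarts the container
    readinessProbe:
      httpGet: { path: /actuator/health/readiness, port: 8080 }
      periodSeconds: 5           # failing pod is pulled from Service endpoints
    startupProbe:
      httpGet: { path: /actuator/health/liveness, port: 8080 }
      periodSeconds: 10
      failureThreshold: 30       # up to 30 * 10s = 300s to finish booting
```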
## Alerting Best Practices
| Principle | Description |
|---|---|
| Alert on symptoms, not causes | Alert on SLO burn rate (user impact), not individual CPU spikes |
| Avoid alert fatigue | Too many alerts → all ignored; tune and reduce ruthlessly |
| Actionable alerts only | Every alert should require a human action |
| On-call rotation | Use PagerDuty / OpsGenie; document runbooks for every alert |
| Escalation paths | Define who gets paged and when escalation triggers |
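A symptom-based alert might look like the Prometheus rule below: page when the 5xx ratio burns a 99.9% SLO's monthly budget roughly 14x too fast (a common fast-burn threshold). The metric name, labels, and runbook URL are placeholders:

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
            > 14.4 * 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          runbook: https://example.com/runbooks/high-burn  # every alert needs one
```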
## OpenTelemetry (OTel) in Spring Boot
| Component | Description |
|---|---|
| SDK | Instrument your app (auto or manual) |
| Collector | Receives, processes, and exports telemetry |
| Exporters | Send to Jaeger, Zipkin, Prometheus, Datadog, etc. |
| Spring Boot support | spring-boot-starter-actuator + Micrometer + OTel integration |
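A minimal Collector pipeline sketch, assuming the app's SDK pushes telemetry over OTLP: receive, batch, then export traces to a Jaeger backend (which accepts OTLP on 4317) and expose metrics for Prometheus to scrape. Endpoints are illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```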
→ Deep Dive: OpenTelemetry — SDK, Collector, auto-instrumentation, Spring Boot integration
## DORA Metrics
DevOps Research and Assessment (DORA) measures software delivery performance:
| Metric | Measures | Target |
|---|---|---|
| Deployment Frequency | How often code reaches production | 3+ per day (elite) |
| Lead Time for Changes | Code commit → production | < 1 hour (elite) |
| Change Failure Rate | % of deployments causing incidents | 0–15% (elite) |
| Mean Time to Recovery | Time to restore service after incident | < 1 hour (elite) |
Elite performers ship 3+ times per day with low defect rates and recover from incidents in under an hour.
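Three of the four metrics are straightforward to compute once the timestamps exist; the sketch below derives them from deployment records. Real pipelines pull these from the CI/CD system and the incident tracker (the class and method names are hypothetical):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class Dora {
    // Deployment frequency: deployments per day over an observation window.
    public static double deploymentsPerDay(int deployments, int windowDays) {
        return (double) deployments / windowDays;
    }

    // Lead time for changes: commit -> production, here as a mean in minutes.
    // commits.get(i) and deploys.get(i) are assumed paired per change.
    public static double meanLeadTimeMinutes(List<Instant> commits, List<Instant> deploys) {
        long totalMinutes = 0;
        for (int i = 0; i < commits.size(); i++) {
            totalMinutes += Duration.between(commits.get(i), deploys.get(i)).toMinutes();
        }
        return (double) totalMinutes / commits.size();
    }

    // Change failure rate: % of deployments that caused an incident.
    public static double changeFailureRatePercent(int failedDeploys, int totalDeploys) {
        return 100.0 * failedDeploys / totalDeploys;
    }
}
```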
→ Deep Dive: DORA Metrics — Deployment frequency, lead time, change failure rate, MTTR measurement and improvement