08: Observability
The Three Pillars
1. Metrics (What?)
Numeric measurements over time.
Examples:
- CPU usage: 75%
- Memory: 512MB / 1GB
- Request latency: 45ms
- Error rate: 0.5%
- HTTP requests/sec: 1000
Tool: Prometheus (time-series database)
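Services expose metrics to Prometheus over HTTP in a simple text format. A minimal sketch of that exposition format is below; real services typically use a client library such as `prometheus_client` rather than formatting lines by hand.

```python
# Minimal sketch of the Prometheus text exposition format served at
# /metrics. Illustrative only -- client libraries handle this for you.

def render_metrics(metrics):
    """metrics: list of (name, labels_dict, value) tuples."""
    lines = []
    for name, labels, value in metrics:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

print(render_metrics([
    ("http_requests_total", {"method": "GET", "status": "200"}, 1000),
    ("process_cpu_usage", {}, 0.75),
]))
# http_requests_total{method="GET",status="200"} 1000
# process_cpu_usage 0.75
```

Prometheus scrapes this endpoint on an interval and stores each line as a labeled time series.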
2. Logs (Why?)
Discrete events with context.
Examples:
2024-01-15T10:23:45Z ERROR Database connection failed: timeout
2024-01-15T10:23:46Z INFO Retrying connection (attempt 2/5)
2024-01-15T10:23:47Z INFO Connection established to postgres:5432
Tool: Loki (log aggregation)
3. Traces (How?)
Request path across microservices.
Example:
Request to API Server
├─ Auth Service (5ms)
├─ Database Query (20ms)
├─ Cache Lookup (2ms)
└─ Response (2ms)
Total: 29ms
Tool: Jaeger (distributed tracing)
Prometheus (Metrics)
Architecture
/metrics"] -->|scrape| B["Prometheus
(TSDB)"] C["Container B
/metrics"] -->|scrape| B D["Container C
/metrics"] -->|scrape| B B -->|query| E["Grafana
(Dashboard)"] B -->|query| F["AlertManager
(Alerts)"] style B fill:#f3e5f5 style E fill:#e3f2fd style F fill:#ffe0b2
Setup
prometheus.yml (configuration):
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
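The last relabel rule is the least obvious: Prometheus joins the source labels with `;`, matches the regex, and rewrites the scrape address to use the port from the `prometheus.io/port` annotation. A sketch of the same transformation in plain Python:

```python
# Sketch of the final relabel rule above: rewrite the scrape address to
# host:<annotation port>, stripping any port already present.
# Illustrative only; Prometheus applies this internally.
import re

def relabel_address(address, annotation_port):
    source = f"{address};{annotation_port}"        # source_labels joined by ';'
    m = re.fullmatch(r"([^:]+)(?::\d+)?;(\d+)", source)
    return f"{m.group(1)}:{m.group(2)}"            # replacement: $1:$2

print(relabel_address("10.0.0.5:8080", "9102"))    # 10.0.0.5:9102
print(relabel_address("10.0.0.5", "9102"))         # 10.0.0.5:9102
```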
Querying (PromQL)
# Current CPU usage
rate(container_cpu_usage_seconds_total[5m])
# Request latency p95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate (fraction of requests that returned 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Memory usage
container_memory_usage_bytes / (1024 * 1024) # In MB
# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])
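Most of these queries lean on `rate()`, which turns a monotonically increasing counter into a per-second rate over a window. A simplified sketch of the idea (the real PromQL `rate()` also extrapolates to window boundaries and handles counter resets):

```python
# Sketch of what rate() computes: per-second increase of a counter over
# a time window. Simplified -- real rate() extrapolates and handles
# counter resets.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Counter went from 1000 to 1300 requests over 60s -> 5 req/s
print(simple_rate([(0, 1000), (30, 1150), (60, 1300)]))  # 5.0
```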
Grafana (Dashboards)
Visualize Prometheus metrics.
Panel 1: CPU Usage Panel 2: Memory Usage
├─ Pod A: 45% ├─ Pod A: 450MB / 1GB
├─ Pod B: 52% ├─ Pod B: 380MB / 1GB
└─ Pod C: 38% └─ Pod C: 290MB / 1GB
Panel 3: Request Latency Panel 4: Error Rate
└─ 50th: 45ms └─ 0.5%
95th: 120ms
99th: 250ms
Alerting Rule:
groups:
  - name: api_alerts
    rules:
      - alert: HighCPUUsage
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} has high CPU > 80%"
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        annotations:
          summary: "Error rate > 1%"
Loki (Log Aggregation)
Lightweight log aggregation (Prometheus-like for logs).
Architecture
Promtail (agent on nodes)
├─ Reads logs from files
├─ Adds labels
└─ Sends to Loki
    └─ Loki
        ├─ Stores compressed logs
        └─ Queryable via Grafana
Configuration (promtail-config.yaml):
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace_name]
        target_label: namespace
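The effect of those relabel rules is that every log stream Promtail pushes carries labels derived from pod metadata, which is what makes `{pod="api"}`-style queries possible. A sketch of that mapping (field names mirror the config above, not Promtail's internal API):

```python
# Sketch of what the relabel_configs produce: pod metadata becomes
# labels attached to each log stream pushed to Loki. Illustrative only.

def build_stream_labels(pod_meta):
    relabel_rules = [
        ("__meta_kubernetes_pod_name", "pod"),
        ("__meta_kubernetes_namespace_name", "namespace"),
    ]
    return {target: pod_meta[source]
            for source, target in relabel_rules if source in pod_meta}

labels = build_stream_labels({
    "__meta_kubernetes_pod_name": "api-7d4b9c",
    "__meta_kubernetes_namespace_name": "production",
})
print(labels)  # {'pod': 'api-7d4b9c', 'namespace': 'production'}
```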
Querying (LogQL)
# All logs from api pod
{pod="api"}
# ERROR logs from api
{pod="api"} |= "ERROR"
# 95th-percentile latency, extracted from plain-text log lines
quantile_over_time(0.95,
  {job="api"} | pattern `<_> <_> <_> <response_time>ms` | unwrap response_time [5m])
Jaeger (Distributed Tracing, Optional)
Track request across microservices.
Example Trace:
Request ID: abc123
API Service (10ms)
├─ Receive request (1ms)
├─ Auth service call (3ms) ← Separate span
│ └─ Auth service (2ms)
├─ Database call (5ms) ← Separate span
│ └─ Postgres (5ms)
└─ Response (1ms)
Instrumentation:
from jaeger_client import Config

config = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},  # sample every request
        'local_agent': {'reporting_host': 'jaeger', 'reporting_port': 6831},
    },
    service_name='api-service',
)
tracer = config.initialize_tracer()

with tracer.start_active_span('process-request') as scope:
    # Span automatically tracked: start time, duration, parent/child links
    db_result = query_database()
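Under the hood, a span is little more than an operation name, a start timestamp, and a duration, linked to its parent. A toy context manager (not Jaeger's implementation) showing what nesting spans records:

```python
# Toy span recorder, NOT Jaeger -- just shows what a span captures:
# operation name and wall-clock duration, with nesting via `with` blocks.
import time
from contextlib import contextmanager

SPANS = []  # finished spans, innermost first

@contextmanager
def span(name):
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append((name, time.monotonic() - start))

with span("process-request"):
    with span("db-query"):
        time.sleep(0.01)  # stand-in for real work

for name, duration in SPANS:
    print(f"{name}: {duration * 1000:.1f}ms")
```

A real tracer additionally propagates a trace ID across service boundaries (usually via HTTP headers) so spans from different services join into one trace.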
Stack Deployment (Kubernetes)
# 1. Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# 2. Prometheus ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes'
        kubernetes_sd_configs:
          - role: pod
---
# 3. Prometheus Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
---
# 4. Prometheus Service
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
---
# 5. Grafana Deployment (similar structure)
---
# 6. Loki Deployment (similar structure)
Common Dashboards
Node Metrics
- CPU usage per node
- Memory usage per node
- Disk usage
- Network I/O
Pod Metrics
- CPU per pod
- Memory per pod
- Restart count
- Network bytes in/out
Application Metrics
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
Anti-Patterns
❌ No Monitoring
"If nobody's looking, there's no problem"
✅ Instrument everything — metrics, logs, traces
❌ Too Many Alerts
"Alert fatigue" from 1000 triggered alerts
✅ Alert on symptoms, not noise
BAD: Alert if CPU > 60% (always firing)
GOOD: Alert if CPU > 90% for 10m (actionable)
❌ Not Using Logs
kubectl logs pod-name # One pod at a time
✅ Centralize logs — query across all pods
loki query: {namespace="production"} |= "ERROR"
Interview Questions
Q: What are the three pillars of observability?
A: Metrics (what), Logs (why), Traces (how). Together they provide complete system visibility.
Q: When would you use Prometheus vs. Loki?
A: Prometheus for numeric metrics (CPU, latency). Loki for logs (text events). Use together for complete picture.
Q: What's a good alert strategy?
A: Alert on symptoms (slow response time), not noise (high CPU). Should be actionable (not firing constantly).
Key Takeaways
✅ Metrics = quantitative measurement (Prometheus)
✅ Logs = discrete events (Loki)
✅ Traces = request path across services (Jaeger)
✅ Grafana visualizes Prometheus metrics
✅ Alerting = automated notification on thresholds
✅ Centralized observability beats grepping logs