Observability & Debugging — Microservices Interview
Target: Senior Engineer · Engineering Lead · Pre-Architect
Focus: Centralized logging, distributed tracing, metrics, SLOs
Q: How do you implement centralized logging across dozens of microservices?
Why interviewers ask this: Distributed logging is foundational to debugging. Tests understanding of log aggregation, correlation, and searchability at scale.
Answer
Logging stack:
Services (stdout)
↓ (JSON structured logs)
Filebeat/Logstash (collect & parse)
↓
Elasticsearch (index & store)
↓
Kibana (search & visualize)
Spring Boot structured logging:
import org.slf4j.MDC;
import static net.logstash.logback.argument.StructuredArguments.kv;

@Slf4j
@RestController
@RequiredArgsConstructor
public class OrderController {

    private final OrderService orderService;

    @PostMapping("/orders")
    public Order createOrder(@RequestBody OrderRequest req) {
        // Sleuth/OTel populate the MDC with the current trace context per request
        String traceId = MDC.get("traceId");
        String spanId = MDC.get("spanId");
        log.info("order.created",
                kv("traceId", traceId),
                kv("spanId", spanId),
                kv("customerId", req.getCustomerId()),
                kv("amount", req.getAmount())
        );
        return orderService.create(req);
    }
}
Logback config with JSON:
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder">
    <fieldNames>
      <timestamp>@timestamp</timestamp>
      <version>@version</version>
      <message>message</message>
      <loggerName>logger_name</loggerName>
    </fieldNames>
  </encoder>
</appender>
Elasticsearch query:
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "traceId": "123abc" } },
        { "match": { "level": "ERROR" } }
      ]
    }
  }
}
Architect Insight
Include correlation IDs (trace ID) in every log. This is the single most important thing for distributed debugging.
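If a service is not yet auto-instrumented, a small servlet filter can copy (or generate) the correlation ID into the MDC so every log line picks it up. A minimal sketch, assuming a Spring Boot 2.x / javax.servlet stack; the CorrelationIdFilter name is illustrative, and Sleuth/OpenTelemetry do this automatically when present:

import java.io.IOException;
import java.util.UUID;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CorrelationIdFilter extends OncePerRequestFilter {

    private static final String TRACE_HEADER = "X-Trace-ID";

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // Reuse the caller's trace ID if present, otherwise start a new one
        String traceId = request.getHeader(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString();
        }
        MDC.put("traceId", traceId);
        response.setHeader(TRACE_HEADER, traceId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("traceId"); // never leak IDs across pooled threads
        }
    }
}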
Q: How do you implement distributed tracing across service boundaries?
Answer
Trace propagation:
Client Request
  ↓ Generate traceId="abc123", spanId="span1"
  ↓ Pass in headers: X-Trace-ID: abc123, X-Span-ID: span1
Service A
  ↓ Extract headers, create child span
  ↓ Call Service B, passing its own spanId as the parent
Service B
  ↓ Extract headers, record its work in a new child span
Each service
  ↓ Reports its finished spans asynchronously to the collector (Jaeger)
Jaeger
  ↓ Stitches spans together by traceId, visualizes the flow and latency breakdown
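Propagation can also be wired by hand with a client interceptor that copies the current trace context onto outgoing calls. A minimal sketch using the custom X-Trace-ID / X-Span-ID headers from the diagram above; Sleuth and OpenTelemetry instrument RestTemplate/WebClient automatically, so this is purely illustrative:

import java.io.IOException;
import org.slf4j.MDC;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;

// Copies the caller's trace context from the MDC onto every outgoing request
public class TracePropagationInterceptor implements ClientHttpRequestInterceptor {

    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                        ClientHttpRequestExecution execution) throws IOException {
        String traceId = MDC.get("traceId");
        String spanId = MDC.get("spanId");
        if (traceId != null) {
            request.getHeaders().add("X-Trace-ID", traceId);
        }
        if (spanId != null) {
            // The callee treats our span as the parent of the span it creates
            request.getHeaders().add("X-Span-ID", spanId);
        }
        return execution.execute(request, body);
    }
}

// Registration (illustrative):
// restTemplate.getInterceptors().add(new TracePropagationInterceptor());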
Spring Cloud Sleuth + Jaeger:
<!-- pom.xml -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
<!-- Reports spans in Zipkin format; Jaeger's collector accepts them on its Zipkin-compatible endpoint -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
Headers automatically added by Sleuth (B3 propagation by default):
X-B3-TraceId: abc123
X-B3-SpanId: span1
X-B3-ParentSpanId: span0
X-B3-Sampled: 1
Sampling strategy:
spring:
  sleuth:
    sampler:
      probability: 0.1   # sample 10% of traces (1% in prod)
      rate: 100          # or sample at most 100 traces/sec
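The same sampling decision can be made programmatically; a minimal sketch assuming the Brave tracer that backs Sleuth by default (a Sampler bean overrides the YAML probability):

import brave.sampler.Sampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TracingConfig {

    // Probabilistic sampler: keep 1.0 only outside production
    @Bean
    public Sampler defaultSampler() {
        return Sampler.create(0.1f); // sample 10% of traces
    }
}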
Common Mistake
Sampling everything (probability: 1.0) in production. Keep 1.0 in dev, but drop to 1-10% in production or you'll overwhelm Jaeger with span data.
Q: What metrics should you monitor for microservices health?
Answer
Golden Signals (essential):
| Signal | Target | Alert If |
|---|---|---|
| Latency | p50 < 200ms, p99 < 1s | p99 > 2s for 5+ min |
| Errors | < 0.1% error rate | Error rate > 1% |
| Saturation | CPU < 70% | CPU > 85% or memory > 90% |
| Traffic | Requests/sec tracked against baseline | Unexpected drop (outage) or spike |
Spring Boot + Prometheus:
@Bean
public MeterBinder orderMetrics(OrderRepository repo) {
    return registry -> {
        // Business-level gauge: current number of orders in the system
        Gauge.builder("orders.total", repo, OrderRepository::count)
                .description("Total orders in system")
                .register(registry);
        // Latency timer publishing p50/p95/p99 percentiles
        Timer.builder("order.create.time")
                .description("Time to create order")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
    };
}
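Registering the timer only creates the metric; the service still has to record into it. A brief usage sketch; the OrderService shape and MeterRegistry injection are illustrative:

@Service
@RequiredArgsConstructor
public class OrderService {

    private final MeterRegistry meterRegistry;
    private final OrderRepository orderRepository;

    public Order create(OrderRequest req) {
        // record() times the supplier and feeds the order.create.time percentiles
        return meterRegistry.timer("order.create.time")
                .record(() -> orderRepository.save(new Order(req)));
    }
}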
Prometheus scrape config:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']
Alerting rules:
groups:
  - name: microservices
    rules:
      - alert: HighErrorRate
        # 5xx responses as a fraction of all requests
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.01
        for: 5m
        annotations:
          summary: "Error rate > 1% for {{ $labels.service }}"
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        annotations:
          summary: "Pod {{ $labels.pod }} restarting frequently"
Q: How do you implement effective alerting without alert fatigue?
Answer
Alert design principles:
Good alerts:
- Based on SLO (Service Level Objective)
- Actionable (on-call knows what to do)
- Signal-to-noise > 90% (low false positives)
Bad alerts:
- CPU > 50% (constantly triggers)
- 20 pages per hour (each one becomes meaningless)
- "Something is wrong" with no context
SLO-based alerting:
SLO: 99.9% availability = 99.9% of requests succeed
     → error budget = 0.1% of requests (~43 minutes of downtime per month)
Alert if:
- Error rate > 1% for 5+ minutes (burns the budget too fast)
- Error rate > 0.5% for 1+ hour (slow, sustained burn)
Don't alert on:
- A handful of failed requests in a single second (transient noise)
- CPU = 60% (arbitrary threshold, not tied to the SLO)
Severity levels:
CRITICAL (page on-call):
- Service completely down
- Error rate > 5%
- Database unavailable
WARNING (create ticket):
- Error rate 1-5%
- Latency p99 > 5s
- Disk usage > 80%
INFO (log only):
- Non-actionable warnings
- Deprecation notices
Diagram — Observability Stack
graph LR
Svc["Microservices\n· Logs\n· Metrics\n· Traces"]
Collect["Collection\n· Filebeat\n· OpenTelemetry"]
Store["Storage\n· Elasticsearch\n· Prometheus\n· Jaeger"]
Query["Query & Alert\n· Kibana\n· Grafana\n· AlertManager"]
OnCall["On-Call\n· Page engineer\n· Context provided"]
Svc -->|Emit| Collect
Collect -->|Send| Store
Store -->|Query| Query
Query -->|Alert| OnCall
style Collect fill:#4ecdc4
style Store fill:#51cf66
style Query fill:#ffe066