# Metrics — Deep Dive

**Level:** Intermediate
**Pre-reading:** 07 · Observability
## Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | Total HTTP requests |
| Gauge | Current snapshot value | Active connections |
| Histogram | Distribution of values | Request latency buckets |
| Summary | Pre-calculated quantiles | p50, p95, p99 latency |
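On the wire, these four types look like this in Prometheus's text exposition format (metric names and values below are illustrative):

```text
# TYPE http_requests_total counter
http_requests_total{status="200"} 1027

# TYPE active_connections gauge
active_connections 42

# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 850
http_request_duration_seconds_bucket{le="0.5"} 990
http_request_duration_seconds_bucket{le="+Inf"} 1000
http_request_duration_seconds_sum 73.5
http_request_duration_seconds_count 1000

# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.5"} 0.031
rpc_duration_seconds{quantile="0.99"} 0.92
rpc_duration_seconds_sum 412.7
rpc_duration_seconds_count 14000
```

Note that a histogram exposes cumulative bucket counts plus `_sum` and `_count`, while a summary exposes pre-computed quantiles directly.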
## RED Method (Service Health)

| Metric | What It Measures | Query |
|---|---|---|
| Rate | Requests per second | `rate(http_requests_total[5m])` |
| Errors | Error rate (fraction of requests) | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Latency distribution | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
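The RED queries above are often precomputed as Prometheus recording rules so dashboards and alerts stay fast. A sketch (the rule names follow the common `level:metric:operations` naming convention, and the `job` label is an assumption about your scrape config):

```yaml
groups:
  - name: red-recording
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))
```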
## USE Method (Resource Health)
| Metric | What It Measures |
|---|---|
| Utilization | How busy (%) |
| Saturation | Queued work |
| Errors | Resource errors |
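Concretely, the USE signals for CPU might be queried like this (metric names assume a standard node_exporter install):

```promql
# Utilization: fraction of CPU time spent non-idle
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average per CPU (> 1 means queued work)
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: e.g. network receive errors
rate(node_network_receive_errs_total[5m])
```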
## Prometheus Metrics

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {

    private final Counter ordersPlaced = Counter.build()
            .name("orders_placed_total")
            .help("Total orders placed")
            .labelNames("status", "payment_method")
            .register();

    private final Histogram orderLatency = Histogram.build()
            .name("order_processing_seconds")
            .help("Order processing time")
            .buckets(0.1, 0.25, 0.5, 1, 2.5, 5, 10)
            .register();

    public void recordOrder(String status, String paymentMethod, double durationSeconds) {
        ordersPlaced.labels(status, paymentMethod).inc();
        orderLatency.observe(durationSeconds);
    }
}
```
## Micrometer (Spring Boot)

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final MeterRegistry meterRegistry;
    private final Timer orderTimer;

    public OrderService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderTimer = Timer.builder("order.processing")
                .description("Order processing time")
                .register(meterRegistry);
    }

    public Order placeOrder(OrderRequest request) {
        // record() times the lambda and publishes the duration to the timer
        return orderTimer.record(() -> processOrder(request));
    }
}
```
## Key Metrics to Track
| Category | Metrics |
|---|---|
| HTTP | Request rate, error rate, latency (p50, p95, p99) |
| Database | Connection pool size, query time, errors |
| JVM | Heap usage, GC time, thread count |
| Custom | Business metrics (orders/min, revenue) |
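For the JVM row, Micrometer's Prometheus registry exposes metrics with names like the following (exact names can vary by Micrometer version):

```promql
# Heap utilization
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"})

# Mean GC pause over the last 5 minutes
rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])

# Live threads
jvm_threads_live_threads
```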
## Alerting Rules

```yaml
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
```
## Counter vs Gauge — when to use which?

Use a counter for cumulative values that only increase (requests served, errors); counters reset to zero on process restart, which `rate()` compensates for. Use a gauge for values that go up and down (active connections, temperature, queue depth). With counters, always query via `rate()` or `increase()` to get a per-second rate; the raw cumulative value is rarely useful on its own.
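A minimal sketch of what `rate()` does with two counter samples, including counter-reset handling (the class and method names here are illustrative, not a real client API):

```java
public class CounterRate {

    /** Per-second increase between two counter samples taken `seconds` apart. */
    static double rate(double earlier, double later, double seconds) {
        // A counter only ever increases, so a drop means the process
        // restarted and the counter reset to zero; in that case the
        // later value itself is the increase since the reset.
        double increase = later >= earlier ? later - earlier : later;
        return increase / seconds;
    }

    public static void main(String[] args) {
        System.out.println(rate(1000, 1300, 300)); // 300 requests in 300 s -> 1.0/s
        System.out.println(rate(1000, 50, 300));   // reset: counts only the 50 since restart
    }
}
```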
## Why use histograms over summaries?

Histograms can be aggregated across instances; summaries cannot, because quantiles are not additive. Histograms let you calculate percentiles server-side with `histogram_quantile()`. Use histograms for most cases; reach for summaries only when you need exact quantiles on a single instance.
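To see why server-side percentiles work with histograms, here is a sketch of the linear interpolation that `histogram_quantile()` performs over cumulative bucket counts (a simplification of the real implementation, assuming values are uniformly distributed within a bucket):

```java
public class HistogramQuantile {

    /**
     * q in (0, 1]; upperBounds are the bucket "le" boundaries;
     * cumCounts are the cumulative observation counts per bucket.
     */
    static double quantile(double q, double[] upperBounds, double[] cumCounts) {
        double total = cumCounts[cumCounts.length - 1];
        double rank = q * total;

        int i = 0;
        while (cumCounts[i] < rank) i++;          // first bucket reaching the rank

        double lower    = i == 0 ? 0 : upperBounds[i - 1];
        double below    = i == 0 ? 0 : cumCounts[i - 1];
        double inBucket = cumCounts[i] - below;

        // Linear interpolation inside the bucket
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] le  = {0.1, 0.25, 0.5, 1.0};
        double[] cum = {50, 80, 95, 100};         // 100 observations total
        // p99 falls in the 0.5–1.0 bucket and interpolates to ~0.9
        System.out.println(quantile(0.99, le, cum));
    }
}
```

Because bucket counts are plain counters, they can be summed across instances before this calculation, which is exactly what `histogram_quantile(0.99, sum(rate(...)) by (le))` does.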
## What percentiles should you track for latency?

At minimum: p50 (median), p95, and p99. p99 captures the tail latency that a meaningful fraction of requests still hit. Consider p99.9 for critical paths. Don't rely on averages; they hide outliers.
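To make the last point concrete, a small sketch comparing the mean against nearest-rank percentiles on a synthetic latency sample:

```java
import java.util.Arrays;

public class LatencyPercentiles {

    /** Nearest-rank percentile over an already-sorted sample. */
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // 95 requests at 20 ms, plus 5 slow outliers at 5000 ms
        double[] ms = new double[100];
        Arrays.fill(ms, 0, 95, 20);
        Arrays.fill(ms, 95, 100, 5000);
        Arrays.sort(ms);

        double mean = Arrays.stream(ms).average().orElse(0);
        System.out.println("mean = " + mean);                // 269.0 ms: looks acceptable
        System.out.println("p50  = " + percentile(ms, 50));  // 20.0 ms
        System.out.println("p99  = " + percentile(ms, 99));  // 5000.0 ms: the tail the mean hides
    }
}
```

The mean of 269 ms suggests a healthy service, while the p99 reveals that 5% of users wait five seconds.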