Metrics — Deep Dive

Level: Intermediate
Pre-reading: 07 · Observability


Metric Types

| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | Total HTTP requests |
| Gauge | Current snapshot value | Active connections |
| Histogram | Distribution of values | Request latency buckets |
| Summary | Pre-calculated quantiles | p50, p95, p99 latency |

RED Method (Service Health)

| Metric | What It Measures | Query |
|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Error rate (%) | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) |
| Duration | Latency distribution | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |

USE Method (Resource Health)

| Metric | What It Measures |
|---|---|
| Utilization | How busy the resource is (%) |
| Saturation | Amount of queued work |
| Errors | Resource error count |
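
As an illustration, USE-style queries for a host running node_exporter might look like the following (the metric names assume node_exporter; adapt them to whatever exporters you run):

```promql
# Utilization: fraction of CPU time spent non-idle, averaged across cores
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average relative to the number of cores
node_load1 / count(node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])
```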

Prometheus Metrics

import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {

    private final Counter ordersPlaced = Counter.build()
        .name("orders_placed_total")
        .help("Total orders placed")
        .labelNames("status", "payment_method")
        .register();

    private final Histogram orderLatency = Histogram.build()
        .name("order_processing_seconds")
        .help("Order processing time")
        .buckets(0.1, 0.25, 0.5, 1, 2.5, 5, 10)
        .register();

    /** Records one order; duration is in seconds, matching the histogram bucket units. */
    public void recordOrder(String status, String paymentMethod, double duration) {
        ordersPlaced.labels(status, paymentMethod).inc();
        orderLatency.observe(duration);
    }
}

Micrometer (Spring Boot)

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final MeterRegistry meterRegistry;
    private final Timer orderTimer;

    public OrderService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderTimer = Timer.builder("order.processing")
            .description("Order processing time")
            .register(meterRegistry);
    }

    public Order placeOrder(OrderRequest request) {
        return orderTimer.record(() -> {
            // Process order
            return processOrder(request);
        });
    }
}
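
With Spring Boot, Prometheus scrapes metrics from the Actuator endpoint. A minimal application.yml sketch, assuming the micrometer-registry-prometheus dependency is on the classpath:

```yaml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health
```

Prometheus can then scrape /actuator/prometheus.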

Key Metrics to Track

| Category | Metrics |
|---|---|
| HTTP | Request rate, error rate, latency (p50, p95, p99) |
| Database | Connection pool size, query time, errors |
| JVM | Heap usage, GC time, thread count |
| Custom | Business metrics (orders/min, revenue) |

Alerting Rules

groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s"

Counter vs Gauge — when to use which?

Use a counter for cumulative values that only ever increase (total requests, total errors); counters reset to zero on process restart, and rate() compensates for resets automatically. Use a gauge for values that can go up and down (active connections, temperature, queue depth). Apply rate() to a counter to turn its cumulative total into a per-second rate.
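
As a sketch of what rate() does (illustrative code, not a Prometheus API), the per-second rate between two counter samples, including the reset case, can be computed like this:

```java
public class CounterRate {

    /**
     * Per-second rate between two counter samples taken dtSeconds apart.
     * If the later sample is smaller, the counter must have reset (e.g.
     * a process restart), so the later value alone approximates the
     * increase; this mirrors the simplification PromQL's rate() applies.
     */
    public static double rate(double earlier, double later, double dtSeconds) {
        double increase = (later >= earlier) ? later - earlier : later;
        return increase / dtSeconds;
    }
}
```

For samples 100 and 160 taken 15 s apart this yields 4 requests/s; if a restart drops the counter from 500 to 20, the 20 observed after the reset is treated as the whole increase.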

Why use histograms over summaries?

Histograms can be aggregated across instances; summaries cannot, because quantiles computed on separate instances cannot be meaningfully combined. Histograms also let you calculate percentiles server-side with histogram_quantile(). Prefer histograms for most cases; reach for summaries only when you need exact quantiles and a single instance is sufficient.
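
A simplified sketch of how histogram_quantile() estimates a percentile from cumulative bucket counts (illustrative code, not the actual Prometheus implementation): find the bucket containing the target rank, then interpolate linearly within it.

```java
public class HistogramQuantile {

    /**
     * Estimate quantile q (0..1) from a classic histogram.
     * upperBounds and cumulativeCounts are parallel arrays; counts are
     * cumulative, as in Prometheus's _bucket series with le labels.
     */
    public static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        double rank = q * total;
        double prevBound = 0, prevCount = 0;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double bucketCount = cumulativeCounts[i] - prevCount;
                if (bucketCount == 0) return upperBounds[i];
                // Linear interpolation within the bucket that holds the rank
                return prevBound + (upperBounds[i] - prevBound) * (rank - prevCount) / bucketCount;
            }
            prevBound = upperBounds[i];
            prevCount = cumulativeCounts[i];
        }
        return upperBounds[upperBounds.length - 1];
    }
}
```

Note the result is an estimate: accuracy depends on how well your bucket boundaries bracket the percentiles you care about.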

What percentiles should you track for latency?

At minimum: p50 (the median), p95, and p99. p99 is the latency that 99% of requests stay under, so it exposes the slow tail that averages hide. Consider p99.9 for critical paths. Don't rely on averages; they conceal outliers.
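
A minimal nearest-rank percentile sketch (illustrative code) showing how a slow tail moves p99 while barely touching the median:

```java
import java.util.Arrays;

public class Percentiles {

    /**
     * Nearest-rank percentile (p in 0..100) of a sample:
     * sort a copy, then take the value at rank ceil(p/100 * n).
     */
    public static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }
}
```

With 95 requests at 100 ms and 5 at 5 s, the mean is about 345 ms while p50 stays at 100 ms and p99 is the full 5 s; the average alone would hide that tail entirely.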