SLI, SLO, SLA — Deep Dive

Level: Intermediate
Pre-reading: 07 · Observability


Definitions

| Term | Definition | Example |
|------|------------|---------|
| SLI (Service Level Indicator) | The metric you measure | successful_requests / total_requests |
| SLO (Service Level Objective) | Target value for the SLI | 99.9% availability |
| SLA (Service Level Agreement) | Contractual commitment | 99.5% or 10% refund |

graph LR
    SLI[SLI: What we measure] --> SLO[SLO: What we target]
    SLO --> SLA[SLA: What we promise]

Error Budget

Error budget = 100% - SLO target

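| SLO | Error Budget | Allowed Downtime (30-day month) |
|-----|--------------|---------------------------------|
| 99.9% | 0.1% | ~43 minutes |
| 99.95% | 0.05% | ~22 minutes |
| 99.99% | 0.01% | ~4 minutes |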
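
The downtime figures follow from multiplying the budget fraction by the length of the window. A minimal sketch of that arithmetic, assuming a 30-day month; the function names are illustrative and not part of any library.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Return the error budget as a fraction, e.g. 99.9 -> 0.001."""
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of complete downtime the budget allows over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return error_budget_fraction(slo_percent) * window_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

A shorter or longer window (for example 28 days or a calendar quarter) changes the minutes but not the budget percentage.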

Error Budget Policy

"If we're burning error budget too fast, we freeze feature deployments and focus on reliability."

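One way to make such a policy mechanical is to compare how much budget has been spent with how much of the window has elapsed. The sketch below is an assumption about how a team might encode it: the three outcomes and the 2x threshold are hypothetical and should be tuned to your own policy.

```python
def deployment_policy(budget_consumed: float, window_elapsed: float) -> str:
    """Decide what to do with feature deployments.

    budget_consumed: fraction of the error budget already spent (1.0 = all of it).
    window_elapsed:  fraction of the SLO window that has passed (0.0 to 1.0).
    """
    if budget_consumed >= 1.0:
        return "freeze"  # budget exhausted: reliability work only
    if window_elapsed > 0 and budget_consumed / window_elapsed > 2.0:
        return "freeze"  # spending more than 2x faster than sustainable (assumed threshold)
    if budget_consumed > window_elapsed:
        return "caution"  # ahead of the sustainable pace, but not alarmingly so
    return "ship"


if __name__ == "__main__":
    # Halfway through the window with 20% of the budget spent.
    print(deployment_policy(budget_consumed=0.2, window_elapsed=0.5))
```

Writing the thresholds down before an incident happens keeps the freeze decision from becoming a negotiation.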

Common SLIs

| Category | SLI | Measurement |
|----------|-----|-------------|
| Availability | Success rate | successful_requests / total_requests |
| Latency | Response time | requests_under_500ms / total_requests |
| Throughput | Capacity | orders_processed / time |
| Correctness | Data accuracy | correct_responses / total_responses |
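
The availability and latency ratios can be computed directly from request records. This is a minimal sketch under stated assumptions: the `Request` type is hypothetical, "successful" means a 2xx status (matching the `status=~"2.."` selector used in the SLO example), and the latency threshold is 500 ms.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # server-side response time in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that returned a 2xx status."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(500, 30.0),   # fast but failed
    Request(201, 900.0),  # succeeded but slow
    Request(200, 650.0),
]

if __name__ == "__main__":
    print(f"availability: {availability_sli(sample):.2%}")
    print(f"latency:      {latency_sli(sample):.2%}")
```

The two ratios are independent: a fast error still counts against availability, and a slow success still counts against latency.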

Setting SLOs

| Factor | Consideration |
|--------|---------------|
| User expectations | What do users consider acceptable? |
| Dependencies | Your SLO can't realistically exceed the SLOs of hard dependencies on the request path |
| Cost | Higher reliability = higher cost |
| Current performance | What can you realistically achieve? |
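
The dependency constraint can be made concrete with a little arithmetic. The sketch below assumes every dependency is a hard, serial dependency (each request needs all of them) and that failures are independent; under those assumptions availabilities multiply. Retries, caching, and graceful degradation can do better than this estimate, and correlated failures change it. The function name and figures are illustrative.

```python
from math import prod
from typing import Iterable


def composite_availability(own: float, dependencies: Iterable[float]) -> float:
    """Estimated end-to-end availability, as a fraction.

    own:          availability of the service itself, ignoring dependencies.
    dependencies: availabilities of hard, serial dependencies.
    """
    return own * prod(dependencies)


if __name__ == "__main__":
    # Hypothetical figures: the service alone is 99.95% available and calls
    # a database at 99.99% and a payment API at 99.9%.
    estimate = composite_availability(0.9995, [0.9999, 0.999])
    print(f"estimated availability: {estimate:.4%}")
```

With these figures the estimate lands near 99.84%, so a 99.9% SLO would not be achievable without adding resilience around the weakest dependency.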

SLO Example

slos:
  order-service:
    - name: availability
      target: 99.9
      window: 30d
      sli:
        type: success_rate
        metric: http_requests_total
        good: status=~"2.."

    - name: latency
      target: 95
      window: 30d
      sli:
        type: latency_percentile
        percentile: 99
        threshold: 500ms

SLO Burn Rate Alerting

Instead of alerting on instant failures, alert on burn rate — how fast you're consuming error budget.

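| Burn Rate | Meaning (30-day SLO window) | Alert Window |
|-----------|-----------------------------|--------------|
| 14.4x | 2% of the budget consumed in 1 hour; budget exhausted in about 2 days | 5 min window |
| 6x | 5% of the budget consumed in 6 hours; budget exhausted in 5 days | 30 min window |
| 1x | Consuming at the expected rate; budget lasts exactly 30 days | 6 hour window |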
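
Burn rate is the observed error rate divided by the rate the SLO allows, so a burn rate of 1 spends the whole budget in exactly one SLO window. A minimal sketch of the resulting arithmetic, assuming a 30-day window; the function names are illustrative.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Fraction of the budget spent after burning at this rate for elapsed_hours."""
    return burn_rate * elapsed_hours / (window_days * 24)


if __name__ == "__main__":
    for rate, hours in ((14.4, 1), (6.0, 6), (1.0, 72)):
        print(
            f"{rate}x: {budget_consumed(rate, hours):.0%} of budget in {hours} h, "
            f"exhausted after {hours_to_exhaustion(rate):.0f} h"
        )
```

At 14.4x the budget lasts about 50 hours and 2% of it is spent in a single hour, which is why that rate is usually treated as page-worthy.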

Prometheus Alert

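- alert: SLOBurnRateHigh
  expr: |
    # Aggregate with sum() before dividing: dividing the raw vectors only
    # matches series with identical labels, so 5xx would be divided by 5xx.
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical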
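
A single short window reacts quickly but can fire on brief blips. A common refinement, described in Google's SRE Workbook as multiwindow burn-rate alerting, pages only when both a long window and a short window exceed the same threshold. The sketch below shows that logic in plain Python rather than PromQL; the function names, the choice of windows, and the request counts are assumptions for illustration.

```python
def error_ratio(errors: int, total: int) -> float:
    """Observed error ratio; treated as 0 when there is no traffic."""
    return errors / total if total else 0.0


def should_page(
    long_errors: int,
    long_total: int,
    short_errors: int,
    short_total: int,
    slo_percent: float = 99.9,
    burn_rate: float = 14.4,
) -> bool:
    """Fire only if both windows burn faster than the threshold.

    The long window (for example 1 hour) shows the burn is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    threshold = burn_rate * (100.0 - slo_percent) / 100.0  # about 0.0144 for the defaults
    return (
        error_ratio(long_errors, long_total) > threshold
        and error_ratio(short_errors, short_total) > threshold
    )


if __name__ == "__main__":
    # 2% errors over the long window and 2% over the short window.
    print(should_page(200, 10_000, 20, 1_000))
```

Requiring the short window as well lets the alert clear soon after the errors stop, instead of staying active until the long window drains.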

SLO Dashboard

| Panel | Content |
|-------|---------|
| Current SLI | Real-time success rate, latency |
| Error budget remaining | % of budget left |
| Burn rate | Current consumption rate |
| Historical | SLI over 30 days |
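
The "error budget remaining" and "burn rate" panels can both be derived from two counters: total requests and failed requests in the window. A minimal sketch, assuming a request-based SLO; the function names and sample counts are illustrative.

```python
def error_budget_remaining_percent(total: int, failed: int, slo_percent: float) -> float:
    """Percentage of the error budget still unspent; negative when overspent."""
    allowed_failures = total * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        # No traffic (or a 100% SLO): nothing can be spent without overspending.
        return 100.0 if failed == 0 else 0.0
    return 100.0 * (1.0 - failed / allowed_failures)


def current_burn_rate(total: int, failed: int, slo_percent: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    if budget_fraction <= 0:
        raise ValueError("SLO must be below 100%")
    if total == 0:
        return 0.0
    return (failed / total) / budget_fraction


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(f"budget remaining: {error_budget_remaining_percent(1_000_000, 250, 99.9):.1f}%")
    print(f"burn rate:        {current_burn_rate(1_000_000, 250, 99.9):.2f}x")
```

Showing a negative remainder, rather than clamping at zero, makes it visible on the dashboard how far past the budget a service has gone.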

What's the difference between SLO and SLA?

SLO is an internal target — what you aim for. SLA is an external commitment — what you promise customers with penalties. SLO should be tighter than SLA to give buffer. Example: SLO = 99.9%, SLA = 99.5%.

How do you set an appropriate SLO?

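(1) Measure current performance: you can't promise what you can't deliver. (2) Understand user expectations through surveys and support-ticket analysis. (3) Consider dependencies: you can't realistically exceed the SLOs of the services you depend on. (4) Balance cost against reliability: moving from 99.9% to 99.99% cuts the error budget tenfold (from ~43 to ~4 minutes per month), and each added nine usually costs far more to achieve. Start conservative; tighten as you improve.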

What is error budget and how does it drive decisions?

Error budget is the allowed unreliability (100% - SLO). If SLO is 99.9%, budget is 0.1% (~43 min/month). When budget is healthy: deploy features. When budget is depleting: freeze deployments, focus on reliability. It aligns product and SRE teams around shared goals.