SLI, SLO, SLA — Deep Dive

Level: Intermediate
Pre-reading: 07 · Observability

Definitions

Term	Definition	Example
SLI (Service Level Indicator)	The metric you measure	`successful_requests / total_requests`
SLO (Service Level Objective)	Target value for the SLI	99.9% availability
SLA (Service Level Agreement)	Contractual commitment, with penalties if missed	99.5% availability, or a 10% refund

graph LR
    SLI[SLI: What we measure] --> SLO[SLO: What we target]
    SLO --> SLA[SLA: What we promise]

Error Budget

Error budget = 100% - SLO target

SLO	Error Budget	Allowed Downtime (30-day month)
99.9%	0.1%	~43 minutes
99.95%	0.05%	~22 minutes
99.99%	0.01%	~4 minutes
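The downtime figures in the table follow directly from the error budget formula. A minimal sketch of that arithmetic, assuming a 30-day window (the function names are illustrative, not from any particular library):

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget as a fraction, e.g. 0.001 for a 99.9% SLO."""
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget allows over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return error_budget(slo_percent) * window_minutes


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

A shorter or longer window (for example 28 days, or a calendar quarter) shifts the absolute numbers but not the ratios between rows.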

Error Budget Policy

"If we're burning error budget too fast, we freeze feature deployments and focus on reliability."
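One way such a policy could be made mechanical is sketched below. The 25% caution threshold and the decision labels are assumptions chosen for illustration; a real policy would be agreed between the product and reliability teams.

```python
def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def deployment_decision(remaining: float, caution_below: float = 0.25) -> str:
    """Map remaining budget to a policy action (thresholds are hypothetical)."""
    if remaining <= 0.0:
        return "freeze"             # budget exhausted: stop feature deployments
    if remaining < caution_below:
        return "reliability-first"  # budget nearly gone: prioritize fixes
    return "ship"                   # budget healthy: deploy features


remaining = budget_remaining(99.9, total_requests=1_000_000, failed_requests=500)
print(f"{remaining:.0%} of budget left -> {deployment_decision(remaining)}")
```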

Common SLIs

Category	SLI	Measurement
Availability	Success rate	`successful_requests / total_requests`
Latency	Response time	`requests_under_500ms / total_requests`
Throughput	Capacity	`orders_processed / time`
Correctness	Data accuracy	`correct_responses / total_responses`
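The availability and latency ratios above can be computed from raw request records roughly as follows. This is a sketch: the `Request` type is hypothetical, and treating every non-5xx response as successful is an assumption (some teams count only 2xx as good).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def availability_sli(requests: Sequence[Request]) -> float:
    """successful_requests / total_requests, with non-5xx counted as success."""
    if not requests:
        return 1.0  # no traffic: nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of requests served faster than the threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120), Request(200, 480), Request(503, 900),
    Request(200, 750), Request(404, 60),
]
print(f"availability={availability_sli(sample):.2f} latency={latency_sli(sample):.2f}")
```

Both SLIs are expressed as good events over total events, which keeps them directly comparable to a percentage SLO target.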

Setting SLOs

Factor	Consideration
User expectations	What do users consider acceptable?
Dependencies	Without fallbacks, your SLO can't exceed the SLOs of the hard dependencies on the request path
Cost	Higher reliability = higher cost
Current performance	What can you realistically achieve?
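The dependency factor can be made concrete with a little arithmetic: if every request needs all dependencies to succeed and their failures are independent, the availabilities multiply. A minimal sketch under those assumptions (the service names and figures are hypothetical):

```python
import math
from typing import Iterable


def composite_availability(dependency_availabilities: Iterable[float]) -> float:
    """Availability ceiling when every dependency must succeed (independent failures)."""
    return math.prod(dependency_availabilities)


# Hypothetical hard dependencies of an order service.
deps = {"payment-gateway": 0.9995, "inventory-db": 0.9999, "auth-service": 0.999}
ceiling = composite_availability(deps.values())
print(f"Availability ceiling: {ceiling:.4%}")
```

The ceiling lands below the weakest single dependency, which is why caching, retries, or graceful degradation are needed before promising more than the dependencies do.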

SLO Example

slos:
  order-service:
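    - name: availability
      target: 99.9          # percent of requests that must be good
      window: 30d           # rolling evaluation window
      sli:
        type: success_rate
        metric: http_requests_total
        good: status=~"2.."  # only 2xx counts as good; 3xx/4xx count against the SLO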

    - name: latency
      target: 95
      window: 30d
      sli:
        type: latency_percentile
        percentile: 99
        threshold: 500ms

SLO Burn Rate Alerting

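Instead of alerting on instantaneous failures, alert on burn rate: how fast you're consuming error budget. Burn rate is the observed error rate divided by the error budget (1 - SLO), so 1x spends exactly the whole budget over the SLO window.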

Burn Rate	Meaning (30-day SLO window)	Alert Window
14.4x	2% of budget consumed in 1 hour; exhausted in ~50 hours	5 min window
6x	5% of budget consumed in 6 hours; exhausted in 5 days	30 min window
1x	Budget exhausted exactly at the end of the window	6 hour window
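For a 30-day SLO window, a sustained burn rate of B exhausts the budget in 30 days / B. The sketch below works through that arithmetic; the function names are illustrative only.

```python
def burn_rate(error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error budget."""
    budget = (100.0 - slo_percent) / 100.0
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone at a sustained burn rate."""
    return window_days * 24 / rate


def budget_consumed(rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent after burning for elapsed_hours."""
    return rate * elapsed_hours / (window_days * 24)


# A 1.44% error rate against a 99.9% SLO is a 14.4x burn.
rate = burn_rate(error_rate=0.0144, slo_percent=99.9)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in {hours_to_exhaustion(rate):.0f} hours")
print(f"budget spent after 1 hour: {budget_consumed(rate, 1):.1%}")
```

Fast burns get short alert windows because they threaten the budget within hours or days; slow burns can tolerate longer windows, which also reduces noise.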

Prometheus Alert

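- alert: SLOBurnRateHigh
  # Fires when the 5xx ratio over the last 5 minutes exceeds 14.4x the error budget.
  # sum() aggregates away labels such as status so both sides of the division
  # match; without it the ratio is computed per series and is always 1.
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical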
SLO Dashboard

Panel	Content
Current SLI	Real-time success rate, latency
Error budget remaining	% of budget left
Burn rate	Current consumption rate
Historical	SLI over 30 days

What's the difference between SLO and SLA?

An SLO is an internal target: what you aim for. An SLA is an external commitment: what you promise customers, backed by penalties such as refunds. Set the SLO tighter than the SLA so that an SLO miss warns you before you breach the contract. Example: SLO = 99.9%, SLA = 99.5%.

How do you set an appropriate SLO?

(1) Measure current performance: you can't promise what you can't deliver. (2) Understand user expectations through surveys and support-ticket analysis. (3) Account for dependencies: a hard dependency without a fallback caps your achievable availability. (4) Balance cost against reliability: 99.99% leaves one tenth of the error budget of 99.9%, and each extra nine typically costs disproportionately more. Start conservative; tighten as you improve.

What is error budget and how does it drive decisions?

Error budget is the allowed unreliability (100% - SLO). If SLO is 99.9%, budget is 0.1% (~43 min/month). When budget is healthy: deploy features. When budget is depleting: freeze deployments, focus on reliability. It aligns product and SRE teams around shared goals.
