Performance Targets & SLAs

What is an SLA?

An SLA (Service Level Agreement) is a commitment about how your system will perform.

Example SLA for e-commerce API:

"The system will respond to 99.9% of requests within 500ms p95 latency
during normal business hours (8am-11pm). Monthly uptime must exceed 99.95%."

Components:
├─ Success rate: 99.9% of requests succeed
├─ Latency target: 500ms at p95 percentile
├─ Window: Normal business hours
└─ Uptime: 99.95% monthly availability

Why SLAs Matter

  • For customers: Clear expectations about service quality
  • For engineers: Measurable target to design for
  • For ops: Alerting thresholds for SLA breaches
  • For business: Credibility, compliance, penalties if missed

Industry-Standard Targets

By Service Type

Web Applications (User-Facing)

Target SLAs:
├─ p95 latency: <300ms
├─ p99 latency: <1000ms
├─ p99.9 latency: <3000ms
├─ Success rate: >99.5%
└─ Uptime: >99.5% (3.6 hours downtime/month acceptable)

Examples:
├─ E-commerce checkout
├─ Social media feed
├─ Email inbox
└─ Video streaming

Mobile APIs

Target SLAs:
├─ p95 latency: <200ms
├─ p99 latency: <500ms
├─ p99.9 latency: <2000ms
├─ Success rate: >99.5%
└─ Uptime: >99.9% (43 minutes downtime/month)

Why stricter than web?
├─ Users on cellular (high latency sensitivity)
├─ Battery impact (slow response drains battery)
├─ UX expectations (instant response)
└─ Often critical workflows (payments, health)

Internal/Backend APIs

Target SLAs:
├─ p95 latency: <100ms
├─ p99 latency: <300ms
├─ p99.9 latency: <1000ms
├─ Success rate: >99.9%
└─ Uptime: >99.99% (4.3 minutes downtime/month)

Why stricter?
├─ Used by other critical services
├─ Cascading failures if slow
├─ Should be faster than public API
└─ Architectural assumption: internal calls are fast

Real-Time Systems (Trading, Gaming)

Target SLAs:
├─ p95 latency: <50ms
├─ p99 latency: <100ms
├─ p99.9 latency: <200ms
├─ Success rate: >99.99%
└─ Uptime: >99.99%

Why ultra-strict?
├─ Milliseconds = real financial impact
├─ Player experience depends on < 100ms latency
├─ Regulatory requirements (trading)
└─ Competitive disadvantage if slower

Batch/Background Processing

Target SLAs:
├─ Latency: <5-30 seconds (entire job)
├─ p99 latency: <30000ms
├─ Success rate: >99% (some retry OK)
└─ Uptime: >99%

Why more relaxed?
├─ User not waiting for response
├─ Can retry later if failed
├─ Throughput matters more than latency
└─ 1-2 minute delay is acceptable

Kafka/Message Streaming

Target SLAs:
├─ Producer latency (send to Kafka): <100ms p95
├─ Consumer lag (behind producer): <1000ms
├─ Success rate: >99.9% (messages not lost)
└─ Uptime: >99.99%

Why different metrics?
├─ Producer latency: How fast can you send messages
├─ Consumer lag: How far behind is processing
└─ Throughput: Messages per second matters most
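Consumer lag follows directly from the definition above: how far the committed consumer offset trails the partition's latest (log-end) offset. A minimal sketch, with illustrative names and offsets:

```python
# Consumer lag for one partition, sketched from the definition:
# messages produced but not yet processed.

def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    """How far the consumer's committed offset trails the log-end offset."""
    return log_end_offset - committed_offset

# Partition holds 10,500 messages; the consumer has committed 10,000
print(consumer_lag(10_500, 10_000))  # 500 messages behind
```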

Setting Your Own SLAs

Step 1: Understand Your Users

Question: What do YOUR users tolerate?

Research methods:
├─ Google Analytics: Bounce rate vs page load time
├─ User surveys: "Is 500ms acceptable?"
├─ Competitor analysis: What do others target?
└─ Industry data: Benchmarks for your domain

Google data:
├─ <1 second: Excellent experience
├─ 1-3 seconds: Acceptable
├─ >3 seconds: 40% bounce rate
└─ >5 seconds: 75% bounce rate

Step 2: Understand Your Constraints

Question: What can YOUR system realistically achieve?

Assessment:
├─ Current latency (baseline test now)
├─ Current infrastructure limits
├─ Team capability (can we optimize?)
├─ Timeline (when do we need to hit target?)
└─ Budget (can we scale infrastructure?)

Example:
├─ Current p95: 800ms (too high)
├─ Target p95: 300ms (ambitious)
├─ Gap: 500ms improvement needed
├─ Achievable via: Database optimization, caching, scaling
└─ Timeline: 3 months (reasonable for most improvements)

Step 3: Define SLA Components

Template:

Service: [Name]

Response Time:
├─ p95: [target]ms (95% of requests faster than this)
├─ p99: [target]ms (99% of requests faster than this)
└─ p99.9: [target]ms (for extreme cases)

Success Rate:
├─ Goal: >[percentage]% success
├─ Acceptable error types: [4xx, 5xx, timeout]
└─ Unacceptable: [data corruption, cascading failures]

Availability/Uptime:
├─ Target: [percentage]% uptime
├─ Calculation: (total_time - downtime) / total_time
├─ Allowable downtime: [X minutes per month]
└─ Excludes: Planned maintenance windows

Conditions:
├─ Time window: [peak hours, business hours, 24/7]
├─ Load assumption: [X concurrent users, Y RPS]
└─ Geography: [specific regions or global]

Example SLA:

Service: User Profile API

Response Time:
├─ p95: 200ms
├─ p99: 500ms
└─ p99.9: 1000ms

Success Rate:
├─ Goal: 99.9%
└─ Errors: <0.1%

Availability:
├─ Target: 99.95% uptime
└─ Allows: 22 minutes downtime/month

Conditions:
├─ Peak hours: 8am-11pm PT
├─ Load: 5000 RPS
└─ Regions: US-East primary, US-West for failover

Uptime Percentages Explained

Percentage   Downtime per Year   Downtime per Month
99%          87.6 hours          7.2 hours
99.5%        43.8 hours          3.6 hours
99.9%        8.76 hours          43 minutes
99.95%       4.38 hours          22 minutes
99.99%       52.56 minutes       4.3 minutes
99.999%      5.26 minutes        26 seconds
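The table values follow from simple arithmetic: allowed downtime is the window length times the unavailability fraction. A quick sketch (function name and 30-day month are illustrative choices, not from any library):

```python
# Allowed downtime for a given uptime target.

def allowed_downtime_minutes(uptime_pct: float, window_minutes: float) -> float:
    """Minutes of downtime permitted in a window at a given uptime %."""
    return window_minutes * (1 - uptime_pct / 100)

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month

# 99.9% -> ~43 minutes/month; 99.99% -> ~4.3 minutes/month (matches the table)
print(round(allowed_downtime_minutes(99.9, MINUTES_PER_MONTH), 1))   # 43.2
print(round(allowed_downtime_minutes(99.99, MINUTES_PER_MONTH), 1))  # 4.3
```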

Measurement & Validation

How to Measure Against SLAs

During load test:

1. Run the test at expected load for 15-30 minutes
2. Measure all metrics
3. Compare against SLA targets

Example:

SLA Target: p95 <500ms, success >99%
Test Result: p95 = 480ms, success = 99.2%

✅ PASS: Both metrics met
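The pass/fail comparison above can be sketched as a small helper. Names and structure are illustrative; the only logic is that latency targets are upper bounds while success rate is a lower bound:

```python
# Hypothetical helper: compare load-test results against SLA targets.

def check_sla(results: dict, targets: dict) -> dict:
    """Per-metric pass/fail. Latencies: lower is better;
    success_rate: higher is better."""
    verdict = {}
    for metric, target in targets.items():
        measured = results[metric]
        if metric == "success_rate":
            verdict[metric] = measured >= target
        else:
            verdict[metric] = measured <= target
    return verdict

# The example above: p95 target 500ms, success rate target 99%
print(check_sla({"p95_ms": 480, "success_rate": 99.2},
                {"p95_ms": 500, "success_rate": 99.0}))
# {'p95_ms': True, 'success_rate': True}
```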

When to Fail SLA

SLA Target: p95 <300ms

Test Result: p95 = 350ms
├─ 50ms over target
├─ Is this acceptable?
│  ├─ Marginal failure (close)
│  ├─ Worth investigating: Database slow by 50ms?
│  └─ Decision: Fix or adjust SLA
└─ Action: Optimize and re-test

Test Result: p95 = 1000ms
├─ 700ms over target
├─ Unacceptable
├─ Root cause: Database bottleneck (found via Datadog)
└─ Action: Fix bottleneck, re-test, retry SLA validation

Alerting on SLA Breaches

Set up Datadog alerts to notify you when the SLA is breached:

Example Alert:

Name: p95 latency exceeded SLA
Condition: avg:trace.web.request.duration{service:my-api}.percentile(95) > 500

Actions on breach:
├─ Send to Slack: "p95 latency is {{value}}ms (SLA: 500ms)"
├─ PagerDuty: "Page on-call engineer"
└─ Auto-remediation: "Scale service if CPU > 80%"

SLA vs Percentile Confusion

Many teams get confused about what their SLA really means.

❌ Bad SLA Definition

"Response time SLA: 500ms"
├─ Ambiguous: Mean? p95? p99?
├─ Confusing: Does 1% slow responses break SLA?
└─ Problem: Can't measure compliance

✅ Good SLA Definition

"Response time SLA: p95 < 500ms"
├─ Clear: 95th percentile latency
├─ Measurable: Easy to verify in Gatling/Datadog
├─ Acceptable: Allows 5% of requests to be slower
└─ Problem: If p95 = 501ms, SLA fails (strict boundary)

✅ Better SLA Definition

"Response time SLA: p95 < 500ms ± 10ms"
├─ Clear: 95th percentile, 500ms target
├─ Allowable variance: ±10ms (501-509ms is OK)
├─ Practical: Accounts for noise, natural variance
└─ Measurement: 510ms+ is considered breach
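The tolerance-band rule above (501-509ms is OK, 510ms+ is a breach) reduces to a one-line check. A minimal sketch with illustrative names:

```python
# Breach check with an explicit tolerance band (the "p95 < 500ms ± 10ms" idea).

def breaches_sla(measured_ms: float, target_ms: float,
                 tolerance_ms: float = 10) -> bool:
    """True once measured latency reaches target + tolerance."""
    return measured_ms >= target_ms + tolerance_ms

print(breaches_sla(501, 500))  # False: inside the tolerance band
print(breaches_sla(510, 500))  # True: 510ms+ counts as a breach
```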

Dynamic SLAs

Some teams set different SLAs for different scenarios:

E-commerce search API:

Normal hours (10am-9pm):
├─ p95: 300ms
├─ p99: 800ms
└─ Success: >99.5%

Peak hours (Black Friday):
├─ p95: 500ms (relaxed, more users)
├─ p99: 1500ms
└─ Success: >99% (up to 1% errors acceptable)

Off-peak (midnight-10am):
├─ p95: 200ms (strict, should be fast when not busy)
├─ p99: 500ms
└─ Success: >99.9%

Rationale:
├─ Peak hours: Users tolerate slower responses
├─ Off-peak: System has headroom, should be faster
└─ Normal: Balanced targets

Common SLA Mistakes

❌ Mistake 1: Mean Latency in SLA

Bad: "Average response time < 500ms"
├─ Hides outliers
├─ 1 slow request can pull up average
└─ Misleading metric

Better: "p95 latency < 500ms"
├─ Guarantees 95% of users have acceptable experience
└─ Doesn't hide the 5% who suffer
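The mean-vs-percentile difference is easy to demonstrate with a synthetic sample: 99 fast requests plus one outlier. The numbers are invented for illustration:

```python
# Why mean latency misleads: one slow request pulls the average up,
# while p95 still reflects what most users experienced.

latencies_ms = [100] * 99 + [5000]  # 99 fast requests, one 5-second outlier

mean = sum(latencies_ms) / len(latencies_ms)
# Simple nearest-rank p95 (index 94 of 100 sorted samples)
p95 = sorted(latencies_ms)[int(0.95 * len(latencies_ms)) - 1]

print(mean)  # 149.0 -- a single outlier inflated the average by ~50%
print(p95)   # 100  -- 95% of users were fast; the tail shows up at p100
```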

❌ Mistake 2: SLA Too Strict

Bad: "p99.99 latency < 100ms"
├─ Nearly impossible to achieve
├─ Costs 10x infrastructure
└─ Team burns out trying to hit it

Better: "p99 latency < 500ms"
├─ Achievable with good engineering
├─ Cost-effective
└─ Realistic targets

❌ Mistake 3: SLA Too Loose

Bad: "p95 latency < 10000ms"
├─ Users hate 10-second waits
├─ No incentive to optimize
└─ Business impact: Users leave

Better: "p95 latency < 300ms"
├─ Users happy
├─ Engineering incentive: Optimize
└─ Business aligned

❌ Mistake 4: Ignoring p99

Bad: Monitor only p95, ignore p99
├─ 5% of users experience terrible latency
├─ No visibility into tail behavior
└─ Surprised by outlier complaints

Better: Monitor p95, p99, and p99.9
├─ See full distribution
├─ Catch tail issues early
└─ Comprehensive understanding

Testing to Verify SLAs

Load Test Checklist

Before running test:
☐ SLAs defined (p95, p99, success rate)
☐ Test duration planned (15-30 min for baseline)
☐ Load level chosen (expected + 2-3x headroom)
☐ Assertions configured in Gatling
☐ Datadog monitoring enabled
☐ Team notified

Running test:
☐ Smoke test passes (1 user)
☐ System stable during ramp-up
☐ Metrics stable during hold phase
☐ No cascading failures observed
☐ Error rate stays acceptable

After test:
☐ p95 vs SLA target ✓ or ✗
☐ p99 vs SLA target ✓ or ✗
☐ Success rate vs SLA target ✓ or ✗
☐ Uptime: No downtime during test ✓
☐ Root cause analysis if failed
☐ Fix, optimize, and re-test if any target missed

SLA Targets Quick Reference

Service Type          p95 Target       p99 Target   p99.9 Target
User-Facing Web       <300ms           <1000ms      <3000ms
Mobile App API        <200ms           <500ms       <2000ms
Internal Service      <100ms           <300ms       <1000ms
Batch Processing      <5000ms          <30000ms     N/A
Real-time Analytics   <1000ms          <5000ms      <30000ms
Kafka Produce         <100ms           <500ms
Kafka Consumer Lag    <500ms per msg

Sample Datadog Monitors

High p99 Latency Alert

{
  "name": "High p99 Latency Alert",
  "type": "metric alert",
  "query": "avg:trace.web.request.duration{service:my-api,resource_name:/api/users}.percentile(99) > 1000",
  "threshold": 1000,
  "alert_message": "p99 latency exceeded 1 second"
}

Kafka Consumer Lag Alert

{
  "name": "Kafka Consumer Lag Alert",
  "type": "metric alert",
  "query": "avg:kafka.consumer_group.lag{group:my-consumer-group} > 10000",
  "threshold": 10000,
  "alert_message": "Consumer lag exceeded 10k messages"
}
