Key Metrics & Measurements
Introduction
Performance metrics tell you how well your system is actually performing. Without metrics, you're flying blind.
"You can't improve what you don't measure." โ Peter Drucker
This section covers the metrics you'll track during load tests and what they mean.
Latency Metrics (Response Time)
Latency is the time from when a request is sent until a response is received.
┌─ Client sends request (GET /api/users)
│
├─ Network transit: 5ms
├─ Server processes: 100ms
├─ Network return: 5ms
│
└─ Response received

Total latency = 110ms
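Client-observed latency is just wall-clock time measured around the call, so it includes both network legs plus server processing. A minimal sketch (the timed function here is an in-process stand-in for a real HTTP request):

```python
import time

def timed_call(fn):
    """Run fn() and return (result, elapsed_ms) using a monotonic clock."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Example: time an in-process stand-in for GET /api/users.
result, ms = timed_call(lambda: sum(range(1000)))
```

Note that `time.perf_counter()` is preferred over `time.time()` for intervals because it is monotonic and unaffected by clock adjustments.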
The Problem with Mean Latency
Many teams track mean (average) latency, but it's misleading:
Request 1: 50ms
Request 2: 60ms
Request 3: 55ms
Request 4: 70ms
Request 5: 5000ms  ← One slow request (cache miss, GC pause)
Request 6: 65ms
Request 7: 60ms
Request 8: 55ms
Request 9: 50ms
Request 10: 70ms
Mean latency = 5535ms ÷ 10 = 553.5ms  ← Hiding outliers!
Only 1 request was slow, but the mean suggests widespread problems.
Percentile Latencies (Correct Approach)
Percentiles tell you the distribution of response times:
P50 (Median)
- Definition: 50% of requests complete faster than this time
- Example: p50 = 100ms means half your requests are faster than 100ms
- Use case: Baseline performance; less useful alone
P95 (95th Percentile)
- Definition: 95% of requests complete within this time; 5% are slower
- Example: p95 = 300ms means 95% of requests finish within 300ms; the slowest 5% take longer
- Industry guideline:
- Web apps: target <500ms
- APIs: target <200ms
- Real-time systems: target <50ms
- User impact: Most users have good experience; 5% might notice slowness
P99 (99th Percentile)
- Definition: 99% of requests complete within this time; 1% are slower
- Example: p99 = 800ms means 99% of requests finish within 800ms; roughly 1 in 100 takes longer
- Industry guideline:
- Web apps: target <1000ms
- APIs: target <500ms
- User impact: Occasional users experience significant slowness
P99.9 (99.9th Percentile)
- Definition: 99.9% of requests complete within this time; 0.1% are slower
- Example: p99.9 = 3000ms means only 1 request in 1,000 takes longer than 3 seconds
- Use case: SLA compliance, extreme outliers
- User impact: Rare users experience very slow responses
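The mean-vs-percentile gap from the ten-request example above can be reproduced directly. This sketch uses the nearest-rank percentile definition; tools like Gatling may interpolate, so exact values can differ slightly:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of samples are <= it."""
    ordered = sorted(samples)
    rank = -(-len(ordered) * p // 100)  # ceiling division
    return ordered[max(int(rank), 1) - 1]

latencies = [50, 60, 55, 70, 5000, 65, 60, 55, 50, 70]  # ms, from the example

print(statistics.mean(latencies))   # 553.5 -- dominated by one outlier
print(percentile(latencies, 50))    # 60   -- the typical request
print(percentile(latencies, 95))    # 5000 -- the tail the mean hid
```

With only ten samples, p95 and p99 both land on the single 5000ms outlier, which is exactly the signal the mean buried.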
Latency Example from Real Test
Test: 10,000 requests at constant load
┌─ Percentile Distribution ─────────────────────────
│
│   Count
│       │     ╱╲
│  1000 │    ╱  ╲
│       │   ╱    ╲___
│   500 │  ╱         ╲
│       │ ╱           ╲
│       └╱─────────────╲___
│       └──────────────────────────── Latency (ms)
│         0  100  200  300  500  5000
│
├─ Key metrics:
│  ├─ Min:   45ms
│  ├─ p50:   100ms (half faster than 100ms)
│  ├─ p95:   300ms (95% faster than 300ms)
│  ├─ p99:   1000ms (99% faster than 1000ms)
│  ├─ p99.9: 3000ms (99.9% faster than 3000ms)
│  └─ Max:   5000ms
│
└─ ~100 requests (1%) took >1000ms (cache misses, GC pauses)
Why Percentiles Matter
Scenario: Your p95 = 1000ms (too high!)
Diagnosis options:
❌ Option 1: "Let's use mean to decide"
   Mean = 150ms (looks fine, misleading!)
✅ Option 2: "Check p95, p99"
   p95 = 1000ms (5% of users suffering)
   p99 = 5000ms (1% experience 5-second waits)
Root cause: Slow database query on specific conditions
Fix: Add index, optimize query, increase cache TTL
Throughput Metrics
Throughput measures how much work the system completes.
RPS (Requests Per Second)
- Definition: How many HTTP requests the system processes per second
- Example: 1,000 RPS = system handles 1,000 requests/sec
- Measurement: Count successful requests in a 1-second window
- During load test: achieved RPS should match the injected rate; a shortfall means the system is saturating
TPS (Transactions Per Second)
- Definition: Number of complete business transactions per second
- Differs from RPS: One transaction might involve multiple requests
- Example: Checkout flow = 5 HTTP requests, so 1,000 RPS corresponds to 200 TPS
- When to use: Business metrics, SLA reporting
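RPS can be derived by bucketing request timestamps into one-second windows; TPS then follows by dividing by the number of requests per transaction (five, for the checkout flow mentioned above). A sketch with illustrative timestamps:

```python
from collections import Counter

def rps_per_window(timestamps):
    """Count requests in each 1-second window, given epoch seconds."""
    return Counter(int(t) for t in timestamps)

# Illustrative data: 3 requests land in second 0, 2 in second 1.
counts = rps_per_window([0.1, 0.4, 0.9, 1.2, 1.8])

requests_per_transaction = 5            # checkout flow from the example
tps = max(counts.values()) / requests_per_transaction
```

Real tools aggregate the same way, just over millions of samples and finer windows.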
Success Rate & Error Rate
Success Rate: Percentage of requests that succeeded (HTTP 2xx, 3xx)
Example from load test:
├─ Total requests: 10,000
├─ Successful: 9,950
├─ Failed: 50
├─ Success rate: 99.5%
└─ Error rate: 0.5%

SLA: Must be >99% success
Result: ✅ PASS (99.5% > 99%)
Error Rate: Percentage that failed (4xx, 5xx, timeouts)
Common failure modes:
├─ 4xx errors (4%): Client errors (bad requests)
│  └─ Often: Invalid data, auth failures
├─ 5xx errors (0.5%): Server errors
│  └─ Often: Database down, OOM, unhandled exceptions
└─ Timeouts (0.1%): Request never completed
   └─ Often: Slow database, external service, queue buildup
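Success rate under the 2xx/3xx rule above is a simple classification over status codes; timeouts (no status at all) count as failures. A sketch reproducing the 10,000-request example, with its 99% SLA (the exact mix of codes is illustrative):

```python
def success_rate(status_codes):
    """Percent of requests with a 2xx/3xx status; None means timeout."""
    ok = sum(1 for s in status_codes if s is not None and 200 <= s < 400)
    return 100 * ok / len(status_codes)

# 9,950 successes out of 10,000, matching the example above.
codes = [200] * 9900 + [301] * 50 + [500] * 40 + [None] * 10
rate = success_rate(codes)      # 99.5
meets_sla = rate > 99           # PASS
```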
Resource Metrics
While running a load test, monitor your system resources on the server being tested. These are usually captured with Datadog.
CPU Utilization
- Definition: Percentage of CPU time being used
- During load test: 60-80% sustained is typically healthy; consistently above ~90% means the service is CPU-bound
- What causes high CPU?
- Complex calculations (encryption, compression)
- Inefficient algorithms
- Thread contention (locks, synchronized blocks)
- Garbage collection pauses (Java, Go, Python)
Memory Usage
- Definition: RAM consumed by the application
- Watch for: steadily growing usage over time (leak), swap activity, OOM kills
- Common memory issues:
- Memory leaks (objects not released)
- Growing caches without eviction
- Connection pool leaks (connections not returned)
- Message queues filling up
Disk I/O
- Definition: Read/write operations to disk
- During load test: track IOPS and read/write throughput against the volume's limits
- Watch for:
- High disk I/O โ indicates database queries not cached
- Disk saturation โ indicates storage bottleneck
- Example: an SSD-backed cloud volume may cap at roughly 3,000-16,000 IOPS (AWS limits vary by volume type)
Network I/O
- Definition: Bytes sent and received over network
- During load test: compare throughput against the host's bandwidth cap
- Common issue: large, uncompressed payloads saturate the network before CPU does
Connection Pools
- Definition: Active database/service connections
- During load test: watch active connections against the pool maximum
- Problem: connection pool exhaustion; new requests queue waiting for a free connection, inflating latency
How Metrics Work Together
Example 1: System Performing Well
Load test: 5,000 RPS for 10 minutes
Metrics during test:
├─ p95 latency: 200ms ✅ (target: <300ms)
├─ p99 latency: 400ms ✅ (target: <1000ms)
├─ Success rate: 99.8% ✅ (target: >99%)
├─ CPU: 65% ✅ (healthy, room to grow)
├─ Memory: 1.2GB ✅ (stable, not growing)
├─ Database connections: 30/50 ✅ (headroom)
└─ Disk I/O: Low ✅ (queries cached)

Conclusion: ✅ SYSTEM HEALTHY
Action: Can handle 2-3x more load safely
Example 2: Database Bottleneck
Load test: Increasing from 1,000 to 10,000 RPS
Observations:
├─ 1,000 RPS: p95 latency = 50ms, CPU 20%
├─ 2,000 RPS: p95 latency = 75ms, CPU 25%
├─ 5,000 RPS: p95 latency = 300ms, CPU 40%, disk I/O ⬆⬆⬆
└─ 10,000 RPS: p95 latency = 2000ms, CPU 50%, disk maxed out
Root cause: Disk I/O ceiling
├─ Database queries hitting disk (not in cache)
├─ Each query = disk I/O wait
├─ As load increases, more queries queue up
└─ Latency explodes non-linearly
Fix options:
├─ Add caching layer (Redis)
├─ Optimize slow queries (add indexes)
├─ Increase database connection pool
├─ Scale database (read replicas)
└─ Re-test after fix
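The non-linear latency explosion in this example is classic queueing behavior: as utilization of the bottleneck resource (here, disk) approaches 100%, wait time grows without bound. A toy M/M/1 approximation illustrates the shape (a simplified model, not a prediction of any real system):

```python
def mm1_response_ms(service_ms, utilization):
    """M/M/1 mean response time: service_ms / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1 - utilization)

# A 10ms disk operation at rising utilization:
# 50% busy -> ~20ms, 90% -> ~100ms, 99% -> ~1000ms.
for u in (0.5, 0.9, 0.99):
    print(round(mm1_response_ms(10, u)))
```

This is why latency looks flat up to ~5,000 RPS and then blows up: the last doubling of load pushes the disk from moderate to near-total utilization.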
Example 3: GC Pause Impact
Java application under load
JVM GC (Garbage Collection) timeline:
├─ Time 0-45s: Normal operations
│  └─ p95 latency: 100ms, steady
│
├─ Time 45s: Full GC pause (0.5 seconds)
│  ├─ All requests block
│  ├─ p95 latency spikes to 2000ms+
│  └─ Requests timeout, 0.1% errors
│
├─ Time 45.5s: GC completes
│  └─ Requests resume
│
├─ Time 45.5-50s: High p99 tail
│  ├─ Requests queued during GC still processing
│  └─ Takes 2-3 seconds to drain queue
│
└─ Time 50+s: Back to normal
Observation: Every 45 seconds, p95 spikes to 2000ms
Fix options:
├─ Tune JVM heap size (-Xmx, -Xms)
├─ Change GC algorithm (G1GC, ZGC for low latency)
├─ Add more memory to reduce GC frequency
└─ Re-test after tuning
Metrics by Load Test Type
Load Test (Baseline)
Monitor these:
├─ p50, p95, p99 latencies → Should be stable
├─ RPS / TPS → Should be consistent
├─ Success rate → Should be >99%
├─ CPU/Memory → Should be steady
└─ Resource utilization → Should not be maxing out
Stress Test (Breaking Point)
Monitor:
├─ p95, p99, p99.9 latencies → Will increase
├─ Error rate → Should increase as you approach limit
├─ RPS plateau → Where does it max out?
├─ CPU/Memory peaks → What triggers saturation?
└─ Recovery time → How long to stabilize?
Soak Test (Long-term Stability)
Monitor over 8-24 hours:
├─ Memory trend → Growing (leak) or stable?
├─ p95 latency trend → Increasing (degradation) or stable?
├─ Connection count → Growing (leak) or stable?
├─ GC pause frequency → Increasing (more pressure)?
└─ Error rate trend → Any anomalies over time?
Spike Test (Recovery)
Monitor during spike and recovery:
├─ Before spike: p95 = 50ms
├─ During spike: p95 = 1000ms (accept 20x increase)
├─ After spike: p95 → 50ms (should recover within 2-3 minutes)
└─ System availability → Did it stay up?
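Recovery time can be read off the p95 time series: find when latency first returns within a tolerance of the pre-spike baseline and stays there. A minimal sketch (the sample series and 2x tolerance are illustrative):

```python
def recovery_index(p95_series, baseline_ms, tolerance=2.0):
    """Index of the first sample at/below tolerance*baseline with no
    later sample above it; None if the series never settles."""
    limit = baseline_ms * tolerance
    settled = None
    for i, value in enumerate(p95_series):
        if value > limit:
            settled = None          # spiked again; reset
        elif settled is None:
            settled = i             # candidate recovery point
    return settled

# One sample per 10s: spike at indexes 2-4, recovered from index 5 on,
# i.e. 50 seconds into the series.
series = [50, 55, 1000, 600, 150, 90, 55, 52]
idx = recovery_index(series, baseline_ms=50)
```

The "stays there" condition matters: a series that briefly dips back to baseline mid-spike should not count as recovered.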
Target Metrics by Service Type
| Service | p95 Target | p99 Target | Success Rate |
|---|---|---|---|
| Web App | <300ms | <1000ms | >99.5% |
| Mobile API | <200ms | <500ms | >99.5% |
| Internal API | <100ms | <300ms | >99.9% |
| Real-time | <50ms | <100ms | >99.9% |
| Batch/Event | <5000ms | <30000ms | >99% |
| Kafka Stream | <100ms (produce) | <500ms | >99.9% |
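These targets can be enforced mechanically after each run by comparing measured values against the row for the service type. A sketch covering a subset of rows (thresholds copied from the table; the field names are illustrative):

```python
TARGETS = {  # service: (p95_ms, p99_ms, min_success_pct), from the table
    "web_app":      (300, 1000, 99.5),
    "mobile_api":   (200,  500, 99.5),
    "internal_api": (100,  300, 99.9),
    "real_time":    ( 50,  100, 99.9),
}

def check_run(service, p95_ms, p99_ms, success_pct):
    """Return a pass/fail verdict per target for one load-test run."""
    t_p95, t_p99, t_ok = TARGETS[service]
    return {
        "p95": p95_ms < t_p95,
        "p99": p99_ms < t_p99,
        "success_rate": success_pct > t_ok,
    }

# The healthy web app from Example 1 passes all three targets.
result = check_run("web_app", p95_ms=200, p99_ms=400, success_pct=99.8)
```

Wiring a check like this into CI makes a regression fail the build instead of surfacing in production.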
Tools for Collecting Metrics
Gatling Built-in
Metrics are automatically collected:
├─ Response times (min, max, percentiles)
├─ Success/error rates
├─ Request counts by endpoint
└─ HTML report with charts
Datadog APM (Recommended for Production)
├─ Real-time metrics
├─ Trace-level detail (which database query was slow?)
├─ Custom metrics and annotations
├─ Alert thresholds
└─ Dashboard queries
Application Monitoring
Alongside APM, watch application-level signals during tests: GC logs, slow-query logs, thread and connection pool stats, and queue depths.
Next Steps
→ Read next: Load Testing Methodology - How to plan and execute tests