Key Metrics & Measurements

Introduction

Performance metrics tell you how well your system is actually performing. Without metrics, you're flying blind.

"You can't improve what you don't measure." - Peter Drucker

This section covers the metrics you'll track during load tests and what they mean.


Latency Metrics (Response Time)

Latency is the time from when a request is sent until a response is received.

┌─ Client sends request (GET /api/users)
│
├─ Network transit: 5ms
├─ Server processes: 100ms
├─ Network return: 5ms
│
└─ Response received
   Total latency = 110ms

The Problem with Mean Latency

Many teams track mean (average) latency, but it's misleading:

Request 1:  50ms
Request 2:  60ms
Request 3:  55ms
Request 4:  70ms
Request 5:  5000ms ← One slow request (cache miss, GC pause)
Request 6:  65ms
Request 7:  60ms
Request 8:  55ms
Request 9:  50ms
Request 10: 70ms

Mean latency = 5535ms ÷ 10 = 553.5ms ← Distorted by a single outlier!

Only 1 request was slow, but the mean suggests widespread problems. Nine of the ten requests finished in 70ms or less, and the median is 60ms.
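
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the ten sample latencies listed; the variable names are illustrative.

```python
from statistics import mean, median

# Latencies (ms) of the ten requests listed above.
latencies_ms = [50, 60, 55, 70, 5000, 65, 60, 55, 50, 70]

mean_ms = mean(latencies_ms)      # pulled upward by the single 5000ms request
median_ms = median(latencies_ms)  # what a typical request experienced

print(f"mean   = {mean_ms} ms")
print(f"median = {median_ms} ms")
```

The gap between the two numbers is the signal: a mean far above the median usually points to a small number of very slow requests.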

Percentile Latencies (Correct Approach)

Percentiles tell you the distribution of response times:

P50 (Median)

  • Definition: 50% of requests complete faster than this time
  • Example: p50 = 100ms means half your requests are faster than 100ms
  • Use case: Baseline performance; less useful alone

P95 (95th Percentile)

  • Definition: 95% of requests complete within this time; 5% are slower
  • Example: p95 = 300ms means:
    ✓ 9,500 out of 10,000 requests complete in ≤300ms
    ✗ 500 out of 10,000 requests take >300ms

  • Industry guideline:
    • Web apps: target <500ms
    • APIs: target <200ms
    • Real-time systems: target <50ms
  • User impact: Most users have a good experience; 5% might notice slowness

P99 (99th Percentile)

  • Definition: 99% of requests complete within this time; 1% are slower
  • Example: p99 = 800ms means:
    ✓ 9,900 out of 10,000 requests complete in ≤800ms
    ✗ 100 out of 10,000 requests take >800ms

  • Industry guideline:
    • Web apps: target <1000ms
    • APIs: target <500ms
  • User impact: A small share of users occasionally experience significant slowness

P99.9 (99.9th Percentile)

  • Definition: 99.9% of requests complete within this time; 0.1% are slower
  • Example: p99.9 = 3000ms means:
    ✓ 9,990 out of 10,000 requests complete in ≤3000ms
    ✗ 10 out of 10,000 requests take >3000ms

  • Use case: SLA compliance, extreme outliers
  • User impact: Rare users experience very slow responses
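
The definitions above can be turned into a small calculation. The sketch below uses the nearest-rank method, which is one of several percentile conventions; load-testing tools may interpolate and report slightly different values. The function name `percentile` is illustrative, not part of any tool's API.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point error before taking the ceiling.
    rank = math.ceil(round(pct * len(ordered) / 100, 9))  # 1-based rank
    return ordered[rank - 1]

# The ten sample latencies (ms) from the mean-latency example.
latencies_ms = [50, 60, 55, 70, 5000, 65, 60, 55, 50, 70]
for pct in (50, 95, 99):
    print(f"p{pct} = {percentile(latencies_ms, pct)} ms")
```

With only ten samples, p95 and p99 both land on the slowest request. High percentiles become meaningful only with enough samples, which is one reason load tests run thousands of requests.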

Latency Example from Real Test

Test: 10,000 requests at constant load

┌─ Percentile Distribution ─────────────────────────
│
│    Count
│      │     ╱╲
│ 1000 │    ╱  ╲
│      │   ╱    ╲___
│  500 │  ╱          ╲
│      │ ╱             ╲
│      │╱_________________╲___
│      └────────────────────────────→ Latency (ms)
│      0    100   200   300   500   5000
│
├─ Key metrics:
│  ├─ Min:  45ms
│  ├─ p50:  100ms  (half faster than 100ms)
│  ├─ p95:  300ms  (95% faster than 300ms)
│  ├─ p99:  1000ms (99% faster than 1000ms)
│  ├─ p99.9: 3000ms (99.9% faster than 3000ms)
│  └─ Max:  5000ms
│
└─ About 100 requests (1%) took >1000ms (cache misses, GC pauses)

Why Percentiles Matter

Scenario: Your p95 = 1000ms (too high!)

Diagnosis options:
❌ Option 1: "Let's use mean to decide"
   Mean = 150ms (looks fine, misleading!)

✓ Option 2: "Check p95, p99"
   p95 = 1000ms (5% of users suffering)
   p99 = 5000ms (1% experience 5-second waits)
   Root cause: Slow database query on specific conditions
   Fix: Add index, optimize query, increase cache TTL

Throughput Metrics

Throughput measures how much work the system completes.

RPS (Requests Per Second)

  • Definition: How many HTTP requests the system processes per second
  • Example: 1,000 RPS = system handles 1,000 requests/sec
  • Measurement: Count successful requests in a 1-second window
  • During load test:
    Second 1: 1,000 requests processed ✓ RPS = 1,000
    Second 2: 950 requests processed ⚠ RPS = 950 (degrading under load)
    Second 3: 500 requests processed ❌ RPS = 500 (system saturating)
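
The measurement rule above (count successful requests in each 1-second window) can be sketched as follows. The input format, a list of `(timestamp_seconds, succeeded)` pairs, is an assumption for illustration; real tools record this internally.

```python
from collections import Counter

def requests_per_second(completions):
    """Count successful requests in each one-second window.

    completions: iterable of (timestamp_seconds, succeeded) pairs,
    with non-negative timestamps measured from the start of the test.
    """
    counts = Counter()
    for timestamp, succeeded in completions:
        if succeeded:
            counts[int(timestamp)] += 1  # int() truncates to the window start
    return dict(sorted(counts.items()))

# Hypothetical sample: six completions spread over two seconds, one failed.
sample = [(0.10, True), (0.45, True), (0.90, False),
          (1.20, True), (1.75, True), (1.99, True)]
print(requests_per_second(sample))
```

A falling count from one window to the next at constant offered load is the degradation pattern shown in the example above.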
    

TPS (Transactions Per Second)

  • Definition: Number of complete business transactions per second
  • Differs from RPS: One transaction might involve multiple requests
  • Example: Checkout flow = 5 HTTP requests
    If RPS = 1,000:
      ├─ One checkout = 5 requests
      └─ TPS = 1,000 ÷ 5 = 200 transactions/sec
    
  • When to use: Business metrics, SLA reporting
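
The RPS-to-TPS conversion can be written as a one-line helper. This sketch assumes all measured traffic belongs to a single flow with a fixed number of requests per transaction, as in the checkout example; mixed traffic would need per-flow counting instead.

```python
def transactions_per_second(rps, requests_per_transaction):
    """Convert RPS to TPS for a flow with a fixed request count per transaction."""
    if requests_per_transaction <= 0:
        raise ValueError("requests_per_transaction must be positive")
    return rps / requests_per_transaction

# Checkout flow from the example: 5 HTTP requests per transaction.
print(transactions_per_second(1_000, 5))
```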

Success Rate & Error Rate

Success Rate: Percentage of requests that succeeded (HTTP 2xx, 3xx)

Example from load test:
├─ Total requests: 10,000
├─ Successful: 9,950
├─ Failed: 50
├─ Success rate: 99.5%
└─ Error rate: 0.5%

SLA: Must be >99% success
Result: ✓ PASS (99.5% > 99%)

Error Rate: Percentage that failed (4xx, 5xx, timeouts)

Common failure modes:
├─ 4xx errors (4%): Client errors (bad requests)
│  └─ Often: Invalid data, auth failures
├─ 5xx errors (0.5%): Server errors
│  └─ Often: Database down, OOM, unhandled exceptions
└─ Timeouts (0.1%): Request never completed
   └─ Often: Slow database, external service, queue buildup
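
A minimal sketch of how success and error rates might be computed from raw outcomes, using the definitions above (2xx/3xx succeed; 4xx, 5xx and timeouts fail). Representing a timeout as `None` and the split of the 50 failures are assumptions made for illustration; only the totals match the load test example.

```python
def summarize(outcomes):
    """Classify outcomes and return (counts, success_rate_pct, error_rate_pct).

    outcomes: list of HTTP status codes, with None standing for a timeout.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    counts = {"success": 0, "4xx": 0, "5xx": 0, "timeout": 0}
    for status in outcomes:
        if status is None:
            counts["timeout"] += 1
        elif 200 <= status < 400:
            counts["success"] += 1
        elif 400 <= status < 500:
            counts["4xx"] += 1
        elif 500 <= status < 600:
            counts["5xx"] += 1
        else:
            raise ValueError(f"unexpected status: {status}")
    success_rate = 100 * counts["success"] / len(outcomes)
    return counts, success_rate, 100 - success_rate

# 10,000 requests with 50 failures; the failure breakdown is hypothetical.
outcomes = [200] * 9950 + [404] * 15 + [500] * 30 + [None] * 5
counts, success_rate, error_rate = summarize(outcomes)
print(counts, f"success={success_rate}%", f"error={error_rate}%")
print("SLA PASS" if success_rate > 99 else "SLA FAIL")
```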

Resource Metrics

While running a load test, monitor your system resources on the server being tested. These are usually captured with Datadog.

CPU Utilization

  • Definition: Percentage of CPU time being used
  • During load test:
    0-20%  : System idle, plenty of headroom
    20-50% : Normal load, healthy
    50-80% : Getting busy, approaching limits
    80-95% : Very busy, risk of slow responses
    95-100%: CPU-bound bottleneck, system saturating
    
  • What causes high CPU?
    • Complex calculations (encryption, compression)
    • Inefficient algorithms
    • Thread contention (locks, synchronized blocks)
    • Garbage collection pauses (Java, Go, Python)

Memory Usage

  • Definition: RAM consumed by the application
  • Watch for:
    100 requests:  200MB ✓ Normal
    1,000 requests: 250MB ✓ Still reasonable
    10,000 requests: 500MB ✓ Growing as expected
    100,000 requests: 8GB ❌ Memory leak!
    
  • Common memory issues:
    • Memory leaks (objects not released)
    • Growing caches without eviction
    • Connection pool leaks (connections not returned)
    • Message queues filling up

Disk I/O

  • Definition: Read/write operations to disk
  • During load test:
    Reads: Database queries hitting disk (not in page cache)
    Writes: Log files, database changes, temporary data
    
  • Watch for:
    • High disk I/O → indicates database queries not cached
    • Disk saturation → indicates storage bottleneck
    • Example: keep SSD writes per second below the volume's IOPS limit (for example <10,000; AWS limits vary by volume type)

Network I/O

  • Definition: Bytes sent and received over network
  • During load test:
    Inbound: Requests from load tester
    Outbound: Responses to client, external API calls
    
  • Common issue:
    RPS = 10,000 requests/sec
    Avg response = 10KB
    Outbound = 10,000 × 10KB = 100,000KB/sec = 100MB/sec
    Network bandwidth = 1Gbps = 125MB/sec
    Headroom = 25MB/sec of 125MB/sec = 20% (tight!)
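
The bandwidth arithmetic above can be reproduced with a small helper. This is a sketch using decimal units (1MB = 1,000KB, 1Gbps = 125MB/sec) and ignoring protocol overhead such as headers and TLS, so real headroom will be somewhat lower. The function name is illustrative.

```python
def outbound_headroom(rps, avg_response_kb, link_gbps):
    """Return (outbound MB/s, link capacity MB/s, headroom as a fraction)."""
    outbound_mb_per_s = rps * avg_response_kb / 1000  # KB/s -> MB/s
    capacity_mb_per_s = link_gbps * 1000 / 8          # Gbit/s -> MB/s
    headroom = 1 - outbound_mb_per_s / capacity_mb_per_s
    return outbound_mb_per_s, capacity_mb_per_s, headroom

outbound, capacity, headroom = outbound_headroom(10_000, 10, 1)
print(f"{outbound:.0f} MB/s of {capacity:.0f} MB/s, headroom {headroom:.0%}")
```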
    

Connection Pools

  • Definition: Active database/service connections
  • During load test:
    Pool size: 50 connections (configured max)
    Load level 20: 10 connections in use ✓
    Load level 100: 45 connections in use ⚠
    Load level 200: 50 connections in use + 30 requests waiting ❌ (queue!)
    
  • Problem: Connection pool exhaustion
    └─ Connections slow/blocked
       └─ Requests queue up
          └─ More requests arrive
             └─ Queue grows
                └─ Timeouts, cascading failures
    

How Metrics Work Together

Example 1: System Performing Well

Load test: 5,000 RPS for 10 minutes

Metrics during test:
├─ p95 latency: 200ms ✓ (target: <300ms)
├─ p99 latency: 400ms ✓ (target: <1000ms)
├─ Success rate: 99.8% ✓ (target: >99%)
├─ CPU: 65% ✓ (healthy, room to grow)
├─ Memory: 1.2GB ✓ (stable, not growing)
├─ Database connections: 30/50 ✓ (headroom)
└─ Disk I/O: Low ✓ (queries cached)

Conclusion: ✅ SYSTEM HEALTHY
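Action: Headroom remains, but CPU at 65% and 30/50 connections suggest roughly 1.5x more load (assuming near-linear scaling), not 2-3x; run a stress test to find the actual limit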

Example 2: Database Bottleneck

Load test: Increasing from 1,000 to 10,000 RPS

Observations:
├─ 1,000 RPS: p95 latency = 50ms, CPU 20%
├─ 2,000 RPS: p95 latency = 75ms, CPU 25%
├─ 5,000 RPS: p95 latency = 300ms, CPU 40%, disk I/O ⬆ ⬆ ⬆
└─ 10,000 RPS: p95 latency = 2000ms, CPU 50%, disk maxed out

Root cause: Disk I/O ceiling
│
└─ Database queries hitting disk (not in cache)
   └─ Each query = disk I/O wait
      └─ As load increases, more queries queue up
         └─ Latency explodes non-linearly

Fix options:
├─ Add caching layer (Redis)
├─ Optimize slow queries (add indexes)
├─ Increase database connection pool (helps only if connections, not disk, are the limit)
├─ Scale database (read replicas)
└─ Re-test after fix
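
One way to spot the non-linear latency growth described above is to compare, step by step, how much the load grew with how much p95 latency grew. The sketch below applies this to the four observations from this example; the "latency grew faster than load" rule is a simple heuristic chosen for illustration, not a standard threshold.

```python
# (RPS, p95 latency in ms) pairs from the observations above.
observations = [(1_000, 50), (2_000, 75), (5_000, 300), (10_000, 2_000)]

def growth_factors(points):
    """For each consecutive pair, return (rps, load_growth, latency_growth)."""
    result = []
    for (rps_a, lat_a), (rps_b, lat_b) in zip(points, points[1:]):
        result.append((rps_b, rps_b / rps_a, lat_b / lat_a))
    return result

for rps, load_x, latency_x in growth_factors(observations):
    flag = "non-linear" if latency_x > load_x else "ok"
    print(f"{rps:>6} RPS: load x{load_x:.1f}, p95 x{latency_x:.1f} -> {flag}")
```

On this data the step to 5,000 RPS is the first where latency grows faster than load, which matches the point where disk I/O starts climbing.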

Example 3: GC Pause Impact

Java application under load

JVM GC (Garbage Collection) timeline:
├─ Time 0-45s: Normal operations
│  └─ p95 latency: 100ms, steady
│
├─ Time 45s: Full GC pause (0.5 seconds)
│  └─ All requests block
│     ├─ p95 latency spikes to 2000ms+
│     └─ Requests time out, 0.1% errors
│
├─ Time 45.5s: GC completes
│  └─ Requests resume
│
├─ Time 45.5-50s: High p99 tail
│  └─ Requests queued during GC still processing
│     └─ Takes 2-3 seconds to drain queue
│
└─ Time 50+s: Back to normal

Observation: Every 45 seconds, p95 spikes to 2000ms

Fix options:
├─ Tune JVM heap size (-Xmx, -Xms)
├─ Change GC algorithm (G1GC, ZGC for low latency)
├─ Add more memory to reduce GC frequency
└─ Re-test after tuning

Metrics by Load Test Type

Load Test (Baseline)

Monitor these:

├─ p50, p95, p99 latencies ← Should be stable
├─ RPS / TPS ← Should be consistent
├─ Success rate ← Should be >99%
├─ CPU/Memory ← Should be steady
└─ Resource utilization ← Should not be maxing out

Stress Test (Breaking Point)

Monitor:

├─ p95, p99, p99.9 latencies ← Will increase
├─ Error rate ← Should increase as you approach limit
├─ RPS plateau ← Where does it max out?
├─ CPU/Memory peaks ← What triggers saturation?
└─ Recovery time ← How long to stabilize?

Soak Test (Long-term Stability)

Monitor over 8-24 hours:

├─ Memory trend ← Growing (leak) or stable?
├─ p95 latency trend ← Increasing (degradation) or stable?
├─ Connection count ← Growing (leak) or stable?
├─ GC pause frequency ← Increasing (more pressure)?
└─ Error rate trend ← Any anomalies over time?
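
"Growing or stable?" can be answered numerically by fitting a straight line to samples taken during the soak test and looking at the slope. The sketch below uses ordinary least squares; the sample values are hypothetical, and what slope counts as a leak depends on the application.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours_elapsed, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator

# Hypothetical memory readings (hours elapsed, MB) from two soak tests.
stable = [(0, 1200), (2, 1210), (4, 1195), (6, 1205), (8, 1200)]
leaking = [(0, 1200), (2, 1400), (4, 1600), (6, 1800), (8, 2000)]

print(f"stable:  {slope_per_hour(stable):+.2f} MB/hour")
print(f"leaking: {slope_per_hour(leaking):+.2f} MB/hour")
```

The same function applies to p95 latency or connection counts sampled over time.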

Spike Test (Recovery)

Monitor during spike and recovery:

├─ Before spike: p95 = 50ms
├─ During spike: p95 = 1000ms (accept 20x increase)
├─ After spike: p95 → 50ms (should recover within 2-3 minutes)
└─ System availability ← Did it stay up?
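
Recovery time can be measured from a p95 time series as the delay between the end of the spike and the first sample back near baseline. This is a sketch: the sample series and the 20% tolerance around baseline are assumptions chosen for illustration.

```python
def recovery_seconds(samples, spike_end, baseline_ms, tolerance=1.2):
    """Seconds after spike_end until p95 first returns to within
    tolerance * baseline_ms, or None if it never does.

    samples: list of (time_seconds, p95_ms) in time order.
    """
    threshold = baseline_ms * tolerance
    for t, p95 in samples:
        if t >= spike_end and p95 <= threshold:
            return t - spike_end
    return None

# Hypothetical p95 samples taken every 60 seconds; the spike ends at t=120.
series = [(0, 50), (60, 1000), (120, 950), (180, 400), (240, 120), (300, 55)]
print(recovery_seconds(series, spike_end=120, baseline_ms=50))
```

A result of `None` means the system never returned to baseline within the observed window, which is itself a finding worth reporting.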


Target Metrics by Service Type

Service        p95 Target         p99 Target   Success Rate
Web App        <300ms             <1000ms      >99.5%
Mobile API     <200ms             <500ms       >99.5%
Internal API   <100ms             <300ms       >99.9%
Real-time      <50ms              <100ms       >99.9%
Batch/Event    <5000ms            <30000ms     >99%
Kafka Stream   <100ms (produce)   <500ms       >99.9%
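
The targets above can be encoded as data and checked automatically at the end of a test run. The sketch below copies three rows from the table; the dictionary layout and the `check_targets` function are illustrative, not part of any tool.

```python
# Targets copied from the table above (latency in ms, success rate in %).
TARGETS = {
    "Web App":      {"p95_ms": 300, "p99_ms": 1000, "success_pct": 99.5},
    "Mobile API":   {"p95_ms": 200, "p99_ms": 500,  "success_pct": 99.5},
    "Internal API": {"p95_ms": 100, "p99_ms": 300,  "success_pct": 99.9},
}

def check_targets(service, p95_ms, p99_ms, success_pct):
    """Return a pass/fail flag per metric for the given service type."""
    target = TARGETS[service]
    return {
        "p95": p95_ms < target["p95_ms"],
        "p99": p99_ms < target["p99_ms"],
        "success": success_pct > target["success_pct"],
    }

# Measured values taken from Example 1 (System Performing Well).
print(check_targets("Web App", p95_ms=200, p99_ms=400, success_pct=99.8))
print(check_targets("Internal API", p95_ms=200, p99_ms=400, success_pct=99.8))
```

The same measurements pass as a web app and fail as an internal API, which is why targets should be chosen per service type before the test runs.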

Tools for Collecting Metrics

Gatling Built-in

Metrics are automatically collected:

├─ Response times (min, max, percentiles)
├─ Success/error rates
├─ Request counts by endpoint
└─ HTML report with charts

Datadog

├─ Real-time metrics
├─ Trace-level detail (which database query was slow?)
├─ Custom metrics and annotations
├─ Alert thresholds
└─ Dashboard queries

Application Monitoring

├─ New Relic
├─ Dynatrace
├─ Splunk
└─ Your own metrics (StatsD, Prometheus)

Next Steps

→ Read next: Load Testing Methodology - How to plan and execute tests