Load Testing Methodology
The Systematic Approach
Performance testing isn't random. It's a disciplined process:
1. PLAN: Define what you're testing and why
↓
2. DESIGN: Create realistic scenarios
↓
3. IMPLEMENT: Code simulations in Gatling
↓
4. EXECUTE: Run tests systematically
↓
5. ANALYZE: Understand the results
↓
6. OPTIMIZE: Fix bottlenecks
↓
7. RE-TEST: Verify improvements
↓
8. DEPLOY: With confidence
Phase 1: PLAN (Define Objectives)
Step 1a: Document Success Criteria (SLAs)
Before you write any code, define what "success" looks like:
Question: What must be true for the test to PASS?
Examples:
1. E-commerce checkout
├─ p95 latency: <500ms
├─ p99 latency: <1000ms
├─ Success rate: >99.5% (max 0.5% errors)
└─ Achieve: 2,000 concurrent users
2. Mobile API
├─ p95 latency: <200ms
├─ p99 latency: <500ms
├─ Success rate: >99.9%
└─ Achieve: 5,000 RPS
3. Real-time analytics service
├─ p95 latency: <50ms
├─ p99 latency: <100ms
├─ Success rate: >99.99%
└─ Achieve: 10,000 RPS
Pro tip: Involve stakeholders in defining SLAs!
- Product team: What do users tolerate?
- Ops team: What can infrastructure support?
- Engineering: What's reasonable to achieve?
Step 1b: Identify Critical User Paths
List the most important scenarios to test:
E-commerce example:
Rank 1: Browse & Purchase (most revenue impact)
├─ Search products
├─ View product details
├─ Add to cart
├─ Checkout
└─ Payment
Rank 2: Account Management
├─ Login
├─ View order history
├─ Update profile
└─ Change password
Rank 3: Customer Support
├─ View FAQ
├─ Submit ticket
└─ Chat
→ Focus load testing on Rank 1 paths first!
Step 1c: Estimate Expected Load
Question: How many concurrent users will we have?
Methods:
1. Historical data
└─ "We had 10,000 concurrent users on Black Friday"
2. Capacity planning
└─ "We want to grow to 50,000 users/day ≈ 500 concurrent (at ~1% simultaneous)"
3. Industry benchmarks
└─ "Industry average is 10% simultaneous"
4. Peak calculation
└─ "100,000 daily active users × 5% = 5,000 peak concurrent"
→ TEST TO 2-3x EXPECTED LOAD (safety margin)
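The peak calculation and safety margin above reduce to two multiplications. A minimal plain-Java sketch (the 5% concurrency ratio and 2.5x factor are the example figures from this section, not universal constants):

```java
// Estimate peak concurrent users and a load-test target with safety margin.
public class LoadEstimate {

    // e.g. 100,000 DAU at 5% concurrency -> 5,000 peak concurrent users
    public static int peakConcurrent(int dailyActiveUsers, double concurrencyRatio) {
        return (int) Math.round(dailyActiveUsers * concurrencyRatio);
    }

    // Test to 2-3x expected load for a safety margin
    public static int testTarget(int peakConcurrent, double safetyFactor) {
        return (int) Math.round(peakConcurrent * safetyFactor);
    }

    public static void main(String[] args) {
        int peak = peakConcurrent(100_000, 0.05);
        int target = testTarget(peak, 2.5);
        System.out.println("Peak: " + peak + ", test target: " + target);
    }
}
```

Whichever estimation method you use, write the resulting number down next to the SLAs so the injection profile in Phase 3 has a documented source.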
Phase 2: DESIGN (Create Scenarios)
Step 2a: Define Realistic User Behavior
Bad: 1,000 users all doing the exact same thing at the exact same time
- Unrealistic
- Tests a specific bottleneck, not general behavior
- Doesn't reveal cascade failures
Good: Users follow realistic patterns with variety
Realistic user behavior:
├─ Not all users do the same thing
│ └─ 70% browse
│ └─ 20% purchase
│ └─ 10% use support
│
├─ Users have think-time (don't hammer continuously)
│ └─ Read product description: 10-30 seconds
│ └─ Check reviews: 5-15 seconds
│ └─ Decide to purchase or not: 30-60 seconds
│
├─ Request patterns vary
│ └─ Sometimes cache hit (fast)
│ └─ Sometimes cache miss (slow)
│ └─ Occasional errors (network, external service)
│
└─ Data varies
└─ Different users, products, regions
└─ Different request payloads
Implementation in Gatling: Use feeders + pauses
scenario("Realistic User")
    .feed(userFeeder)                          // Different user each iteration
    .exec(http("Search").get("/search?q=#{query}"))
    .pause(10, 30)                             // 10-30 sec think-time
    .exec(http("View Details").get("/product/#{productId}"))
    .pause(5, 15)
    // ... more realistic actions ...
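Inside Gatling, the 70/20/10 behavior split is typically expressed with its randomSwitch construct. The underlying logic is just weighted selection; a plain-Java sketch (the class and percentages are illustrative, taken from the tree above):

```java
import java.util.Random;

// Weighted selection of user behavior: 70% browse, 20% purchase, 10% support.
public class BehaviorMix {

    // roll is a uniform random number in [0.0, 1.0)
    public static String pick(double roll) {
        if (roll < 0.70) return "browse";
        if (roll < 0.90) return "purchase";
        return "support";
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Random think-time between 10 and 30 seconds, like pause(10, 30)
        int thinkTimeSec = 10 + rng.nextInt(21);
        System.out.println(pick(rng.nextDouble()) + ", think " + thinkTimeSec + "s");
    }
}
```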
Step 2b: Design Multiple Scenarios
Scenario 1: Load Test (Baseline)
├─ Constant load: 1,000 users for 15 minutes
├─ Objective: "Does it meet SLAs under normal load?"
├─ Success: p95 <500ms, success rate >99%
└─ Duration: ~20 minutes
Scenario 2: Ramp Test (Find Limits)
├─ Start: 100 users, ramp to 5,000 users over 30 minutes
├─ Objective: "At what load do SLAs break?"
├─ Watch for: Where does p95 spike? Where do errors start?
└─ Duration: ~40 minutes
Scenario 3: Step Test (Threshold Analysis)
├─ Step 1: 1,000 users for 5 min
├─ Step 2: 2,000 users for 5 min
├─ Step 3: 3,000 users for 5 min
├─ Step 4: 4,000 users for 5 min
├─ Objective: "At what level does each metric degrade?"
├─ Watch for: Cache warmup, JVM behavior at each level
└─ Duration: ~25 minutes
Scenario 4: Spike Test (Recovery)
├─ Normal: 500 users for 5 min
├─ Spike: Jump to 5,000 users for 5 min
├─ Recovery: Back to 500 users for 10 min
├─ Objective: "Can the system recover from sudden load?"
└─ Duration: ~25 minutes
Scenario 5: Soak Test (Long-term Stability)
├─ Constant: 1,000 users for 24 hours
├─ Objective: "Does the system stay stable overnight?"
├─ Watch for: Memory leaks, connection leaks, degradation
└─ Duration: 24 hours (run overnight)
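The five scenarios above differ mainly in their load schedule. One way to keep a schedule reviewable is to model it as data before translating it into injection steps; a plain-Java sketch using the Scenario 3 step levels (the Step record and totalMinutes helper are illustrative, not Gatling API):

```java
import java.util.List;

// A step-test schedule as data: each step holds a user level and a duration.
public class StepSchedule {
    record Step(int users, int minutes) {}

    static final List<Step> STEPS = List.of(
        new Step(1_000, 5),
        new Step(2_000, 5),
        new Step(3_000, 5),
        new Step(4_000, 5)
    );

    // Total planned run time across all steps
    // (20 min of steady steps; ramps and cool-down add the rest of the ~25 min)
    public static int totalMinutes(List<Step> steps) {
        return steps.stream().mapToInt(Step::minutes).sum();
    }

    public static void main(String[] args) {
        System.out.println("Planned duration: " + totalMinutes(STEPS) + " min");
    }
}
```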
Step 2c: Data Strategy
Question: What data will you use in the test?
Options:
1. Random data (auto-generated)
└─ No setup needed
└─ Sometimes unrealistic (fake emails, nonsense values)
└─ Good for: Basic load testing
2. CSV file with real data
└─ Realistic values (real user IDs, emails, products)
└─ Can simulate real data patterns
└─ Good for: Realistic testing
3. Custom generator (Java code)
└─ Full control over data generation
└─ Can create complex, business-logic-aware data
└─ Good for: Advanced testing
Example: Testing purchase flow
Option 1 (Bad): Random product ID
├─ productId: "xyz123random456"
└─ API returns: "Product not found" error
└─ Test is invalid!
Option 2 (Good): CSV file
├─ CSV has 1,000 valid product IDs
├─ Test uses: "P123", "P456", "P789", etc.
└─ API returns: Valid product data
Option 3 (Best): Custom generator
├─ Java code knows: "Products are P0001 to P9999"
├─ Java code also knows: "Premium users can buy any product"
├─ Java code also knows: "Normal users can only buy available items"
└─ Test simulates: Real business logic constraints
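A custom generator for Option 3 might look like the following sketch. The P0001-P9999 range comes from the example above; the "even IDs are available" rule is a hypothetical stand-in for real availability logic:

```java
import java.util.Random;

// Business-logic-aware test data generator: emits only valid product IDs,
// and restricts normal users to "available" products.
public class ProductFeeder {
    private static final Random RNG = new Random();

    // Products are P0001 to P9999
    public static String randomProductId() {
        int n = 1 + RNG.nextInt(9999);
        return String.format("P%04d", n);
    }

    // Premium users may buy any product; normal users only even-numbered
    // IDs (a hypothetical stand-in for "available items").
    public static boolean canBuy(boolean premium, String productId) {
        if (premium) return true;
        int n = Integer.parseInt(productId.substring(1));
        return n % 2 == 0;
    }

    public static void main(String[] args) {
        String id = randomProductId();
        System.out.println(id + " premium-buyable: " + canBuy(true, id));
    }
}
```

In Gatling, such a generator plugs into a scenario as a feeder built from an Iterator of Maps, so each virtual user draws fresh, valid data.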
Phase 3: IMPLEMENT (Code Simulations)
This is where you write Gatling code. (See Labs 1-8 for detailed examples.)
Quick template:
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class MyLoadTest extends Simulation {

    // 0. Test data (CSV path is an example)
    FeederBuilder<String> csvFeeder = csv("data/users.csv").circular();

    // 1. Protocol (where to send requests)
    HttpProtocolBuilder httpProtocol = http
        .baseUrl("https://api.example.com")
        .header("Content-Type", "application/json");

    // 2. Scenario (what users do)
    ScenarioBuilder scn = scenario("User Journey")
        .feed(csvFeeder)                       // Inject data
        .exec(http("GET products").get("/products"))
        .pause(5)                              // Think time
        .exec(http("POST purchase").post("/purchase")
            .body(StringBody("{...}")));

    // 3. Setup (how many users, how fast)
    {
        setUp(
            scn.injectOpen(constantUsersPerSec(100).during(600))
        )
        .protocols(httpProtocol)
        .assertions(
            global().responseTime().percentile(95.0).lt(500),      // SLA
            global().successfulRequests().percent().gt(99.0)
        );
    }
}
Phase 4: EXECUTE (Run Tests)
Step 4a: Pre-Test Checklist
Before you hit "run":
Infrastructure:
☐ Staging environment is clean (no other traffic)
☐ Databases are reset to known state
☐ Caches are cleared (or warmed up, depending on scenario)
☐ Monitoring is enabled (Datadog, APM, etc.)
☐ Log level is set appropriately (ERROR level to avoid log spam)
Gatling:
☐ Simulation compiles with no errors
☐ Smoke test passes (1 user, single iteration)
☐ Test data is available (CSV files, feeders)
☐ Assertions are set correctly (SLAs)
☐ Duration is reasonable (don't run 24-hour soak on first try)
Team:
☐ Stakeholders are informed (don't surprise ops team)
☐ On-call engineer is available if needed
☐ Slack/email channel is open for communication
☐ Plan to cancel test if something goes wrong
Step 4b: Execution Pattern
1. Smoke Test (warm-up)
└─ Run with 1 user for 1-2 minutes
└─ Verifies: Code compiles, API responds, no crashes
└─ If fails: Fix and retry before real test
2. Wait 10 minutes
└─ Let system settle
└─ Let logs clear
└─ Let monitoring reset
3. Run Load Test
└─ Your actual test (constant, ramp, step, spike)
└─ Record all metrics and logs
4. Wait 10 minutes
└─ Let system cool down
└─ Monitor for memory leaks or slow recovery
5. If all good: Run next scenario
└─ E.g., after load test, run ramp test
└─ Or, schedule for next day
Don't run back-to-back tests without cool-down!
Step 4c: During Test: Monitor Everything
Gatling console (live):
├─ Active users (ramping up?)
├─ Throughput (RPS increasing or plateau?)
├─ Response times (p95, p99 trending?)
├─ Errors (any appearing?)
└─ Success rate (dropping?)
Server monitoring (Datadog/APM):
├─ CPU usage (climbing?)
├─ Memory (stable or growing?)
├─ Database connections (free or exhausted?)
├─ Disk I/O (at ceiling?)
├─ Network (saturated?)
└─ Any errors in logs?
Be ready to abort if:
❌ Server crashes
❌ Error rate suddenly spikes >10%
❌ System doesn't recover (endless GC pauses)
❌ Cascading failures detected
Phase 5: ANALYZE (Review Results)
Step 5a: Read Gatling Report
After test completes, open:
target/gatling/[simulation-name-timestamp]/index.html
Key sections:
1. Global Stats (top-level summary)
├─ Total requests sent
├─ Success/failure counts
├─ Min, mean, p50, p75, p95, p99 latencies
└─ Requests per second
2. Request Detail (by endpoint)
├─ GET /products: p95=150ms
├─ POST /purchase: p95=500ms ← Slower!
└─ GET /order: p95=200ms
3. Scenario (load ramp-up timeline)
├─ Shows: How many users active at each second
└─ Validates: Did load ramp as expected?
4. Response Time Distribution (histogram)
├─ Visual: Most requests are fast
└─ Tail: Few requests are very slow
5. Errors (if any)
├─ Error type: 500 Internal Server Error
├─ Count: 50 times
└─ Timeline: Appeared after 10 minutes
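The p50/p95/p99 figures in the report are plain percentiles over the recorded response times. A minimal sketch of the computation, using the nearest-rank method (Gatling's internal estimator may differ slightly in rounding):

```java
import java.util.Arrays;

// Nearest-rank percentile over a set of response-time samples (ms).
public class Percentiles {
    public static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Nearest-rank: ceil(p/100 * N), converted to a 0-based index
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samples = {120, 135, 150, 160, 180, 210, 250, 400, 900, 1500};
        System.out.println("p50=" + percentile(samples, 50)
                + " p95=" + percentile(samples, 95)
                + " p99=" + percentile(samples, 99));
    }
}
```

Note how a single 1500ms outlier dominates p95 and p99 while leaving p50 untouched; this is why the report's tail percentiles matter more than the mean.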
Step 5b: Compare Against SLAs
Example: E-commerce load test
SLAs defined:
├─ p95 latency < 500ms
├─ p99 latency < 1000ms
├─ Success rate > 99%
└─ Support 2,000 concurrent users
Results:
├─ p95 latency: 480ms ✓ PASS
├─ p99 latency: 950ms ✓ PASS
├─ Success rate: 99.2% ✓ PASS
└─ Handled 2,000 concurrent ✓ PASS
Final result: ✅ TEST PASSED
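The SLA comparison is a simple all-must-pass gate. A sketch using the thresholds from this example (a real pipeline would read these values from the Gatling results rather than hard-code them):

```java
// Pass/fail gate: every SLA must hold for the test to pass.
public class SlaGate {
    public static boolean passed(long p95Ms, long p99Ms,
                                 double successRatePct, int concurrentUsers) {
        return p95Ms < 500              // p95 latency SLA
            && p99Ms < 1000             // p99 latency SLA
            && successRatePct > 99.0    // success-rate SLA
            && concurrentUsers >= 2_000; // target load reached
    }

    public static void main(String[] args) {
        // Results from the example run above
        boolean ok = passed(480, 950, 99.2, 2_000);
        System.out.println(ok ? "TEST PASSED" : "TEST FAILED");
    }
}
```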
Step 5c: Find Bottlenecks
If test failed, diagnose why:
Symptom 1: Latency increases linearly with load
├─ Likely cause: Resource exhaustion (CPU, memory, disk)
├─ Evidence: Check Datadog CPU/memory graphs
└─ Fix: Optimize code, add caching, scale infrastructure
Symptom 2: Latency increases exponentially
├─ Likely cause: Queue buildup (request backlog)
├─ Evidence: Response times spike, then don't recover
└─ Fix: Increase capacity or reduce incoming load
Symptom 3: Error rate increases suddenly
├─ Likely cause: Something overflowing (connection pool, memory)
├─ Evidence: Check error types in Gatling report
└─ Fix: Increase pool size, fix memory leak, scale service
Symptom 4: Certain endpoints slow, others fast
├─ Likely cause: Specific bottleneck (slow DB query)
├─ Evidence: Datadog traces show slow query in one endpoint
└─ Fix: Optimize that specific query
Symptom 5: p95 OK but p99 bad
├─ Likely cause: GC pauses, occasional queuing
├─ Evidence: p99 spikes, but p95 stable
└─ Fix: JVM tuning, improve code efficiency
Step 5d: Use Datadog to Drill Down
When Gatling says "latency is bad", Datadog tells you WHY:
Query in Datadog:
trace.web.request.duration{service:my-api}
by resource_name
Results:
├─ GET /api/products: p95=150ms ✓ Fast
├─ POST /api/purchase: p95=800ms ⚠ Slow
│ └─ Drill down into traces
│ └─ Find: Call to external payment service (600ms)
│ └─ Root cause: Payment API is slow
│ └─ Fix: Add timeout, use async, add circuit breaker
│
└─ GET /api/user/profile: p95=200ms ✓ Fast
Phase 6: OPTIMIZE
Based on analysis, fix bottlenecks:
Examples:
1. Slow database query
├─ Evidence: Datadog shows 400ms in SELECT query
├─ Fixes:
│ ├─ Add index on WHERE column
│ ├─ Fetch only needed columns (not *)
│ ├─ Add database caching
│ └─ Use read replica for high-traffic endpoints
└─ Re-test: Verify improvement
2. CPU-bound processing
├─ Evidence: CPU 100%, latency high
├─ Fixes:
│ ├─ Profile with JFR (Java Flight Recorder)
│ ├─ Find hot methods
│ ├─ Optimize algorithm or use caching
│ └─ Consider moving to async processing
└─ Re-test: Verify improvement
3. Exhausted connection pool
├─ Evidence: Errors "Too many connections"
├─ Fixes:
│ ├─ Increase pool size
│ ├─ Fix connection leaks (ensure close() called)
│ ├─ Use connection pooling library (HikariCP)
│ └─ Reduce query time (so connections released faster)
└─ Re-test: Verify improvement
4. Memory leak
├─ Evidence: Heap grows 1GB → 8GB over test
├─ Fixes:
│ ├─ Use heap dump analyzer (JProfiler, YourKit)
│ ├─ Find object holding references
│ ├─ Fix leak (remove listener, close resource)
│ └─ Verify leak is gone
└─ Re-test: Verify improvement
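The connection-leak fix in example 3 usually comes down to guaranteeing close() on every code path, which try-with-resources does automatically. A sketch with a hypothetical stand-in connection type (not JDBC; a real fix would apply the same pattern to pool-borrowed connections):

```java
// Leak-free borrowing: try-with-resources guarantees the connection is
// released even when the query throws.
public class LeakFreePool {
    static int openCount = 0;

    // Hypothetical stand-in for a pooled DB connection
    static class Conn implements AutoCloseable {
        Conn() { openCount++; }
        String query(String sql) { return "rows for: " + sql; }
        @Override public void close() { openCount--; }
    }

    public static String fetchUser(int id) {
        // Connection is always closed, even if query() throws
        try (Conn c = new Conn()) {
            return c.query("SELECT * FROM users WHERE id = " + id);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchUser(42));
        System.out.println("open connections after call: " + openCount);
    }
}
```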
Phase 7: RE-TEST
After optimization:
1. Run same test scenario again
└─ Compare metrics to baseline
└─ Verify improvement (e.g., p95 dropped from 500ms → 300ms)
2. Run to new load level
└─ If you optimized, try higher load
└─ If p95 was 500ms at 1000 users, test 2000 users now
3. Run soak test
└─ Ensure optimization didn't introduce memory leak
└─ Run 2-8 hours at normal load
Success criteria:
✅ Metrics improved
✅ New SLAs are met
✅ No new issues introduced
Phase 8: DEPLOY
With confidence:
Before production deployment:
├─ Load test passed ✓
├─ Soak test passed ✓
├─ Code reviewed ✓
├─ Rollback plan ready ✓
└─ Team prepared ✓
Deployment strategy:
├─ Canary: 10% traffic to new version
├─ Monitor: Watch metrics, error rate, latency
├─ Expand: 50% traffic if metrics good
├─ Expand: 100% traffic if still good
└─ Rollback plan: If issues detected, revert instantly
Best Practices Summary
✅ DO:
- Define SLAs before testing
- Test realistic user behavior (think-time, variety)
- Test multiple scenarios (load, ramp, soak, spike)
- Monitor system resources during test
- Use Datadog/APM to find bottlenecks
- Re-test after optimization
- Test in staging, not production
- Document findings
❌ DON'T:
- Run test without clear objective
- Test with unrealistic data or behavior
- Assume one test tells the whole story
- Ignore spike in error rate during test
- Fire all load from one machine (becomes bottleneck)
- Deploy without testing (or with minimal testing)
- Test at same load level each time (no learning)
Next Steps
→ Read next: Open Load Patterns - Specific load pattern techniques