What is Performance Testing?

Overview

Performance testing is the practice of systematically evaluating how your system behaves under realistic and stress conditions. Instead of testing whether your system works, you're testing whether it works well enough under expected and extreme loads.

The Core Question

"How does our system behave as load increases?"

Why Performance Testing Matters

Imagine releasing a new e-commerce feature:

✓ All unit tests pass
✓ All integration tests pass
✓ Manual testing passes
✓ Code review approved
  ... but on Black Friday sales launch...

❌ Site crashes under 10,000 concurrent users
❌ Response times jump from 200ms to 30 seconds
❌ Database connections exhausted
❌ Cache thrashing, cascading failures

Performance testing would have caught this before production.

Business Impact of Performance Issues

Scenario	Impact
Response time >3 seconds	40% user abandonment (Google studies)
Site downtime	~$5,600 lost per minute for mid-size e-commerce
Slow API	10% decrease in mobile app engagement
Database bottleneck	100% CPU, service unavailable, SLA breach

What Performance Testing Reveals

1. Capacity: How Much Load Can You Handle?

Load (users) ↑    Latency (response time)
    1,000         150ms ✓ Acceptable
   10,000         200ms ✓ Still good
  100,000         500ms ⚠ Approaching limit
  500,000       5,000ms ❌ Unacceptable
1,000,000       System crashes

Outcome: You know the maximum sustainable load before degradation.

2. Performance: How Fast Do Responses Come?

Without load:     GET /api/users → 50ms  (server idle)
With 100K users:  GET /api/users → 500ms (CPU-bound)
With 500K users:  GET /api/users → 3000ms (I/O-bound, DB queue)

Outcome: You understand the performance curve and where bottlenecks emerge.

3. Stability: Does the System Degrade Gracefully?

✓ Good: Latency increases linearly, recovers when load drops
✗ Bad: Latency spikes unpredictably, doesn't recover
❌ Catastrophic: Crash, data corruption, cascading failures

Outcome: You know if your system can degrade gracefully under overload.

4. Bottlenecks: Where is Time Actually Spent?

Request Timeline (500ms total):
├─ API Gateway: 10ms
├─ Authentication: 5ms
├─ Database query: 400ms ⬅️ BOTTLENECK!
├─ Serialization: 50ms
├─ Network: 35ms
└─ Total: 500ms

Outcome: You know exactly what to optimize (database in this case).

Types of Performance Testing

1. Load Testing

Tests system behavior under expected, realistic load.

Scenario: E-commerce site expecting 500 concurrent users during normal hours

Load pattern:
├─ Start: 0 users
├─ Ramp up: 500 users over 5 minutes
├─ Hold: 500 users for 30 minutes
└─ Ramp down: Back to 0 users

Objective: Does the system meet SLAs (p95 <500ms, <0.1% errors)?

Metric	Target	Status
p95 latency	<500ms	400ms ✓
p99 latency	<1000ms	800ms ✓
Error rate	<0.1%	0.05% ✓
Result	Pass SLA	✅ PASS

2. Stress Testing

Tests system behavior at breaking point and beyond.

Scenario: "What is the absolute maximum load we can sustain?"

Load pattern:
├─ Start: 0 users
├─ Ramp up: 5,000 users per minute (very aggressive)
├─ Continue until: System crashes or latency becomes unbearable
└─ Observe: Where does it break?

Observations:
├─ 2,000 users: p95=400ms (OK)
├─ 5,000 users: p95=1000ms (slow but stable)
├─ 10,000 users: p95=5000ms (degrading)
├─ 12,000 users: p99=30,000ms (queue building up)
└─ 15,000 users: 50% errors, cascading failures

Conclusion: Safe limit is ~10,000 concurrent users

3. Soak Testing

Tests system stability over extended time under normal load.

Scenario: "Does the system stay healthy over 24 hours?"

Load pattern:
├─ Start: 0 users
├─ Ramp: 500 users over 10 minutes
├─ Hold: 500 users for 24 hours (86,400 seconds)
├─ Monitor: Memory, CPU, connection pools, GC pauses

Potential issues detected:
├─ Memory leak: Heap grows from 1GB → 8GB over 24 hours
├─ Connection pool exhaustion: Connections not returned
├─ Slowdown: p95 latency drifts from 200ms → 1000ms
└─ Cache thrashing: Cache hit rate degrades

Action: Find and fix leaks before production deployment

4. Spike Testing

Tests system recovery from sudden traffic surge.

Scenario: "New feature goes viral on Twitter (10,000 sudden users)"

Load pattern:
├─ Start: 100 users (normal)
├─ Spike: Jump to 10,000 users instantly
├─ Hold: 10,000 users for 5 minutes
├─ Normal: Drop back to 100 users
└─ Observe: Recovery behavior

Expected behavior:
├─ During spike: Acceptable latency increase (maybe 2-3x)
├─ Circuit breaker triggers: Graceful degradation
├─ Auto-scaling activates: New instances spin up
├─ After spike: Smoothly returns to normal latency

Bad behavior:
├─ System crashes (cascading failures)
├─ Doesn't recover even after load drops
├─ Database locked, data corruption

Comparison: All Four Types

Type	Load Pattern	Duration	Purpose	Real-World Example
Load	Realistic, constant	5-30 min	Baseline performance	Normal business hours
Stress	Continuously increase	Until break	Find limits	How much can we grow?
Soak	Normal load, extended	8-24 hours	Long-term stability	Overnight batch processing
Spike	Sudden jump, then drop	5-10 min	Recovery behavior	Viral content, flash sales

Performance Testing Lifecycle

1. PLAN
   ├─ Define objectives (SLAs)
   ├─ Identify critical paths
   └─ Set success criteria (p95 <500ms, <0.1% errors)

2. DESIGN SCENARIOS
   ├─ User journeys (realistic behavior)
   ├─ Data composition (realistic data mix)
   └─ Load patterns (constant, ramp, spike, etc.)

3. IMPLEMENT SIMULATIONS
   ├─ Code scenarios in Gatling
   ├─ Create feeders (user data, product data)
   └─ Set up assertions and checks

4. EXECUTE TESTS
   ├─ Run in staging environment (not production!)
   ├─ Monitor system metrics (CPU, memory, disk)
   ├─ Capture APM traces (Datadog)
   └─ Record all metrics

5. ANALYZE RESULTS
   ├─ Compare against SLAs
   ├─ Find bottlenecks (CPU, database, network)
   ├─ Correlate with application traces
   └─ Identify root causes

6. OPTIMIZE & RE-TEST
   ├─ Fix bottleneck (scale, cache, optimize)
   ├─ Re-run same test
   ├─ Verify improvement
   └─ Repeat until SLA met

7. DEPLOY CONFIDENTLY
   ├─ Production deployment with confidence
   ├─ Set up monitoring alerts
   └─ Track real-world performance

Key Questions Answered by Performance Testing

Before Performance Testing

❓ "Will our system handle 10,000 users?"
❓ "What's the maximum load we can sustain?"
❓ "Where will our system break?"
❓ "How do we reduce latency?"
❓ "Can we scale to 100x current traffic?"

After Performance Testing

✅ "Yes, it handles 10,000 users with p95 latency of 400ms"
✅ "Maximum sustainable load is 15,000 concurrent users"
✅ "Database queries are the bottleneck (400ms out of 500ms)"
✅ "Add query caching and connection pooling; optimize N+1 queries"
✅ "With the optimizations, yes; scaling is straightforward"

Common Misconceptions

❌ "Performance testing is only for high-traffic sites"

Reality: Any system deployed to production should be performance tested. - A banking app with 1,000 users needs different SLAs than an e-commerce site with 100,000 users - Even "small" systems degrade under load; you need to know your limits

❌ "We'll optimize after deployment if needed"

Reality: Post-production optimization is exponentially more expensive. - Production issues impact real users (reputation damage, revenue loss) - Debugging under real load is chaotic - Scaling out without understanding bottlenecks wastes infrastructure

❌ "One performance test is enough"

Reality: You need multiple test types. - Load test establishes baseline - Stress test reveals limits - Soak test catches long-term issues - Spike test validates resilience - Re-test after every major change

❌ "Performance testing is slow and expensive"

Reality: Modern frameworks like Gatling make it fast and cheap. - Gatling simulates 10,000+ concurrent users on a single machine - No need for physical test labs with 10,000 computers - Tests complete in 10-30 minutes - Total cost: low (just infra for a staging environment)

When to Perform Load Tests

✅ Always Test Before:

Major release to production
High-traffic event (Black Friday, sale launch)
Feature scaling (new service, redesign)
Infrastructure change (database migration, version upgrade)
Deployment to production (final gate)

✅ Regularly Test:

Before each sprint release (part of Definition of Done)
After optimization (verify improvement)
Quarterly baseline (track system behavior over time)
Before and after scaling (validate scaling strategy)

⚠️ Be Careful When Testing:

Against production systems → Always use staging
During business hours → Could impact monitoring, logs
Without stakeholder knowledge → Could trigger alerts

What You'll Learn

In this documentation, you'll learn how to:

Design realistic load testing scenarios
Implement tests in Gatling (HTTP and Kafka)
Run tests with different load patterns (constant, ramp, step, spike)
Analyze results and find bottlenecks
Monitor tests with Datadog APM
Optimize your system based on findings
Integrate load testing into CI/CD pipelines

Next Steps

→ Read next: Key Metrics & Measurements - Understand what to measure and why

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search