Distributed Testing

Overview

When a single machine can't generate enough load, distribute testing across multiple machines.

Single Machine Limitation

1 machine: 10,000 concurrent users max
Need: 50,000 concurrent users
Solution: Distribute across 5 machines
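
The sizing above is a ceiling division. A minimal sketch, assuming the per-machine limit has been measured for your own simulation (the class and method names are illustrative and not part of Gatling):

```java
// Illustrative helper, not part of Gatling: how many load generators are needed.
final class AgentPlanner {

    // Ceiling division: round up so a partially filled last agent is still counted.
    static int agentsNeeded(int targetUsers, int maxUsersPerAgent) {
        if (targetUsers < 0 || maxUsersPerAgent <= 0) {
            throw new IllegalArgumentException("targetUsers must be >= 0 and maxUsersPerAgent > 0");
        }
        return (targetUsers + maxUsersPerAgent - 1) / maxUsersPerAgent;
    }

    public static void main(String[] args) {
        // 50,000 users at 10,000 users per machine
        System.out.println(agentsNeeded(50_000, 10_000)); // prints 5
    }
}
```

Rounding up matters when the target is not a multiple of the per-machine limit: 50,001 users would need a sixth machine.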

Architecture

┌─────────────────────────────────────────┐
│     Gatling Enterprise Controller       │
│  (Coordinates and aggregates results)   │
└─────────────────────────────────────────┘
         ↙          ↓          ↘
    ┌────────┐ ┌────────┐ ┌────────┐
    │Agent 1 │ │Agent 2 │ │Agent 3 │
    │10,000  │ │10,000  │ │10,000  │
    │users   │ │users   │ │users   │
    │10Gbps  │ │10Gbps  │ │10Gbps  │
    └────────┘ └────────┘ └────────┘
        ↓          ↓          ↓
    ┌──────────────────────────────┐
    │   Target System Under Test   │
    │   (Receives 30,000 users)    │
    └──────────────────────────────┘

Setup Options

Option 1: Gatling Enterprise

✓ Managed platform
✓ Automatic agent coordination
✓ Built-in result aggregation
✓ Visual reporting
✓ Compliance features
✗ Requires subscription

Option 2: Open Source Distributed Setup

✓ Free
✓ Full control
✗ Manual coordination
✗ Manual result aggregation
✗ More operational overhead

Open Source Setup

Step 1: Prepare Agents

On each agent machine:

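# Machine 1, 2, 3 (Ubuntu servers)
# Assumes a custom upload service listens on each agent at this URL;
# open-source Gatling does not provide one out of the box.
curl -X PUT -u admin:admin http://localhost:8080/gatling/data/simulation \
     -H "Content-Type: application/json" \
     -d @simulation.json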

Step 2: Configure Simulation

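// Reference simulation deployed to every agent.
// Each agent runs the same simulation with a different user-ID offset.

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;
import java.util.*;
import java.util.stream.Stream;

public class Sim_DistributedLoad extends Simulation {

    // Get agent ID (0, 1, 2, etc.) from a custom JVM system property
    String agentId = System.getProperty("gatling.agentId", "0");

    // Distribute users across agents
    // Agent 0: users 0-9999
    // Agent 1: users 10000-19999
    // Agent 2: users 20000-29999
    int userOffset = Integer.parseInt(agentId) * 10000;

    // Feeder that hands out sequential user IDs starting at this agent's offset
    Iterator<Map<String, Object>> userFeeder = Stream.iterate(userOffset, id -> id + 1)
        .map(id -> Map.<String, Object>of("userId", id))
        .iterator();

    // Placeholder base URL; replace with the system under test
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    ScenarioBuilder scenario = scenario("Distributed Load")
        .feed(userFeeder)
        .exec(http("Request").get("/api"));

    {   // Example injection profile: this agent's 10,000 users, ramped over 60 seconds
        setUp(scenario.injectOpen(rampUsers(10000).during(60))).protocols(httpProtocol);
    }
}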

Step 3: Run on Each Agent

# Agent 1
mvn gatling:test \
  -Dgatling.simulationClass=Sim_DistributedLoad \
  -Dgatling.agentId=0

# Agent 2
mvn gatling:test \
  -Dgatling.simulationClass=Sim_DistributedLoad \
  -Dgatling.agentId=1

# Agent 3
mvn gatling:test \
  -Dgatling.simulationClass=Sim_DistributedLoad \
  -Dgatling.agentId=2

Step 4: Aggregate Results

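# Collect the result files from each agent, then merge them.
# Plain concatenation assumes identical columns and no header rows;
# otherwise keep the header from the first file only.

cat agent1/results.csv agent2/results.csv agent3/results.csv > combined.csv

# Calculate aggregated metrics
# Total requests = sum of all agents
# P95 latency = p95 of the combined raw data (not an average of per-agent p95s)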
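
If the per-agent files each start with a header row, plain concatenation repeats that header in the middle of the combined file. The sketch below keeps only the first header. It assumes every file has the same columns and exactly one header line; `CsvMerger` and its methods are illustrative names, not part of Gatling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: merges per-agent CSV exports into one file.
// Assumes every non-empty file starts with the same single header row.
final class CsvMerger {

    // Pure merge step: keeps the header of the first non-empty file only.
    static List<String> mergeLines(List<List<String>> files) {
        List<String> merged = new ArrayList<>();
        for (List<String> lines : files) {
            if (lines.isEmpty()) {
                continue; // nothing to merge from an empty export
            }
            int from = merged.isEmpty() ? 0 : 1; // skip repeated headers
            merged.addAll(lines.subList(from, lines.size()));
        }
        return merged;
    }

    // File wrapper: reads each input, merges, and writes the combined file.
    static void merge(List<Path> inputs, Path output) throws IOException {
        List<List<String>> files = new ArrayList<>();
        for (Path input : inputs) {
            files.add(Files.readAllLines(input));
        }
        Files.write(output, mergeLines(files));
    }
}
```

A call such as `CsvMerger.merge(List.of(Path.of("agent1/results.csv"), Path.of("agent2/results.csv"), Path.of("agent3/results.csv")), Path.of("combined.csv"))` would produce the same `combined.csv` as the `cat` command, minus the duplicated headers.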

Synchronization Challenges

Problem 1: Agents Start at Different Times

Agent 1: starts at 10:00:00
Agent 2: starts at 10:00:05  ← 5 second delay
Agent 3: starts at 10:00:10  ← 10 second delay

Result: Load is staggered, not simultaneous

Solution: Synchronized Start

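// Pseudocode: use a barrier to wait for all agents.
// The barrier must be shared across machines (for example through a
// coordination service); java.util.concurrent.CyclicBarrier only
// coordinates threads inside a single JVM.
Barrier barrier = new Barrier(3);  // 3 agents

// All agents wait at the barrier
barrier.await();  // Blocks until all 3 reach this point

// Then start simultaneously
setUp(scenario.injectOpen(...))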
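
A barrier that spans machines needs an external coordinator. A lighter alternative, swapped in here for the barrier and assuming the agents' clocks are synchronized via NTP, is to give every agent the same wall-clock start time and have each one idle until then. The sketch below only computes that idle time; `StartDelay` is an illustrative name, not a Gatling class.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative helper, not part of Gatling: how long an agent should idle
// before injecting load so that all agents begin at the same instant.
final class StartDelay {

    // Returns zero when the agreed start time has already passed.
    static Duration until(Instant startAt, Instant now) {
        Duration wait = Duration.between(now, startAt);
        return wait.isNegative() ? Duration.ZERO : wait;
    }

    public static void main(String[] args) {
        // Hypothetical usage: every agent receives the same ISO-8601 instant
        // as its first argument; default to 30 seconds from now otherwise.
        Instant startAt = args.length > 0 ? Instant.parse(args[0]) : Instant.now().plusSeconds(30);
        System.out.println("Idle for " + until(startAt, Instant.now()));
    }
}
```

In a simulation, the resulting duration could be placed at the front of the injection profile with Gatling's `nothingFor(Duration)` step, so each agent stays idle until the shared start time and then begins injecting users.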

Data Collection & Aggregation

Per-Agent Results

Agent 1: 10,000 requests, p95=450ms, errors=2
Agent 2: 10,000 requests, p95=480ms, errors=3
Agent 3: 10,000 requests, p95=520ms, errors=1

Aggregated Results

Total: 30,000 requests
P95: recompute from the combined raw response times; averaging the per-agent values ((450 + 480 + 520) / 3 ≈ 483ms) is only a rough approximation
Errors: 2 + 3 + 1 = 6 (0.02% error rate)
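
Percentiles do not average: the p95 of the combined data depends on how the slow requests are spread across agents. The sketch below uses the nearest-rank method on raw response times. `Percentiles` and its method names are illustrative, and the sample values in the usage note are made up for demonstration.

```java
import java.util.Arrays;

// Illustrative helper: nearest-rank percentile over raw response times (ms).
final class Percentiles {

    // Nearest-rank method: the smallest sample with at least p% of samples at or below it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Concatenates per-agent samples so the percentile covers all requests.
    static long combinedPercentile(double p, long[]... perAgent) {
        long[] all = Arrays.stream(perAgent).flatMapToLong(Arrays::stream).toArray();
        return percentile(all, p);
    }
}
```

For example, if one agent records ten requests at 100 ms and another records ten at 1000 ms, the per-agent p95 values are 100 and 1000 (average 550), while `combinedPercentile(95, a, b)` returns 1000.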

Network Bandwidth Considerations

Bandwidth Required

Per user: ~1Mbps (about 125KB of data per second)
Per machine (1000 concurrent users): ~1Gbps
Per machine (10,000 concurrent users): ~10Gbps

3 machines with 10,000 users each:
├─ Each machine: 10Gbps
├─ Total to target: 30Gbps
└─ Network: Must have ≥30Gbps capacity

Network Planning

Datacenter network: Typically 10Gbps per server
3 servers: 30Gbps total available
3 servers hitting target: 30Gbps required
Result: Perfect fit (but no headroom)

Better: Use 5 machines with 6,000 users each
├─ Per machine: 6Gbps
├─ Total: 30Gbps (same)
└─ Headroom: Yes, less contention
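
The planning figures above follow from multiplying users by per-user throughput. A minimal sketch, assuming roughly 1 Mbps per user (the rate implied by 1Gbps per 1,000 users); `BandwidthPlanner` is an illustrative name:

```java
// Illustrative helper: estimates the bandwidth a load generator needs.
final class BandwidthPlanner {

    // Users multiplied by per-user throughput, converted from Mbps to Gbps.
    static double requiredGbps(int users, double mbpsPerUser) {
        return users * mbpsPerUser / 1000.0;
    }

    // Fraction of the NIC consumed; values near 1.0 leave no headroom.
    static double utilization(int users, double mbpsPerUser, double nicGbps) {
        return requiredGbps(users, mbpsPerUser) / nicGbps;
    }
}
```

With a 10Gbps NIC, 10,000 users per machine gives a utilization of 1.0 (no headroom), while 6,000 users per machine gives 0.6.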

Best Practices

1. Network Isolation

Agents and target on same network
└─ Minimize latency

Avoid routing through internet
└─ Variable latency ruins test

2. Time Synchronization

# All machines must have synchronized clocks
sudo ntpdate -u ntp.ubuntu.com  # One-off sync against NTP (needs the ntpdate package)

# Verify
timedatectl  # Check clock is synchronized

3. Resource Sizing

Per agent machine:
├─ CPU: 16 cores (for 10,000 users)
├─ RAM: 32GB (for 10,000 users)
├─ Network: 10Gbps+ NIC
└─ Storage: Fast SSD for logging

4. Monitoring Agents

Monitor each agent during test:
├─ CPU: Should not exceed 80%
├─ Memory: Should not exceed 80%
└─ Network: Should not exceed 90%

If exceeded: Add more agents, reduce per-agent users
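
The thresholds above can be encoded as a simple check. This is a sketch only: how the utilization figures are collected is left to your monitoring tool, and `AgentHealth` is an illustrative name.

```java
// Illustrative helper: flags an agent that is itself becoming the bottleneck.
// Inputs are utilization fractions between 0.0 and 1.0.
final class AgentHealth {
    static final double CPU_LIMIT = 0.80;
    static final double MEMORY_LIMIT = 0.80;
    static final double NETWORK_LIMIT = 0.90;

    // True when any resource exceeds its limit, meaning the measured latency
    // may reflect the load generator rather than the target system.
    static boolean overloaded(double cpu, double memory, double network) {
        return cpu > CPU_LIMIT || memory > MEMORY_LIMIT || network > NETWORK_LIMIT;
    }
}
```

An agent at 85% CPU is flagged even if memory and network are well within limits, because a saturated load generator distorts the latency it reports.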

Troubleshooting

Issue: Uneven Load Distribution

Agent 1: 10,000 requests
Agent 2: 8,000 requests
Agent 3: 9,000 requests

Problem: Agents started at different times
Solution: Add synchronization barrier

Issue: Agent Runs Out of Memory

Error: OutOfMemoryError
Solution: Reduce users per agent or increase JVM heap

Issue: Network Bandwidth Maxed

Observation: Network at 100%, latency high
Solution: Add more agents with fewer users each

Gatling Enterprise Alternative

For production-grade distributed testing:

Pros:
✓ Automatic scaling (0-100,000+ users)
✓ Managed cloud infrastructure
✓ Built-in reporting
✓ Real-time dashboards
✓ Compliance features

Cons:
✗ Cost ($$$)
✗ Less control
✗ Vendor lock-in

Use when: Load >50,000 users, team size >5, budget available

Key Takeaways

  1. Distributed testing = Multiple machines generating load
  2. Coordination = Synchronize start, aggregate results
  3. Network bandwidth = Plan for 10-30Gbps
  4. Agent sizing = 10,000 users per 16-core machine
  5. Monitoring = Watch CPU, memory, network on each agent
  6. Gatling Enterprise = Simplified alternative for large tests
