Performance Testing Guide

Overview

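This guide describes performance testing practices for BitVelocity services: the test types and their schedules, how to write tests with k6 and Gatling, how to define baselines, and how to analyze results.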

Performance Testing Pyramid

         /\
        /E2E\         Endurance Tests (monthly)
       /------\
      / Stress \      Stress & Spike Tests (weekly)
     /----------\
    /   Load     \    Load Tests (nightly)
   /--------------\
  /     Smoke      \  Smoke Tests (every PR)
 /------------------\

Test Types

1. Smoke Tests (CI Pipeline)

Purpose: Quick validation, regression detection
Tool: k6
Duration: < 1 minute
Load: 10 VUs
Frequency: Every PR

Example:

cd bv-performance-testing/k6-scripts
k6 run --vus 10 --duration 30s api-smoke-test.js

Success Criteria:

p95 latency < 200ms
Error rate < 1%
No degradation > 10% vs baseline

2. Load Tests

Purpose: Validate performance under expected load
Tool: Gatling (Java)
Duration: 5-15 minutes
Load: Ramp to 100-500 users
Frequency: Nightly

Example:

cd bv-performance-testing/gatling-tests
./gradlew gatlingRun -Psimulation=OrderFlowSimulation

Success Criteria:

Meet SLI targets (see performance-baselines/sli-targets.yaml)
Error rate < 1%
Resource utilization < 80%

3. Stress Tests

Purpose: Find system breaking points
Tool: Gatling (Java)
Duration: 15-30 minutes
Load: Incrementally increase until failure
Frequency: Weekly

Key Metrics:

Maximum sustainable load
Point of degradation
Recovery behavior
Error patterns
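
One way to derive the first two metrics from a stepped stress run is to record p95 latency and error rate per load step, then take the highest step that still meets the thresholds. The sketch below is illustrative only: the StressAnalysis class, the Step layout, and the thresholds (200 ms, 1%, borrowed from the smoke-test criteria) are assumptions, not part of the BitVelocity tooling.

```java
import java.util.List;

/** Hypothetical helper: locates the degradation point in stepped stress-test results. */
final class StressAnalysis {

    /** One load step: users, observed p95 latency (ms), and error rate (0.0 to 1.0). */
    static final class Step {
        final int users;
        final double p95Millis;
        final double errorRate;

        Step(int users, double p95Millis, double errorRate) {
            this.users = users;
            this.p95Millis = p95Millis;
            this.errorRate = errorRate;
        }

        boolean healthy(double maxP95Millis, double maxErrorRate) {
            return p95Millis < maxP95Millis && errorRate < maxErrorRate;
        }
    }

    /**
     * Returns the user count of the last healthy step before the first unhealthy one,
     * or 0 if even the first step breaches a threshold. Steps must be in ascending load order.
     */
    static int maxSustainableLoad(List<Step> steps, double maxP95Millis, double maxErrorRate) {
        int sustainable = 0;
        for (Step step : steps) {
            if (!step.healthy(maxP95Millis, maxErrorRate)) {
                break; // the first breach marks the point of degradation
            }
            sustainable = step.users;
        }
        return sustainable;
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
            new Step(100, 120.0, 0.001),
            new Step(200, 150.0, 0.002),
            new Step(300, 240.0, 0.004), // p95 breaches 200 ms here
            new Step(400, 900.0, 0.080));
        System.out.println("Max sustainable load: "
            + maxSustainableLoad(steps, 200.0, 0.01) + " users");
    }
}
```

The step after the returned load is the point of degradation; recovery behavior and error patterns still need to be read from the run itself.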

4. Spike Tests

Purpose: Test sudden load increases (flash sales, viral content)
Tool: Gatling (Java)
Duration: 5-10 minutes
Load: Immediate spike to 200+ users
Frequency: Weekly

Validation:

Circuit breakers activate appropriately
No cascading failures
System recovers gracefully
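
"Recovers gracefully" is easier to validate when it is measured. A minimal sketch, assuming p95 latency is sampled at fixed intervals after the spike ends: find the first sample from which latency stays at or below a limit. The SpikeRecovery class and the 200 ms limit are hypothetical choices for illustration.

```java
/** Hypothetical helper: finds when latency settles after a spike. */
final class SpikeRecovery {

    /**
     * Returns the index of the first sample from which every later sample stays
     * at or below the limit, or -1 if the series has not settled by its end.
     */
    static int recoveryIndex(double[] p95Millis, double limitMillis) {
        int index = -1;
        // Walk backwards from the end: the settled tail ends at the last breach.
        for (int i = p95Millis.length - 1; i >= 0; i--) {
            if (p95Millis[i] > limitMillis) {
                break;
            }
            index = i;
        }
        return index;
    }

    public static void main(String[] args) {
        // One p95 sample every 10 seconds after the spike ends (made-up values).
        double[] samples = {900, 600, 250, 180, 190, 170};
        int index = recoveryIndex(samples, 200.0);
        System.out.println(index < 0
            ? "Not recovered within the observation window"
            : "Recovered after " + (index * 10) + " seconds");
    }
}
```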

5. Endurance Tests

Purpose: Detect memory leaks, resource exhaustion
Tool: Gatling (Java)
Duration: 2-4 hours
Load: Sustained moderate load
Frequency: Monthly or before major releases

Monitor:

Memory usage trends
Connection pool leaks
Database query performance over time
Cache behavior
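
A memory usage trend can be reduced to a single number by fitting a line through periodic heap samples. The sketch below computes a least-squares slope; the HeapTrend class is hypothetical, and the samples are assumed to be heap-in-use readings taken at known times during the run. What slope counts as a leak depends on the service and is not defined here.

```java
/** Hypothetical helper: estimates heap growth over an endurance run with a least-squares fit. */
final class HeapTrend {

    /**
     * Returns the least-squares slope of heap usage in MB per minute.
     * minutes[i] is the sample time and heapMb[i] the heap in use at that time.
     */
    static double slopeMbPerMinute(double[] minutes, double[] heapMb) {
        if (minutes.length != heapMb.length || minutes.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        int n = minutes.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += minutes[i] / n;
            meanY += heapMb[i] / n;
        }
        double covariance = 0;
        double variance = 0;
        for (int i = 0; i < n; i++) {
            covariance += (minutes[i] - meanX) * (heapMb[i] - meanY);
            variance += (minutes[i] - meanX) * (minutes[i] - meanX);
        }
        if (variance == 0) {
            throw new IllegalArgumentException("sample times must differ");
        }
        return covariance / variance;
    }

    public static void main(String[] args) {
        double[] minutes = {0, 30, 60, 90, 120};
        double[] heapMb = {512, 540, 566, 598, 625}; // made-up readings
        System.out.printf("Heap growth: %.2f MB/min%n", slopeMbPerMinute(minutes, heapMb));
    }
}
```

A slope near zero over 2-4 hours suggests stable memory; a steady positive slope is a reason to take heap dumps and compare them.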

Writing Performance Tests

Gatling (Java) Test Structure

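import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

import java.time.Duration;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;
import java.util.stream.Stream;

public class MySimulation extends Simulation {

    // HTTP Protocol Configuration
    HttpProtocolBuilder httpProtocol = http
        .baseUrl("http://localhost:8080")
        .acceptHeader("application/json");

    // Test Data Feeder: an endless stream of random user IDs (USER-0 to USER-999)
    Iterator<Map<String, Object>> feeder =
        Stream.generate((Supplier<Map<String, Object>>) () -> Map.of(
            "userId", "USER-" + ThreadLocalRandom.current().nextInt(1000)
        )).iterator();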

    // Scenario Definition
    ScenarioBuilder scenario = scenario("My Scenario")
        .feed(feeder)
        .exec(
            http("Request Name")
                .get("/api/v1/resource")
                .check(status().is(200))
        )
        .pause(Duration.ofSeconds(1));

    // Load Profile
    {
        setUp(
            scenario.injectOpen(
                rampUsers(100).during(Duration.ofMinutes(5))
            ).protocols(httpProtocol)
        ).assertions(
            global().responseTime().percentile(95.0).lt(200),
            global().successfulRequests().percent().gt(95.0)
        );
    }
}

k6 Test Structure

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,
  duration: '30s',
  thresholds: {
    http_req_duration: ['p(95)<200'],
    http_req_failed: ['rate<0.01'], // built-in failure-rate metric; 'errors' was never defined
  },
};

export default function () {
  const response = http.get('http://localhost:8080/api/v1/resource');

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });

  sleep(1);
}

Performance Baselines

All services must define SLI targets in performance-baselines/sli-targets.yaml:

services:
  order-service:
    availability_target: 99.0
    endpoints:
      - path: "POST /api/v1/orders"
        latency:
          p50: 50ms
          p95: 200ms
          p99: 500ms
        throughput: 100
        error_rate_max: 1.0
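
To show how these targets might be consumed, the sketch below compares measured latencies with targets written in the same "200ms" form. It is a hypothetical helper: the SliCheck class and its map-based input are assumptions, and reading the YAML file itself is left out because it needs a YAML library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: compares measured latencies with SLI targets such as "200ms". */
final class SliCheck {

    /** Parses a target such as "200ms" into milliseconds; only the "ms" suffix is handled. */
    static long parseMillis(String value) {
        String trimmed = value.trim();
        if (!trimmed.endsWith("ms")) {
            throw new IllegalArgumentException("expected a value like 200ms, got: " + value);
        }
        return Long.parseLong(trimmed.substring(0, trimmed.length() - 2).trim());
    }

    /** Returns one message per percentile whose measurement is missing or above its target. */
    static List<String> violations(Map<String, String> targets, Map<String, Double> measuredMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> target : targets.entrySet()) {
            Double measured = measuredMillis.get(target.getKey());
            long limit = parseMillis(target.getValue());
            if (measured == null) {
                result.add(target.getKey() + ": no measurement");
            } else if (measured > limit) {
                result.add(target.getKey() + ": " + measured + "ms > " + limit + "ms");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> targets = new LinkedHashMap<>();
        targets.put("p50", "50ms");
        targets.put("p95", "200ms");
        targets.put("p99", "500ms");
        // Made-up measurements for POST /api/v1/orders
        Map<String, Double> measured = Map.of("p50", 42.0, "p95", 231.5, "p99", 480.0);
        System.out.println(violations(targets, measured));
    }
}
```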

Analyzing Results

Key Metrics

Latency Percentiles:

p50 (median): Typical user experience
p95: Latency that 95% of requests stay under
p99: Tail latency affecting the slowest 1% of requests
p99.9: Extreme outliers

Throughput:

Requests per second
Transactions per second
Messages per second

Error Rate:

HTTP 5xx errors
Timeouts
Connection failures

Resource Utilization:

CPU usage
Memory usage
Network I/O
Disk I/O
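
To make the latency percentiles above concrete, the sketch below computes them from raw samples with the nearest-rank method. The LatencyPercentiles class is hypothetical; k6 and Gatling compute percentiles internally and may interpolate, so their values can differ slightly from this method.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over raw latency samples. */
final class LatencyPercentiles {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] samplesMillis, double p) {
        if (samplesMillis.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Made-up samples: mostly fast requests plus two slow ones.
        double[] latencies = {12, 15, 18, 20, 22, 25, 31, 40, 95, 480};
        System.out.println("p50 = " + percentile(latencies, 50) + " ms");
        System.out.println("p95 = " + percentile(latencies, 95) + " ms");
    }
}
```

With these samples the median stays low while p95 lands on the slowest request, which is why averages alone hide tail latency.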

Gatling Reports

Gatling generates HTML reports under build/reports/gatling/. Each report includes:

Request statistics
Response time distribution
Response time percentiles over time
Requests per second
Active users over time

k6 Output

A simplified view of the k6 end-of-test summary (the real summary reports additional fields per metric):

{
  "metrics": {
    "http_req_duration": {
      "avg": 45.2,
      "min": 12.3,
      "max": 156.7,
      "p(90)": 89.4,
      "p(95)": 112.8
    },
    "http_reqs": 12543,
    "http_req_failed": 0.002
  }
}

Performance Optimization Strategies

1. Database

Add appropriate indexes
Optimize N+1 queries
Use connection pooling
Consider read replicas
Implement caching

2. Caching

Cache hot data (80/20 rule)
Set appropriate TTLs
Monitor cache hit ratio (target: > 80%)
Implement cache warming

3. HTTP/API

Use HTTP/2 for multiplexing
Implement pagination
Compress responses (gzip)
Use ETags for conditional requests
Implement rate limiting

4. Messaging

Batch message processing
Optimize consumer configuration
Monitor consumer lag
Use appropriate partitioning

5. JVM

Tune heap size (-Xms, -Xmx)
Monitor GC pauses
Use G1GC or ZGC for low latency
Enable JVM metrics
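
As a starting point for the monitoring items above, the standard java.lang.management MXBeans expose heap usage and cumulative GC activity without extra dependencies. The JvmSnapshot class below is a hypothetical sketch; note that getCollectionTime reports accumulated collection time per collector, which is not the same as pause time for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustrative snapshot of heap usage and cumulative GC activity via the standard MXBeans. */
final class JvmSnapshot {

    /** Heap currently in use, in bytes. */
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Collection time since JVM start in milliseconds, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Heap used (MB): " + heapUsedBytes() / (1024 * 1024));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```

Sampling these values before and after a test run gives a rough view of GC cost under load; a metrics library or GC logs give finer detail.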

Troubleshooting

High Latency

Check database query performance
Look for N+1 query problems
Check cache hit ratio
Monitor GC pauses
Check for network issues
Look at distributed traces

High Error Rate

Check logs for error patterns
Monitor circuit breaker states
Check resource exhaustion
Verify connection pool settings
Look for cascading failures

Resource Exhaustion

Monitor heap usage
Check for connection leaks
Monitor thread pool utilization
Check disk I/O
Monitor network saturation

CI/CD Integration

The k6 smoke test runs in the CI pipeline on every PR:

# .github/workflows/performance-smoke.yml
- name: Run k6 Smoke Test
  run: |
    cd bv-performance-testing/k6-scripts
    k6 run --out json=results.json api-smoke-test.js

- name: Check Performance Regression
  run: |
    # Compare with baseline
    # Fail if degradation > 10%
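
The regression check step above is a placeholder. A minimal sketch of the comparison it describes, assuming the baseline and current p95 values have already been extracted from the stored baseline and results.json (that extraction is not shown): the RegressionGate class and its command-line arguments are hypothetical.

```java
/** Hypothetical helper: fails a build when a metric regresses more than an allowed percentage. */
final class RegressionGate {

    /** Relative change of current vs. baseline in percent; positive means slower. */
    static double degradationPercent(double baseline, double current) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (current - baseline) / baseline * 100.0;
    }

    /** True when the current value is within the allowed degradation. */
    static boolean passes(double baseline, double current, double maxDegradationPercent) {
        return degradationPercent(baseline, current) <= maxDegradationPercent;
    }

    public static void main(String[] args) {
        // Arguments: <baselineP95Millis> <currentP95Millis>; defaults are made-up values.
        double baseline = args.length > 0 ? Double.parseDouble(args[0]) : 100.0;
        double current = args.length > 1 ? Double.parseDouble(args[1]) : 108.0;
        System.out.printf("p95 change vs baseline: %+.1f%%%n",
            degradationPercent(baseline, current));
        if (!passes(baseline, current, 10.0)) {
            System.exit(1); // a non-zero exit code fails the CI step
        }
    }
}
```

Values that sit exactly on the 10% boundary can fall either way because of floating-point rounding, so a small tolerance may be worth adding in practice.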

Best Practices

Always define baselines before optimization
Test one change at a time to isolate impact
Run tests multiple times to account for variance
Monitor system metrics during tests
Document findings for future reference
Update baselines after verified improvements
Test realistic scenarios, not just synthetic loads
Include think time to simulate real users
Use production-like data volumes
Test failure scenarios, not just happy paths
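
For the advice on running tests multiple times, one simple way to account for variance is to compare the spread of a metric across runs with its mean. The RunVariance class below is a hypothetical sketch; what spread is acceptable is a team decision and is not defined in this guide.

```java
import java.util.Arrays;

/** Hypothetical helper: summarizes the same metric across repeated test runs. */
final class RunVariance {

    static double mean(double[] values) {
        return Arrays.stream(values).average()
            .orElseThrow(() -> new IllegalArgumentException("no runs"));
    }

    /** Sample standard deviation (n - 1 in the denominator); needs at least two runs. */
    static double stdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        double mean = mean(values);
        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }

    /** Coefficient of variation in percent: spread relative to the mean. */
    static double coefficientOfVariationPercent(double[] values) {
        return stdDev(values) / mean(values) * 100.0;
    }

    public static void main(String[] args) {
        double[] p95PerRun = {182, 190, 176, 201, 185}; // made-up p95 values in ms
        System.out.printf("mean p95 = %.1f ms, variation = %.1f%%%n",
            mean(p95PerRun), coefficientOfVariationPercent(p95PerRun));
    }
}
```

If the variation between identical runs is close to the 10% regression threshold, a single run cannot reliably distinguish a regression from noise.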

Getting Help

Check bv-performance-testing/README.md
Review example tests in gatling-tests/src/gatling/java/
Look at SLI targets in performance-baselines/sli-targets.yaml
Ask in team channel with test results attached