Datadog Integration
Overview
Datadog APM (Application Performance Monitoring) allows you to:
- Monitor your system during load tests in real time
- Correlate application metrics with load test metrics
- Identify bottlenecks at the service level
- Trace individual requests end-to-end
- Alert when performance degradations occur
Why Monitor Load Tests?
Without Monitoring
Gatling Report: p95 = 5000ms
❓ Why is latency so high?
❓ Is it the database? The API gateway? Network?
❓ Where should I optimize?
With Datadog
Gatling Report: p95 = 5000ms
Datadog shows: Database queries taking 4500ms
→ Root cause: N+1 query problem
→ Action: Batch or eager-load the queries (or add a query cache)
Setup
1. Install Datadog Agent
2. Set Datadog API Key
# Get the API key from the Datadog dashboard
# Set the environment variable the Agent reads (DD_API_KEY)
export DD_API_KEY="your-api-key-here"
3. Enable Traces
# In the Datadog Agent config
# Edit: /etc/datadog-agent/datadog.yaml (Linux) or /opt/datadog-agent/etc/datadog.yaml (macOS)
apm_config:
  enabled: true
  receiver_port: 8126
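After restarting the Agent, it can help to confirm that something is actually listening on the trace port before starting a load test. The following is a minimal sketch, not part of any Datadog library: `TraceAgentCheck` and `isListening` are hypothetical names, and it assumes the Agent runs on localhost with port 8126. It only verifies that a TCP listener exists, not that the listener is the Datadog Agent.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not a Datadog API): checks whether a TCP listener
// is accepting connections on the given host and port.
class TraceAgentCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the Agent runs locally and uses trace port 8126.
        boolean up = isListening("localhost", 8126, 1000);
        System.out.println(up
                ? "Port 8126 is accepting connections"
                : "Nothing is listening on port 8126; check the Agent config");
    }
}
```

Running this before each test run avoids discovering afterwards that no traces were collected.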
Instrument Your Application
For Java Applications
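Add the agent to pom.xml so the build can fetch the JAR (instrumentation itself still requires the -javaagent flag shown under Run with Agent):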
<dependency>
<groupId>com.datadoghq</groupId>
<artifactId>dd-java-agent</artifactId>
<version>1.20.0</version>
</dependency>
Run with Agent
java -javaagent:/path/to/dd-java-agent.jar \
-Ddd.service=my-api \
-Ddd.env=staging \
-Ddd.trace.sample.rate=1.0 \
-jar my-app.jar
Tagging Best Practices
Add custom tags to correlate with Gatling tests:
# Tag your test run
export DD_TAGS="env:staging,test:load-test-001,version:1.0"
# Tag in Gatling
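// In your simulation. Note: the tracer reads dd.tags at startup, so this only
// takes effect if it runs before the tracer initializes; passing
// -Ddd.tags=... on the JVM command line is more reliable.
System.setProperty("dd.tags", "env:staging,test_name:Lab1_BasicHttp");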
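Tag strings typed by hand in several places tend to drift apart, which breaks the correlation between a test run and its metrics. One option is to build the string once from a map and reuse it. This is a small sketch under stated assumptions: `DdTags` and `format` are hypothetical names, and the `test_run` key is an illustrative tag, not something Datadog requires.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds the "key:value,key:value" string used for dd.tags.
class DdTags {

    // Joins each entry as key:value and separates entries with commas.
    static String format(Map<String, String> tags) {
        return tags.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is stable.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("env", "staging");
        tags.put("test_name", "Lab1_BasicHttp");
        tags.put("test_run", "load-test-001"); // illustrative run identifier
        System.out.println(format(tags));
        // prints: env:staging,test_name:Lab1_BasicHttp,test_run:load-test-001
    }
}
```

The resulting string can be passed as `-Ddd.tags=...` when launching the JVM, so the application and the test run share the same tag values.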
Metrics to Monitor
System Metrics
- CPU Usage: Should increase with load
- Memory: Monitor for leaks
- Disk I/O: Database bottleneck indicator
- Network: Bandwidth saturation
Application Metrics
- Requests per second (RPS): Throughput the service sustains
- Error rate: Should stay below 1%
- P95 latency: Tail response time (95% of requests finish faster)
- Apdex: User satisfaction score between 0 and 1
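To make these application metrics concrete, here is a sketch of how they can be computed from raw latency samples. It is illustrative only: Datadog computes its own aggregates, and `LoadTestStats` is a hypothetical class. The percentile uses the nearest-rank method, and Apdex uses the standard formula (satisfied + tolerating / 2) / total, where a request is satisfied at or below threshold T and tolerating at or below 4T. The sample values and the 100ms threshold are assumptions.

```java
import java.util.Arrays;

// Hypothetical helper: computes the application metrics from raw samples.
class LoadTestStats {

    // Nearest-rank percentile: the smallest sample such that at least
    // pct percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMs, double pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Failed requests as a percentage of all requests.
    static double errorRatePercent(long errors, long total) {
        return total == 0 ? 0.0 : 100.0 * errors / total;
    }

    // Apdex = (satisfied + tolerating / 2) / total, with
    // satisfied <= T and T < tolerating <= 4T.
    static double apdex(long[] latenciesMs, long thresholdMs) {
        if (latenciesMs.length == 0) return 0.0;
        long satisfied = 0;
        long tolerating = 0;
        for (long ms : latenciesMs) {
            if (ms <= thresholdMs) satisfied++;
            else if (ms <= 4 * thresholdMs) tolerating++;
        }
        return (satisfied + tolerating / 2.0) / latenciesMs.length;
    }

    public static void main(String[] args) {
        // Made-up sample latencies in milliseconds.
        long[] samples = {80, 90, 100, 110, 120, 150, 200, 450, 900, 5000};
        System.out.println("p95 (ms): " + percentile(samples, 95));
        System.out.println("error rate (%): " + errorRatePercent(3, 1000));
        System.out.println("apdex (T=100ms): " + apdex(samples, 100));
    }
}
```

With only ten samples the p95 is the slowest request, which shows why tail percentiles need a reasonably large sample before they are meaningful.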
Database Metrics
- Query latency: Breakdown by query
- Connection pool: Exhaustion indicator
- Slow queries: Queries taking longer than 1000ms
- Lock contention: Concurrent access issues
Creating Dashboards
Custom Dashboard Example
1. Database Performance
├─ Avg Query Time (by query type)
├─ Query Count (by table)
├─ Connection Pool Usage
└─ Slow Query Alert
2. API Performance
├─ Requests per Second
├─ P95 Latency (by endpoint)
├─ Error Rate
└─ Status Code Distribution
3. System Resources
├─ CPU %
├─ Memory %
├─ Disk I/O
└─ Network Bandwidth
Key Metrics During Load Test
| Metric | Baseline | Under Load | Status / Action |
|---|---|---|---|
| p95 latency | 100ms | <500ms | Good |
| p95 latency | 100ms | >2000ms | Investigate |
| Error rate | <0.1% | <1% | Acceptable |
| Error rate | <0.1% | >5% | Critical |
| CPU | 20% | <80% | Good |
| CPU | 20% | >95% | Bottleneck |
| Memory | 2GB | <6GB | Good |
| Memory | 2GB | >8GB | Leak? |
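The thresholds in the table can be turned into an automated check that runs after each load test. The sketch below is one possible encoding, not a Datadog feature: `LoadTestThresholds` is a hypothetical class, the three-level verdict simplifies the table's labels, and the `WATCH` band covers values between the two thresholds, which the table leaves undefined.

```java
// Hypothetical helper: encodes the table's thresholds as simple checks.
class LoadTestThresholds {

    enum Verdict { GOOD, WATCH, INVESTIGATE }

    // Below goodBelow is GOOD, above badAbove is INVESTIGATE.
    // Anything in between is WATCH (an assumption; the table has no middle band).
    static Verdict classify(double value, double goodBelow, double badAbove) {
        if (value < goodBelow) return Verdict.GOOD;
        if (value > badAbove) return Verdict.INVESTIGATE;
        return Verdict.WATCH;
    }

    // Threshold pairs taken from the "Under Load" column of the table.
    static Verdict p95Latency(double ms)  { return classify(ms, 500, 2000); }
    static Verdict errorRate(double pct)  { return classify(pct, 1, 5); }
    static Verdict cpu(double pct)        { return classify(pct, 80, 95); }
    static Verdict memory(double gb)      { return classify(gb, 6, 8); }

    public static void main(String[] args) {
        System.out.println("p95 5000ms -> " + p95Latency(5000)); // INVESTIGATE
        System.out.println("errors 0.5% -> " + errorRate(0.5));  // GOOD
        System.out.println("CPU 90% -> " + cpu(90));             // WATCH
        System.out.println("memory 9GB -> " + memory(9));        // INVESTIGATE
    }
}
```

The threshold values themselves depend on the system under test; the ones here simply mirror the table.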
Key Takeaways
- Datadog provides context for Gatling metrics
- Tagging links tests to metrics
- Dashboards enable quick analysis
- Alerting catches regressions
- Traces reveal bottlenecks