# Circuit Breaker — Deep Dive

**Level:** Intermediate
**Pre-reading:** 06 · Resilience & Reliability
## What is a Circuit Breaker?
A circuit breaker prevents cascading failures by failing fast when a downstream service is unhealthy. Like an electrical circuit breaker, it "trips" when problems are detected.
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Failure threshold exceeded
    Open --> HalfOpen: Timeout elapsed
    HalfOpen --> Closed: Probe succeeds
    HalfOpen --> Open: Probe fails
```

## Circuit Breaker States
| State | Behavior | Transition |
|---|---|---|
| Closed | Requests flow through; failures counted | → Open when threshold exceeded |
| Open | Requests fail immediately; no calls made | → Half-Open after timeout |
| Half-Open | Limited probe requests sent | → Closed if success; → Open if fail |
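The table above can be sketched as a minimal state machine. This is an illustrative toy, not production code (all names here are ours): it trips on consecutive failures rather than a failure-rate window, and real libraries such as Resilience4j add sliding windows, slow-call tracking, and richer concurrency control.

```java
public class MiniCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final long openTimeoutMillis; // how long to stay open before probing
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    public MiniCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    /** Returns true if a call may proceed; moves Open -> Half-Open once the timeout elapses. */
    public synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN; // let a probe through
                return true;
            }
            return false; // fail fast, no call to the downstream service
        }
        return true; // CLOSED or HALF_OPEN
    }

    public synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED; // probe succeeded, or a normal closed-state success
    }

    public synchronized void recordFailure(long nowMillis) {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN; // trip (a failed probe re-opens immediately)
            openedAt = nowMillis;
            failures = 0;
        }
    }

    public synchronized State state() { return state; }
}
```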
## Configuration Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| Failure rate threshold | % of failures to trip circuit | 50% |
| Slow call rate threshold | % of slow calls to trip | 80% |
| Slow call duration | Definition of "slow" | 2 seconds |
| Minimum calls | Min calls before evaluating | 10 |
| Sliding window size | Time or count window | 100 calls or 10 seconds |
| Wait duration in open | Time before half-open | 60 seconds |
| Permitted calls in half-open | Probe requests | 10 |
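In a Spring Boot application these parameters are typically supplied via `application.yml` through the Resilience4j Spring Boot starter. A sketch using the table's typical values; the `paymentService` instance name is illustrative, and the property keys follow the starter's configuration conventions:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 2s
        waitDurationInOpenState: 60s
        permittedNumberOfCallsInHalfOpenState: 10
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 100
        minimumNumberOfCalls: 10
```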
## Resilience4j Implementation

### Configuration
```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Named CircuitBreakerConfiguration so it does not shadow Resilience4j's
// CircuitBreakerConfig class used inside the bean method.
@Configuration
public class CircuitBreakerConfiguration {

    @Bean
    public CircuitBreakerRegistry circuitBreakerRegistry() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slowCallRateThreshold(80)
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            .waitDurationInOpenState(Duration.ofSeconds(60))
            .permittedNumberOfCallsInHalfOpenState(10)
            .slidingWindowType(SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(100)
            .minimumNumberOfCalls(10)
            .build();
        return CircuitBreakerRegistry.of(config);
    }
}
```
### Usage with Annotation
```java
@Service
public class OrderService {

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResult processPayment(Order order) {
        return paymentClient.charge(order);
    }

    // The fallback must match the original signature, plus a trailing exception parameter.
    private PaymentResult paymentFallback(Order order, Exception e) {
        log.warn("Payment service unavailable, queuing order: {}", order.getId());
        paymentQueue.enqueue(order);
        return PaymentResult.pending();
    }
}
```
### Programmatic Usage
```java
CircuitBreaker circuitBreaker = registry.circuitBreaker("paymentService");

Supplier<PaymentResult> decorated = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentClient.charge(order));

// Try is Vavr's functional error-handling type; CallNotPermittedException
// is what Resilience4j throws when the circuit is open.
Try<PaymentResult> result = Try.ofSupplier(decorated)
    .recover(CallNotPermittedException.class, e -> PaymentResult.pending());
```
## Sliding Window Types

### Count-Based

Track the last N calls regardless of time.

### Time-Based

Track calls within a time window.
| Type | Use When |
|---|---|
| Count-based | Consistent traffic volume |
| Time-based | Variable traffic; want time-based evaluation |
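A count-based window can be sketched as a ring buffer over the last N outcomes. Everything here is illustrative, our own names rather than Resilience4j's API; the real implementation also aggregates call durations for slow-call detection.

```java
public class CountSlidingWindow {
    private final boolean[] outcomes; // true = failure
    private int index = 0;            // next slot to overwrite (oldest entry)
    private int recorded = 0;         // how many slots are filled so far

    public CountSlidingWindow(int size) {
        this.outcomes = new boolean[size];
    }

    public void record(boolean failed) {
        outcomes[index] = failed;               // overwrite the oldest outcome
        index = (index + 1) % outcomes.length;  // advance the ring pointer
        if (recorded < outcomes.length) recorded++;
    }

    /** Failure rate in percent over the window; -1 until any call is recorded. */
    public double failureRate() {
        if (recorded == 0) return -1;
        int failures = 0;
        for (int i = 0; i < recorded; i++) {
            if (outcomes[i]) failures++;
        }
        return 100.0 * failures / recorded;
    }
}
```

This is why count-based windows suit steady traffic: the window always represents exactly N calls, however long they took to arrive.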
## Circuit Breaker vs Retry
| Pattern | Purpose | When to Use |
|---|---|---|
| Retry | Recover from transient failures | Temporary issues (network blip) |
| Circuit Breaker | Prevent calling failing service | Sustained failures (service down) |
Use together: with Resilience4j's default aspect order, Retry is the outermost decorator, so each retry attempt passes through the circuit breaker and each failed attempt counts toward its failure rate.

```java
@CircuitBreaker(name = "paymentService")
@Retry(name = "paymentService")
public PaymentResult processPayment(Order order) {
    return paymentClient.charge(order);
}
```

Default order of execution (outermost first): Retry → Circuit Breaker → Rate Limiter → Bulkhead
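The nesting can be sketched without any library: the retry loop sits outermost and consults a stand-in for the circuit breaker before each attempt. All names here are illustrative; `allowed` and `recordFailure` stand in for a real breaker's permission check and failure bookkeeping.

```java
import java.util.function.Supplier;

public class RetryAroundBreaker {
    /**
     * Retry is the outermost layer: every attempt first asks the breaker
     * for permission, and every failure is recorded by it. Once the
     * breaker opens, remaining retries are skipped and the fallback runs.
     */
    public static <T> T callWithRetry(int maxAttempts,
                                      Supplier<Boolean> allowed,
                                      Runnable recordFailure,
                                      Supplier<T> call,
                                      Supplier<T> fallback) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (!allowed.get()) break;  // circuit open: stop retrying, fail fast
            try {
                return call.get();
            } catch (RuntimeException e) {
                recordFailure.run();    // each failed attempt counts toward tripping
            }
        }
        return fallback.get();
    }
}
```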
## Monitoring

### Metrics (Prometheus)
```text
# Circuit breaker state
resilience4j_circuitbreaker_state{name="paymentService"} 0  # 0=closed, 1=open, 2=half-open

# Failure rate
resilience4j_circuitbreaker_failure_rate{name="paymentService"} 5.5

# Calls
resilience4j_circuitbreaker_calls_total{name="paymentService", kind="successful"} 1000
resilience4j_circuitbreaker_calls_total{name="paymentService", kind="failed"} 50
resilience4j_circuitbreaker_calls_total{name="paymentService", kind="not_permitted"} 200
```
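The state gauge above can drive an alert. A sketch of a Prometheus alerting rule, assuming the 0/1/2 state encoding shown above; the group name, duration, and labels are illustrative:

```yaml
groups:
  - name: circuit-breaker-alerts
    rules:
      - alert: CircuitBreakerOpen
        # 1 = open in the encoding shown above
        expr: resilience4j_circuitbreaker_state{name="paymentService"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} has been open for 1 minute"
```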
### Events
```java
circuitBreaker.getEventPublisher()
    .onStateTransition(event ->
        log.info("Circuit breaker {} transitioned from {} to {}",
            event.getCircuitBreakerName(),
            event.getStateTransition().getFromState(),
            event.getStateTransition().getToState()));
```
## Fallback Strategies
| Strategy | Example |
|---|---|
| Default value | Return empty list, default price |
| Cached response | Return last known good response |
| Graceful degradation | Skip non-critical feature |
| Queue for later | Save request for retry |
| Alternative service | Call backup service |
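The "cached response" row can be sketched as a small wrapper that remembers the last good value. The class and method names are illustrative, not from any library.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class CachedFallback<T> {
    private final AtomicReference<T> lastGood = new AtomicReference<>();
    private final T defaultValue; // served before any success has been cached

    public CachedFallback(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    public T call(Supplier<T> primary) {
        try {
            T value = primary.get();
            lastGood.set(value); // remember the last known good response
            return value;
        } catch (RuntimeException e) {
            T cached = lastGood.get();
            return cached != null ? cached : defaultValue;
        }
    }
}
```

Note the cold-start case: until the first success, only the default value is available, which is why "cached response" is often combined with the "default value" strategy.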
### Fallback Example

```java
@CircuitBreaker(name = "recommendations", fallbackMethod = "recommendationsFallback")
public List<Product> getRecommendations(String userId) {
    return recommendationService.getPersonalized(userId);
}

// Fallback returns cached popular products
private List<Product> recommendationsFallback(String userId, Exception e) {
    return cache.get("popular-products", () -> productService.getPopular());
}
```
## Circuit Breaker Patterns

### Per-Service Circuit Breaker

One circuit breaker per downstream service.

```java
@CircuitBreaker(name = "paymentService")
public PaymentResult charge(Order order) { ... }

@CircuitBreaker(name = "inventoryService")
public boolean reserve(Order order) { ... }
```
### Per-Operation Circuit Breaker

Different thresholds for different operations.

```java
CircuitBreakerConfig criticalConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(25)                        // more sensitive
    .waitDurationInOpenState(Duration.ofSeconds(30)) // probes sooner
    .build();

CircuitBreakerConfig normalConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(60))
    .build();
```
### Service Mesh Circuit Breaking

Istio/Envoy provides circuit breaking at the infrastructure level.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
| Aspect | Application Level | Service Mesh |
|---|---|---|
| Language | Requires library | Language agnostic |
| Granularity | Per method/operation | Per service |
| Fallback logic | Yes | No (just fail) |
| Configuration | In code/config | Kubernetes resources |
## Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Threshold too high | Circuit never trips | Lower to 50% or less |
| Threshold too low | False positives | Increase minimum calls |
| No fallback | Error propagates | Always provide fallback |
| Same config for all | Suboptimal behavior | Tune per service |
| Not monitoring | Don't know state | Add metrics and alerts |
### What's the difference between circuit breaker and retry?

Retry handles transient failures by repeating the request. Circuit breaker handles sustained failures by failing fast and not calling the unhealthy service. Use retry for temporary issues; circuit breaker prevents overwhelming a struggling service. Use both together: retry wraps the circuit-breaker-protected call, so each attempt is checked against, and counted by, the breaker.
### How do you tune circuit breaker thresholds?
Start with defaults (50% failure rate, 60s open duration). Monitor in production. If false positives: increase minimum calls or threshold. If slow to trip: decrease threshold. If too long in open: decrease wait duration. Tune per service based on its SLA and failure patterns.
### Should you use application-level or service mesh circuit breakers?
Use both. Service mesh (Istio) provides baseline protection without code changes and works across languages. Application-level (Resilience4j) provides fallback logic and fine-grained control. Mesh catches infrastructure failures; application handles business logic degradation.