Microservices & Spring Interview Questions
Level: Intermediate to Advanced
Resilience & Fault Tolerance — Microservices Interview
Target: Senior Engineer · Engineering Lead · Pre-Architect
Focus: Circuit breakers, bulkheads, fallbacks, cascading failures, resilience patterns
Q: A microservice becomes slow and starts impacting all dependent services. How do you prevent cascading failures?
Why interviewers ask this: Cascading failures are one of the most costly production incidents. Tests understanding of failure isolation and early detection.
Answer
When Service A gets slow, all of its clients wait longer → their timeouts expire → they retry → the retries add load → Service A gets even slower → the failure cascades across the entire system.
Prevention strategy:
graph LR
Client["Client"]
Circuit["Circuit Breaker\nDetect slow"]
SlowSvc["Slow Service A"]
Fallback["Fallback\nCached/Degraded"]
Bulkhead["Bulkhead\nThread Pool"]
Client -->|1 · Request| Circuit
Circuit -->|2 · Detect 50% failure| Circuit
Circuit -->|3a · Trip open| Fallback
Circuit -->|3b · Before trip| SlowSvc
SlowSvc -.->|Slow 500ms| SlowSvc
Fallback -->|Return stale cache| Client
Bulkhead -->|Limit threads · 10| SlowSvc
Multi-layer defense:
| Layer | Tool | Mechanism |
|---|---|---|
| Circuit Breaker | Resilience4j, Hystrix (legacy) | Detect slow service, fail fast, prevent retry storms |
| Timeout | Spring RestTemplate / WebClient | Don't wait forever; abort the request after N ms |
| Bulkhead | Thread pool isolation | Limit threads per downstream svc — prevent pool exhaustion |
| Rate Limiter | Resilience4j | Cap requests to failing service, protect its recovery |
| Fallback | Custom logic | Return cached/stale data instead of error |
Spring Boot implementation:
@Service
public class OrderClient {

    @CircuitBreaker(name = "orderService", fallbackMethod = "orderFallback")
    @Retry(name = "orderService", fallbackMethod = "orderFallback")
    @Bulkhead(name = "orderService", type = Bulkhead.Type.THREADPOOL) // thread-pool isolation
    @TimeLimiter(name = "orderService")
    public CompletableFuture<Order> getOrder(String orderId) {
        // @TimeLimiter requires a CompletionStage return type
        return CompletableFuture.supplyAsync(() ->
            webClient.get()
                .uri("/orders/{id}", orderId)
                .retrieve()
                .bodyToMono(Order.class)
                .block() // blocking for simplicity
        );
    }

    public CompletableFuture<Order> orderFallback(String orderId, Exception e) {
        log.warn("OrderService slow/down, returning cached order", e);
        return CompletableFuture.completedFuture(orderCache.get(orderId));
    }
}
Configuration:
resilience4j:
  circuitbreaker:
    instances:
      orderService:
        slidingWindowSize: 10           # Watch last 10 calls
        failureRateThreshold: 50        # Open if 50% fail
        waitDurationInOpenState: 10s    # Wait before half-open
  timelimiter:
    instances:
      orderService:
        timeoutDuration: 2s             # Kill after 2 sec
  thread-pool-bulkhead:                 # maxThreadPoolSize belongs to the thread-pool bulkhead
    instances:
      orderService:
        maxThreadPoolSize: 10           # Max 10 parallel threads
Architect Insight
A circuit breaker isn't about preventing failures; it's about detecting them early and failing fast before they cascade. A 500 ms timeout that keeps callers healthy is better than a 30-second timeout that starves the entire system of threads.
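To make the fail-fast timeout concrete, here is a minimal sketch of a WebClient with a tight response budget, assuming Reactor Netty as the underlying client (the 200 ms / 500 ms values and the base URL are illustrative):

import io.netty.channel.ChannelOption;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import java.time.Duration;

@Configuration
public class OrderClientConfig {

    @Bean
    public WebClient orderWebClient() {
        // Fail fast: a slow downstream releases our threads after 500 ms
        // instead of holding them for the default (much longer) timeout.
        HttpClient httpClient = HttpClient.create()
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 200) // connect budget
                .responseTimeout(Duration.ofMillis(500));          // full-response budget
        return WebClient.builder()
                .baseUrl("http://order-service")
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}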
Q: How do you design fallback mechanisms for high-latency dependencies?
Answer
A fallback is a backup behavior when the primary operation fails or times out. Design fallbacks as a hierarchy:
Fallback hierarchy (try in order):
1. Cached/stale data (best UX — might be slightly outdated)
2. Partial results (reduced functionality, not zero value)
3. Default value (safe but degraded)
4. Error to user (last resort)
Examples:
| Service | Fallback Strategy |
|---|---|
| Product Catalog | Serve stale cached products (from Redis) |
| Recommendation | Return empty list (user sees less personalization) |
| Payment | Queue for async processing, notify user of delay |
| User Profile | Use cached/default values, disable personalized features |
| Search | Fallback to simple DB query (slower but works) |
Implementation:
public ProductInfo getProductWithFallback(String productId) {
    try {
        return catalogServiceClient.getProduct(productId); // 500ms timeout
    } catch (TimeoutException e) {
        // Try cache first
        Optional<ProductInfo> cached = cache.get(productId);
        if (cached.isPresent()) {
            log.warn("Catalog timeout, serving cached product", e);
            return cached.get();
        }
        // Cache miss: return degraded response
        log.warn("Catalog timeout and cache miss, returning defaults");
        return ProductInfo.degraded(productId);
    }
}
Common Mistake
Don't use fallbacks for all errors indiscriminately. Fallback to stale product data = OK. Fallback to stale payment status = TERRIBLE (wrong balance). Know which failures can safely degrade.
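To make the distinction concrete, a short sketch (the client and exception names are hypothetical): catalog reads may degrade to stale data, but payment status must not, so that path propagates the failure instead:

// Hypothetical names for illustration; the point is the asymmetry.
public PaymentStatus getPaymentStatus(String paymentId) {
    try {
        return paymentClient.getStatus(paymentId);
    } catch (TimeoutException e) {
        // No safe fallback: a stale status could show a wrong balance.
        // Fail loudly and let the caller retry or poll later.
        throw new PaymentStatusUnavailableException(paymentId, e);
    }
}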
Q: A downstream service becomes unavailable frequently. How do you ensure resilience?
Answer
If a service fails often, you need health-aware routing + automatic failover + messaging-based decoupling.
Strategies:
Unavailable service?
├─ Active-active (multiple instances)
│ ├─ Health checks → route away from failing instance
│ └─ Load balancer removes unhealthy from pool
├─ Read replicas (read-only)
│ ├─ Use read-only replica if primary down
│ └─ Trade: eventual consistency
├─ Async + queuing
│ ├─ Don't call synchronously
│ ├─ Queue work, process when service is back
│ └─ Service becomes optional for happy path
└─ Multi-region failover
├─ Primary region down → fail to secondary
└─ Trade: latency, cost, complexity
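For the async + queuing branch above, a minimal sketch using Spring AMQP (the exchange, queue, and the OrderPlacedEvent/InventoryService types are assumptions):

// Producer: enqueue instead of calling the flaky service synchronously.
@Service
public class OrderEventPublisher {
    private final RabbitTemplate rabbitTemplate;

    public OrderEventPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    public void publish(OrderPlacedEvent event) {
        // The broker buffers the work while inventory-service is down.
        rabbitTemplate.convertAndSend("orders.exchange", "orders.placed", event);
    }
}

// Consumer: drains the queue whenever the service is healthy again.
@Component
public class OrderPlacedListener {
    private final InventoryService inventoryService;

    public OrderPlacedListener(InventoryService inventoryService) {
        this.inventoryService = inventoryService;
    }

    @RabbitListener(queues = "orders.placed.queue")
    public void onOrderPlaced(OrderPlacedEvent event) {
        inventoryService.reserveStock(event);
    }
}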
Kubernetes example — health checks:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inventory-service
spec:
  replicas: 3                        # Multiple instances
  template:
    spec:
      containers:
        - name: inventory
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
            failureThreshold: 3
            periodSeconds: 10
            # After 3 failures = restart this pod
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            failureThreshold: 2
            periodSeconds: 5
            # After 2 failures = remove from load balancer
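On the application side, Spring Boot 2.3+ can expose matching probe endpoints via Actuator; a minimal sketch (note Actuator serves them under /actuator/health/liveness and /actuator/health/readiness by default, so either point the probes there or remap the paths):

management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness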
Q: How do you design for high availability when dependencies fail?
Answer
Design the service to degrade gracefully, not fail completely:
graph TD
Request["Incoming Request"]
Happy["Happy Path\n· All dependencies up"]
Degraded["Degraded Mode\n· Some deps down"]
Request --> Happy
Request --> Degraded
Happy -->|Cache hit| FastPath["Return immediately\n· < 50ms"]
Happy -->|DB + Cache + External API| NormalPath["Normal path\n· 200-500ms"]
Degraded -->|Cache + local replica| DegradedPath["Reduced functionality\n· Read-only mode"]
Degraded -->|Return defaults| MinPath["Minimal mode\n· Core features only"]
style FastPath fill:#51cf66
style NormalPath fill:#4ecdc4
style DegradedPath fill:#ffe066
style MinPath fill:#ff9999
Key tactics:
- Local cache — keep recent data locally; use if upstream is down
- Read replicas — maintain a local copy of critical data
- Async queuing — don't block on slow operations
- Feature flags — disable non-critical features when downstream is down
- Bulkheads — isolation ensures one slow service doesn't block others
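A minimal sketch of the local-cache tactic, assuming Caffeine (the ProfileClient, TTL, and cache size are illustrative):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public class ProfileProvider {
    // Last-known-good profiles; staleness is acceptable when upstream is down.
    private final Cache<String, UserProfile> localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofMinutes(10))
            .build();

    private final ProfileClient profileClient; // hypothetical upstream client

    public ProfileProvider(ProfileClient profileClient) {
        this.profileClient = profileClient;
    }

    public UserProfile getProfile(String userId) {
        try {
            UserProfile fresh = profileClient.fetch(userId);
            localCache.put(userId, fresh);
            return fresh;
        } catch (Exception e) {
            // Degrade: serve the last known value, or safe defaults if unseen.
            UserProfile stale = localCache.getIfPresent(userId);
            return stale != null ? stale : UserProfile.defaults(userId);
        }
    }
}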
Q: How do you ensure idempotency across distributed calls?
Answer
Idempotency = calling the same operation multiple times produces the same result as calling it once. Critical for safe retries in distributed systems.
Three levels:
| Level | Implementation | Example |
|---|---|---|
| Client | Generate UUID for each logical operation | Idempotency-Key: 550e8400... header |
| Network | Retry framework transparently resends | Resilience4j, gRPC built-in retries |
| Server | Detect duplicates, return cached result | Check idempotency key in DB before processing |
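At the client level, generate one key per logical operation and resend the same key on every retry; a small sketch using WebClient:

String idempotencyKey = UUID.randomUUID().toString(); // one key per logical operation

PaymentResponse response = webClient.post()
        .uri("/payments")
        .header("Idempotency-Key", idempotencyKey)     // reuse the SAME key on retries
        .bodyValue(paymentRequest)
        .retrieve()
        .bodyToMono(PaymentResponse.class)
        .block();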
Flow:
Client sends: POST /payments?Idempotency-Key=UUID-1
Server processes → stores result → returns response
Network timeout → Client retries same request
Server: "Key UUID-1 already processed" → returns cached result
Result: Payment processed once, not twice ✓
Spring implementation:
@PostMapping("/payments")
public PaymentResponse processPayment(
@RequestBody PaymentRequest req,
@RequestHeader("Idempotency-Key") String key) {
// Step 1: Check if already processed
Optional<PaymentResponse> cached = idempotencyStore.get(key);
if (cached.isPresent()) {
return cached.get();
}
// Step 2: Process and store result atomically
PaymentResponse response = paymentGateway.charge(req.amount);
idempotencyStore.put(key, response);
return response;
}
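To close the race noted above, reserve the key atomically before charging. A sketch assuming Postgres, Spring's JdbcTemplate, and a hypothetical idempotency_keys table with a unique key column:

// Returns true only for the first request carrying this key;
// the unique constraint makes the reservation atomic.
private boolean reserveKey(String key) {
    int inserted = jdbcTemplate.update(
        "INSERT INTO idempotency_keys (key, status) VALUES (?, 'PENDING') " +
        "ON CONFLICT (key) DO NOTHING",
        key);
    return inserted == 1; // 0 means another request already holds the key
}

A request that loses the race can poll for the stored result, or return 409 Conflict and let the client retry.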
Q: How do you handle partial failures gracefully?
Answer
A batch operation might succeed partially (3 of 5 items process, 2 fail). Design to handle this:
API Design:
POST /orders/batch
[
{id: 1, qty: 5},
{id: 2, qty: 3},
{id: 3, qty: 7}
]
Response: 207 Multi-Status
{
"succeeded": [
{id: 1, orderId: "ORD-123"}
],
"failed": [
{id: 2, reason: "Out of stock"},
{id: 3, reason: "Invalid quantity"}
]
}
Implementation:
@PostMapping("/orders/batch")
public ResponseEntity<BatchResponse> createOrdersBatch(@RequestBody List<OrderRequest> requests) {
BatchResponse response = new BatchResponse();
for (OrderRequest req : requests) {
try {
Order order = orderService.createOrder(req);
response.addSuccess(order);
} catch (OutOfStockException e) {
response.addFailure(req.id, e.getMessage());
}
}
return ResponseEntity
.status(response.hasFailures() ? HttpStatus.MULTI_STATUS : HttpStatus.OK)
.body(response);
}
Diagram — Complete Resilience Architecture
graph LR
Client["Client"]
Gateway["API Gateway\n· Timeout: 5s\n· Rate limit: 100/sec"]
ServiceA["Service A\n· Circuit Breaker\n· Timeout: 2s"]
ServiceB["Service B\n· Bulkhead: 10 threads\n· Fallback: cache"]
Database["Database\n· Read replicas\n· Local cache"]
Queue["Message Queue\n· Async processing"]
Client -->|HTTP| Gateway
Gateway -->|Sync call| ServiceA
Gateway -->|Sync call| ServiceB
ServiceB -->|Query| Database
ServiceA -->|Async| Queue
Database -->|Local cache| ServiceB
style Gateway fill:#4ecdc4
style ServiceA fill:#51cf66
style ServiceB fill:#ffe066
style Queue fill:#ff9999