Service Communication — Microservices Interview
Target: Senior Engineer · Engineering Lead · Pre-Architect
Focus: gRPC vs REST, reliability patterns, API versioning, service mesh
Q: How do you ensure reliable inter-service communication with retries and timeouts?
Why interviewers ask this: Network failures are inevitable. Tests understanding of timeout strategies, exponential backoff, and when retries are safe.
Answer
The problem: Networks are unreliable. If a request fans out to 10 services that are each 99.9% reliable, the probability that all 10 calls succeed is 0.999^10 ≈ 99%. And every retry adds load to an already struggling service, so careless retries risk cascading failures.
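The compounding math can be checked directly (a quick sketch; 99.9% is the per-call success rate assumed above):

```java
public class ReliabilityMath {
    public static void main(String[] args) {
        double perCall = 0.999;   // 99.9% success rate per downstream call
        int dependencies = 10;    // number of calls that must all succeed
        double overall = Math.pow(perCall, dependencies);
        System.out.printf("Overall success rate: %.3f%n", overall); // ~0.990
    }
}
```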
Solution hierarchy:
Failure handling order:
1. Timeout — don't wait forever (2-5 sec typical)
2. Detect failure → fast-fail
3. Retry with backoff (exponential: 100ms → 200ms → 400ms)
4. Circuit breaker — stop retrying if service is down
5. Fallback — return degraded response
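Steps 1–3 can be sketched without any framework; a minimal hand-rolled version (names like `callWithRetry` and `baseDelayMs` are illustrative, and production code should prefer a library such as Resilience4j):

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Retries `task` up to maxAttempts, doubling the wait each time
    // (100ms -> 200ms -> 400ms for baseDelayMs = 100).
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();                       // attempt the call; detect failure fast
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    long wait = baseDelayMs << (attempt - 1); // exponential backoff
                    Thread.sleep(wait);
                }
            }
        }
        throw last;                                       // caller applies the fallback
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 3, 100);
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```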
Spring Boot example:
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;

@Service
public class PaymentClient {

    private final WebClient webClient;

    public PaymentClient(WebClient webClient) {
        this.webClient = webClient;
    }

    @Retry(name = "payment", fallbackMethod = "paymentFallback")
    @CircuitBreaker(name = "payment")
    @TimeLimiter(name = "payment")
    public CompletableFuture<PaymentResponse> charge(PaymentRequest req) {
        return CompletableFuture.supplyAsync(() ->
            webClient.post()
                .uri("http://payment-service/charge")
                .bodyValue(req)
                .retrieve()
                .bodyToMono(PaymentResponse.class)
                .timeout(Duration.ofSeconds(2))
                .block()
        );
    }

    // Fallback signature must match the original method plus the triggering exception.
    public CompletableFuture<PaymentResponse> paymentFallback(
            PaymentRequest req, Exception ex) {
        // Return degraded response
        return CompletableFuture.completedFuture(
            PaymentResponse.pending(req.orderId)
        );
    }
}
Configuration:
resilience4j:
  retry:
    instances:
      payment:
        max-attempts: 3
        wait-duration: 100ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        retry-exceptions:
          - java.net.ConnectException
          - java.io.IOException
  timelimiter:
    instances:
      payment:
        timeout-duration: 2s
Backoff strategy:
Attempt 1: immediate
Attempt 2: wait 100ms
Attempt 3: wait 200ms
Total max wait: 300ms before failing
Common Mistake
Retry on any exception = disaster. SocketTimeoutException → safe to retry. PaymentAlreadyProcessedException → fail immediately. Design idempotent operations before enabling retries on mutations.
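One way to make that rule explicit in code is a classification predicate; a minimal sketch (`PaymentAlreadyProcessedException` is a hypothetical domain exception, as above):

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class RetryPolicy {
    // Hypothetical domain exception: the charge already went through.
    static class PaymentAlreadyProcessedException extends RuntimeException {}

    // Only transient, side-effect-free failures are safe to retry.
    static boolean isRetryable(Throwable t) {
        if (t instanceof PaymentAlreadyProcessedException) {
            return false;                     // retrying would double-charge
        }
        return t instanceof SocketTimeoutException
            || t instanceof ConnectException; // transient network faults
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(new SocketTimeoutException()));           // true
        System.out.println(isRetryable(new PaymentAlreadyProcessedException())); // false
    }
}
```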
Q: When should you use gRPC vs REST for inter-service communication?
Why interviewers ask this: Tech choice has cascading implications for performance, debugging, and team expertise. Tests architectural trade-off thinking.
Answer
| Criteria | REST | gRPC |
|---|---|---|
| Serialization | JSON (text) | Protocol Buffers (binary) |
| Size | Large (~200 bytes) | Small (~50 bytes) — 4x smaller |
| Speed | Slower parsing | Fast binary parsing |
| Protocol | HTTP/1.1 (one request at a time) | HTTP/2 (multiplexed) |
| Streaming | No native support | Bidirectional streaming |
| Debugging | Easy (curl, browser) | Harder (need grpcurl) |
| Ecosystem | Mature, widely supported | Growing but less mature |
| Browser clients | ✅ Native | ❌ Requires gRPC-Web proxy |
| Use case | Public APIs, third-party clients | Internal service-to-service |
Performance comparison (illustrative numbers; real payload sizes and latencies depend on the schema and deployment):
Service A → Service B (processing 10,000 requests)
REST:
- Request size: 200 bytes × 10k = 2 MB
- Response size: 300 bytes × 10k = 3 MB
- Total: 5 MB over network
- Latency: ~50 ms per request (JSON parsing)
gRPC:
- Request size: 50 bytes × 10k = 500 KB
- Response size: 75 bytes × 10k = 750 KB
- Total: 1.25 MB over network (-75%)
- Latency: ~5 ms per request (binary parsing, HTTP/2)
- 10x faster per request
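The payload arithmetic above can be verified directly (a sketch using the same illustrative per-message sizes):

```java
public class PayloadMath {
    // Total bytes over the wire for `requests` round trips.
    static long totalBytes(int perRequest, int perResponse, int requests) {
        return (perRequest + perResponse) * (long) requests;
    }

    public static void main(String[] args) {
        long rest = totalBytes(200, 300, 10_000);  // 5,000,000 bytes = 5 MB
        long grpc = totalBytes(50, 75, 10_000);    // 1,250,000 bytes = 1.25 MB
        System.out.printf("REST %d B, gRPC %d B, %.0f%% less%n",
                rest, grpc, 100.0 * (rest - grpc) / rest); // 75% less
    }
}
```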
gRPC service definition:
syntax = "proto3";

service OrderService {
  rpc GetOrder(OrderId) returns (Order);
  rpc CreateOrder(OrderRequest) returns (OrderResponse);

  // Bidirectional streaming
  rpc ProcessOrders(stream Order) returns (stream OrderStatus);
}

message OrderId {
  string id = 1;
}

message Order {
  string id = 1;
  int64 amount = 2;
  string status = 3;
}

// OrderRequest, OrderResponse, OrderStatus omitted for brevity
gRPC + Spring Boot:
import io.grpc.stub.StreamObserver;
import net.devh.boot.grpc.server.service.GrpcService;

@GrpcService
public class OrderServiceImpl extends OrderServiceGrpc.OrderServiceImplBase {

    @Override
    public void getOrder(OrderId request,
                         StreamObserver<Order> responseObserver) {
        Order order = orderService.findById(request.getId());
        responseObserver.onNext(order);
        responseObserver.onCompleted();
    }
}
Recommendation:
- Use REST — public APIs, third-party clients, simplicity required
- Use gRPC — internal service-to-service (10+ services), high throughput, streaming needs
- Hybrid — REST at the edge (API Gateway) → gRPC internally
Q: How do you implement API versioning without breaking clients?
Answer
Three versioning strategies:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URL path | /v1/orders, /v2/orders | Explicit, works with proxies | Duplicate code, routing complexity |
| Header | Accept: application/json; version=2 | Single codebase, clean URLs | Less discoverable, cache issues |
| Query param | /orders?api_version=2 | Flexible, easy to test | Cache-key issues, non-standard |
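If you support the header strategy, version resolution is a small parsing step; a minimal sketch (the default-to-v1 policy and the `version=` media-type parameter follow the table's example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionNegotiation {
    private static final Pattern VERSION = Pattern.compile("version=(\\d+)");

    // Extracts the API version from an Accept header; unversioned clients stay on v1.
    static int resolveVersion(String acceptHeader) {
        if (acceptHeader != null) {
            Matcher m = VERSION.matcher(acceptHeader);
            if (m.find()) return Integer.parseInt(m.group(1));
        }
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(resolveVersion("application/json; version=2")); // 2
        System.out.println(resolveVersion("application/json"));            // 1
    }
}
```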
Best practice — URL-path versioning, one controller method per version:
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api")
public class OrderController {

    @GetMapping(value = "/v1/orders/{id}", produces = "application/json")
    public OrderV1 getOrderV1(@PathVariable String id) {
        Order order = orderService.findById(id);
        // Map internal model to V1 (no new fields)
        return OrderV1.from(order);
    }

    @GetMapping(value = "/v2/orders/{id}", produces = "application/json")
    public OrderV2 getOrderV2(@PathVariable String id) {
        Order order = orderService.findById(id);
        // Map to V2 (includes new fields, e.g., "estimatedDelivery")
        return OrderV2.from(order);
    }
}
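The `OrderV1.from`/`OrderV2.from` mappers can be sketched in plain Java; the internal `Order` shape below is an assumption, kept to the fields the example needs:

```java
public class VersionedDtos {
    // Internal model (assumed shape for this sketch).
    record Order(String id, long amount, String status, String estimatedDelivery) {}

    // V1 keeps the original contract: new internal fields are simply not mapped.
    record OrderV1(String id, long amount, String status) {
        static OrderV1 from(Order o) {
            return new OrderV1(o.id(), o.amount(), o.status());
        }
    }

    // V2 adds estimatedDelivery without touching V1's contract.
    record OrderV2(String id, long amount, String status, String estimatedDelivery) {
        static OrderV2 from(Order o) {
            return new OrderV2(o.id(), o.amount(), o.status(), o.estimatedDelivery());
        }
    }

    public static void main(String[] args) {
        Order o = new Order("o-1", 4200, "SHIPPED", "2025-03-01");
        System.out.println(OrderV1.from(o)); // no estimatedDelivery field
        System.out.println(OrderV2.from(o));
    }
}
```

Old clients keep receiving exactly the V1 shape while V2 clients see the new field, which is what makes "add fields, never remove" workable.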
API evolution best practices:
- Add fields, never remove — Old clients ignore unknown fields
- Default old fields — Always include fields from V1 in V2
- Deprecation timeline — Announce 6–12 months before retiring API version
- Semantic versioning — MAJOR.MINOR.PATCH (v1, v2, v1.1)
Deprecation example:
2024-01: Release /v2/orders (new field: tracking_id)
2025-01: Announce deprecation of /v1/orders
2025-07: Retire /v1/orders (clients must migrate)
Q: Should you implement a service mesh (Istio, Linkerd)?
Why interviewers ask this: Service mesh is a major operational investment. Tests cost/benefit thinking and maturity assessment.
Answer
Service mesh = sidecar proxies + a control plane that handle:
- Retries, circuit breaking, timeouts (without code changes)
- Load balancing, traffic splitting (canary deployments)
- mTLS encryption between all services
- Distributed tracing, observability
Decision matrix:
| Organization Stage | Use Service Mesh? | Why |
|---|---|---|
| < 5 microservices | ❌ No | Overkill — use libraries (Resilience4j) |
| 5-20 microservices | ⚠️ Maybe | Only if you have Kubernetes + experienced ops team |
| 20-50 microservices | ✅ Yes | Centralized policy enforcement pays off |
| 50+ microservices | ✅ Definitely | Manual resilience in each service becomes unmaintainable |
Tradeoffs:
| Aspect | Pro | Con |
|---|---|---|
| Resilience | Automatic circuit breakers, retries in proxy | Added complexity, learning curve |
| Observability | Automatic tracing, metrics without code changes | More infrastructure to operate |
| Overhead | Centralized policy, no library duplication | ~10% latency penalty, memory per pod |
| Debugging | Flow is visible in mesh | More tools to learn (istioctl) |
Recommendation:
- Start with Resilience4j libraries (simpler, less overhead)
- Migrate to a service mesh when you have 15+ services AND a dedicated platform team
- Use a managed service mesh (AWS App Mesh, Google Anthos) to reduce operational burden
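In a mesh, the retry/timeout policy that the Resilience4j config expressed in application YAML moves into mesh configuration instead; a hedged Istio sketch (service name and values are illustrative, mirroring the payment example above):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
      timeout: 2s                        # mirrors the TimeLimiter setting
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: connect-failure,5xx     # transient faults only
```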
Diagram — Complete Service Communication Architecture
graph LR
Client["Client"]
Gateway["API Gateway\n· REST /v2/\n· Rate limit"]
SvcA["Service A"]
SvcB["Service B"]
Mesh["Service Mesh · Istio\n· mTLS\n· Circuit breaker\n· Tracing"]
Client -->|REST\nHTTP/1.1| Gateway
Gateway -->|gRPC\nHTTP/2| SvcA
SvcA -->|gRPC\nHTTP/2| SvcB
Mesh -.->|Sidecar proxy\nretry + circuit breaker| SvcA
Mesh -.->|Sidecar proxy\nretry + circuit breaker| SvcB
style Gateway fill:#4ecdc4
style SvcA fill:#51cf66
style SvcB fill:#ffe066
style Mesh fill:#9b59b6