
Performance & Scalability — Microservices Interview

Target: Senior Engineer · Engineering Lead · Pre-Architect
Focus: Bottleneck diagnosis, auto-scaling, latency optimization, caching


Q: Your system shows high latency only during peak hours. How do you identify the bottleneck?

Why interviewers ask this: Production latency issues are complex. Tests your ability to methodically diagnose across layers (network, service, database, JVM).

Answer

Diagnosis pyramid (test from top down):

Network latency (1-10ms)?
  ↓ DNS, TCP handshake, TLS
Service latency (10-100ms)?
  ↓ Handler logic, serialization
Database latency (50-500ms)?
  ↓ Query time, locks, I/O
JVM overhead (5-50ms)?
  ↓ GC pauses, thread contention

Tools & metrics:

Layer       | Tool                                             | Metric
------------|--------------------------------------------------|----------------------------------
End-to-end  | Distributed tracing (Jaeger)                     | p50, p95, p99 latency per service
Database    | Slow query log, EXPLAIN PLAN                     | Query time, lock waits
JVM         | -Xlog:gc* (JDK 9+; -XX:+PrintGCDetails on JDK 8) | GC pause duration, frequency
System      | top, iostat, netstat                             | CPU, memory, disk I/O, network
Thread pool | Spring Boot Actuator                             | Active threads, queue depth

Spring Boot diagnostic code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class DiagnosticController {

    private static final Logger log = LoggerFactory.getLogger(DiagnosticController.class);

    private final OrderRepository orderRepository;

    public DiagnosticController(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @GetMapping("/api/orders/{id}")
    public Order getOrder(@PathVariable String id) {
        long start = System.nanoTime();
        try {
            // Time the database call separately from the handler total
            long dbStart = System.nanoTime();
            Order order = orderRepository.findById(id).orElseThrow();
            log.info("Order lookup: {}ms", (System.nanoTime() - dbStart) / 1_000_000);
            return order;
        } finally {
            log.info("Total latency: {}ms", (System.nanoTime() - start) / 1_000_000);
        }
    }
}
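
Hand-rolled timing like this is fine for one-off debugging; for ongoing monitoring the same measurement is better recorded as a histogram so p50/p95/p99 can be aggregated across instances. A minimal Micrometer sketch (Micrometer ships with Spring Boot Actuator; the meter name orders.db.lookup and the service class are illustrative):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderLookupService {

    private final Timer dbTimer;
    private final OrderRepository orderRepository;

    public OrderLookupService(MeterRegistry registry, OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
        this.dbTimer = Timer.builder("orders.db.lookup")
                .publishPercentiles(0.5, 0.95, 0.99)  // p50, p95, p99 as in the table above
                .register(registry);
    }

    public Order getOrder(String id) {
        // Every lookup's latency lands in the timer's histogram
        return dbTimer.record(() -> orderRepository.findById(id).orElseThrow());
    }
}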

Peak hours diagnosis checklist:

  • [ ] Distributed trace shows which service is slow
  • [ ] Database slow query log identifies problematic queries
  • [ ] jstat -gc shows if GC pauses spike during load
  • [ ] Thread pool metrics show saturation (active threads at max, queue depth growing)
  • [ ] Network latency within expected range (< 50ms)
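
To make the thread-pool check concrete, a hedged sketch that exposes pool saturation through Micrometer gauges (assumes a Spring Boot app with Actuator and a hypothetical ThreadPoolExecutor bean named workerPool):

import java.util.concurrent.ThreadPoolExecutor;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.binder.MeterBinder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ThreadPoolMetricsConfig {

    @Bean
    public MeterBinder workerPoolMetrics(ThreadPoolExecutor workerPool) {
        return registry -> {
            // The two saturation signals from the checklist above
            Gauge.builder("worker.pool.active", workerPool, ThreadPoolExecutor::getActiveCount)
                    .register(registry);
            Gauge.builder("worker.pool.queue.depth", workerPool, p -> p.getQueue().size())
                    .register(registry);
        };
    }
}

Once registered, both gauges are visible under /actuator/metrics for per-instance comparison.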

Q: You notice uneven load distribution across instances. What could be wrong?

Answer

Load balancer issues:

Problem                       | Sign                             | Fix
------------------------------|----------------------------------|---------------------------------------------------------------
Sticky sessions misconfigured | Some instances get 80% traffic   | Remove session affinity or use a shared session store (Redis)
Health check failing          | Healthy instance marked down     | Verify the /health endpoint is working
Round-robin only              | No awareness of instance load    | Switch to least-connections or a weighted algorithm
DNS caching                   | Requests go to old instances     | Reduce DNS TTL, use service discovery
Colocation                    | Instances on same physical host  | Check infrastructure layout, spread replicas
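
For the sticky-session fix in the first row, a minimal Spring Session sketch (assumes spring-session-data-redis on the classpath; host and port are placeholders) that moves session state into Redis so affinity can be disabled:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

@Configuration
@EnableRedisHttpSession  // stores HttpSession in Redis instead of instance memory
public class SessionConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // Placeholder host/port; with sessions in Redis, any instance
        // can serve any request, so affinity is no longer needed
        return new LettuceConnectionFactory("redis-host", 6379);
    }
}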

Kubernetes load balancing example:

apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
  type: ClusterIP
  sessionAffinity: None  # Disable sticky sessions (no ClientIP affinity)
  ports:
    - port: 80
      targetPort: 8080
# Note: a core Service has no load-balancing algorithm field; kube-proxy
# routing is effectively random/round-robin. A least-connections algorithm
# is configured at the ingress controller or service mesh, not here.

Monitoring:

Per-instance metrics:
- Instance A: 50% CPU, 8K req/sec
- Instance B: 25% CPU, 4K req/sec ← Uneven!
- Instance C: 75% CPU, 12K req/sec

Action: Check if Instance B is slow, remove from pool, rebalance
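
Related to the health-check row above: the /health endpoint should reflect real dependencies, otherwise the balancer keeps routing to a broken instance. A hedged Spring Boot Actuator sketch with a custom HealthIndicator:

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class DatabaseHealthIndicator implements HealthIndicator {

    private final DataSource dataSource;

    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        // Report DOWN if the database is unreachable so the load
        // balancer stops routing traffic to this instance
        try (Connection conn = dataSource.getConnection()) {
            return conn.isValid(1) ? Health.up().build()
                                   : Health.down().withDetail("db", "connection invalid").build();
        } catch (SQLException e) {
            return Health.down(e).build();
        }
    }
}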

Q: A database becomes the bottleneck. How do you optimize?

Answer

Optimization hierarchy:

1. Query optimization
   - Add indexes, use EXPLAIN
   - Avoid N+1 queries (see the fetch-join sketch after this list)
   - Batch operations (see the batch-insert sketch after the SQL example)

2. Caching
   - Redis for hot data
   - Cache-aside pattern
   - Invalidation strategy

3. Read replicas
   - Offload reads to read-only followers
   - Trade consistency for throughput

4. Sharding
   - Partition by tenant or key
   - Requires app-level routing
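
The N+1 fix from item 1, as a Spring Data sketch (entity and field names are illustrative): one fetch join replaces a query per parent row.

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface OrderJpaRepository extends JpaRepository<Order, String> {

    // BEFORE: loading orders, then touching order.getLines() lazily,
    // issues 1 + N queries. AFTER: one query fetches both levels.
    @Query("select distinct o from Order o join fetch o.lines " +
           "where o.customerId = :customerId")
    List<Order> findWithLinesByCustomerId(@Param("customerId") String customerId);
}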

Query optimization checklist:

-- BEFORE (slow):
SELECT o.* FROM orders o
WHERE o.customer_id = ?;
-- No index → table scan

-- AFTER (fast):
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Now uses index → O(log N)

-- EXPLAIN now shows an index lookup instead of a full table scan
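
Batching (also item 1) collapses many round trips into one. A minimal JdbcTemplate sketch; the table, columns, and the OrderLine record are illustrative:

import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

public record OrderLine(String orderId, String sku, int qty) {}

@Repository
public class OrderLineWriter {

    private final JdbcTemplate jdbcTemplate;

    public OrderLineWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void insertAll(List<OrderLine> lines) {
        // One prepared statement, executed in batches of 500 rows,
        // instead of one network round trip per row
        jdbcTemplate.batchUpdate(
                "INSERT INTO order_lines (order_id, sku, qty) VALUES (?, ?, ?)",
                lines, 500,
                (ps, line) -> {
                    ps.setString(1, line.orderId());
                    ps.setString(2, line.sku());
                    ps.setInt(3, line.qty());
                });
    }
}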

Caching pattern:

@Service
public class ProductService {

    private final ProductRepository productRepository;

    public ProductService(ProductRepository productRepository) {
        this.productRepository = productRepository;
    }

    @Cacheable(value = "products", key = "#productId")
    public Product getProduct(String productId) {
        // Body runs only on a cache miss; the result is then cached
        return productRepository.findById(productId).orElseThrow();
    }

    @CacheEvict(value = "products", key = "#productId")
    public void updateProduct(String productId, Product update) {
        productRepository.save(update);  // write-through: evict the stale entry
    }
}
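
Item 2's invalidation strategy needs a backstop: a TTL bounds staleness even when an eviction is missed. A hedged configuration sketch, assuming Redis as the cache store via spring-boot-starter-data-redis:

import java.time.Duration;

import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory factory) {
        // Entries expire after 10 minutes even if @CacheEvict never fires
        RedisCacheConfiguration defaults = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofMinutes(10));
        return RedisCacheManager.builder(factory).cacheDefaults(defaults).build();
    }
}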

Read replica routing:

// A facade that routes writes to the primary and reads to a replica.
// Assumes two Spring Data repositories bound to different DataSources.
@Repository
public class OrderRepository {

    private final OrderJpaRepository primary;
    private final OrderJpaRepository replica;

    public OrderRepository(@Qualifier("primaryRepo") OrderJpaRepository primary,
                           @Qualifier("replicaRepo") OrderJpaRepository replica) {
        this.primary = primary;
        this.replica = replica;
    }

    // Writes always go to the primary
    public Order save(Order order) {
        return primary.save(order);
    }

    // Reads go to a replica and may lag the primary slightly
    public Optional<Order> findById(String id) {
        return replica.findById(id);
    }
}

Q: A sudden traffic spike crashes services. How do you scale and stabilize?

Answer

Multi-layer response:

Spike detected (CPU > 80%, errors rising)?
├─ Immediate (< 1 sec)
│  ├─ Rate limiting: reject new requests
│  ├─ Load shedding: drop low-priority traffic
│  └─ Circuit breaker: stop calling failing services
├─ Short-term (10-60 sec)
│  ├─ Auto-scaling: spin up new pods
│  ├─ Message queue: buffer requests
│  └─ Cache: serve stale data
└─ Long-term (> 1 min)
   ├─ Database optimization
   ├─ Code profiling & optimization
   └─ Infrastructure changes
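
For the immediate layer, a minimal Resilience4j rate-limiter sketch (assumes resilience4j-ratelimiter on the classpath; the limits are illustrative) that fails fast instead of letting requests pile up:

import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;

public class OrderRateLimiter {

    private final RateLimiter limiter = RateLimiter.of("orders",
            RateLimiterConfig.custom()
                    .limitForPeriod(1000)                       // 1000 requests...
                    .limitRefreshPeriod(Duration.ofSeconds(1))  // ...per second
                    .timeoutDuration(Duration.ZERO)             // reject instead of queueing
                    .build());

    public <T> T callGuarded(Supplier<T> call) {
        // Throws RequestNotPermitted immediately once the budget is
        // spent, which the edge can map to HTTP 429
        return RateLimiter.decorateSupplier(limiter, call).get();
    }
}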

Kubernetes auto-scaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100  # Double replicas
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50   # Reduce by 50%
          periodSeconds: 60

Diagram — Complete Scaling Architecture

graph LR
    Traffic["Traffic Spike\n100x normal"]
    RateLimit["Rate Limiter\n· Reject excess"]
    Queue["Message Queue\n· Buffer requests"]
    HPA["HPA\n· Scale 3→20 pods"]
    Cache["Cache\n· Serve stale data"]
    DB["Database\n· Read replicas"]

    Traffic -->|Phase 1: Block excess| RateLimit
    RateLimit -->|Phase 2: Buffer| Queue
    Queue -->|Phase 3: Scale out| HPA
    HPA -->|Phase 4: Degrade| Cache
    Cache -->|Phase 5: Distribute reads| DB

    style RateLimit fill:#ff6b6b
    style Queue fill:#ffe066
    style HPA fill:#51cf66
    style Cache fill:#4ecdc4