CI/CD & Team Leadership — Architect-Level Interview Guide
Target: Senior Engineer · Engineering Lead · Pre-Architect
Focus: CI/CD pipelines, monolith migration, team dynamics, production incidents
Q: Design a CI/CD pipeline for a Spring Boot microservice. Walk me through the stages.
Why interviewers ask this: Tests real-world delivery experience — not just "do you know CI/CD" but whether you can design a production-grade pipeline.
Answer
Pipeline stages:
graph LR
Commit["Git Push\n· Feature branch"]
Build["Build\n· mvn compile\n· mvn test"]
Quality["Quality Gate\n· SonarQube\n· Coverage > 80%"]
Docker["Container Build\n· docker build\n· Trivy scan"]
IntTest["Integration Tests\n· Testcontainers\n· API contract tests"]
Staging["Deploy Staging\n· Helm upgrade\n· Smoke tests"]
Approval["Manual Approval\n· Team lead\n· Or auto if < N%"]
Prod["Deploy Prod\n· Canary 10%\n· Promote to 100%"]
Rollback["Rollback\n· Auto on error rate"]
Commit --> Build
Build -->|Tests pass| Quality
Quality -->|Gate pass| Docker
Docker -->|No CVEs| IntTest
IntTest -->|Pass| Staging
Staging -->|Smoke pass| Approval
Approval -->|Approved| Prod
Prod -->|Error rate high| Rollback
style Rollback fill:#ff6b6b
style Prod fill:#51cf66
Jenkins pipeline (Jenkinsfile):
pipeline {
    agent any

    environment {
        IMAGE = "myrepo/order-service"
        VERSION = "${env.GIT_COMMIT[0..7]}"
    }

    stages {
        stage('Build & Unit Test') {
            steps {
                sh 'mvn clean verify -Dskip.integration=true'
                junit 'target/surefire-reports/*.xml'
                jacoco minimumLineCoverage: '80'
            }
        }

        stage('Code Quality') {
            steps {
                withSonarQubeEnv('SonarCloud') {
                    sh 'mvn sonar:sonar'
                }
                timeout(time: 5, unit: 'MINUTES') {
                    waitForQualityGate abortPipeline: true
                }
            }
        }

        stage('Build & Scan Image') {
            steps {
                sh "docker build -t ${IMAGE}:${VERSION} ."
                sh "trivy image --exit-code 1 --severity HIGH,CRITICAL ${IMAGE}:${VERSION}"
                sh "docker push ${IMAGE}:${VERSION}"
            }
        }

        stage('Integration Tests') {
            steps {
                // Testcontainers spins up real DB, Kafka, etc.
                sh 'mvn verify -P integration-tests'
            }
        }

        stage('Deploy to Staging') {
            steps {
                sh """
                    helm upgrade --install order-service ./helm/order-service \
                        --namespace staging \
                        --set image.tag=${VERSION} \
                        --wait --timeout 5m
                """
                // Run smoke tests
                sh 'newman run api-tests/smoke.postman_collection.json'
            }
        }

        stage('Deploy to Production — Canary') {
            when { branch 'main' }
            steps {
                input message: 'Deploy to production?', ok: 'Deploy'
                sh """
                    helm upgrade order-service ./helm/order-service \
                        --namespace production \
                        --set image.tag=${VERSION} \
                        --set canary.enabled=true \
                        --set canary.weight=10
                """
                // Monitor for 10 minutes, then promote or rollback
                sh './scripts/canary-monitor.sh --threshold-error-rate=1 --duration=10m'
            }
        }
    }

    post {
        failure { slackSend channel: '#deployments', message: "❌ Pipeline failed: ${env.JOB_NAME} ${VERSION}" }
        success { slackSend channel: '#deployments', message: "✅ Deployed: ${env.JOB_NAME} ${VERSION}" }
    }
}
Key principles:
| Principle | Implementation |
|---|---|
| Fail fast | Unit tests first, before expensive stages |
| Immutable artifacts | Image tag = git commit SHA, never latest |
| Security scanning | CVE scan before pushing to registry |
| Contract testing | Verify producer/consumer compatibility (Pact) |
| Progressive delivery | Canary → monitor → promote or rollback |
| Audit trail | Every deployment linked to git commit + who approved |
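The contract-testing principle above is usually enforced with a consumer-driven contract test that runs alongside the unit or integration test stage. A minimal sketch using Pact JVM with JUnit 5 follows; the provider and consumer names, the /orders/42 interaction, and the exact package names (which vary between Pact JVM versions) are illustrative assumptions, not part of the pipeline above.

// Hypothetical consumer-side contract test with Pact JVM (JUnit 5).
// Names and packages are assumptions; check your Pact JVM version's docs.
import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.PactSpecVersion;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import static org.junit.jupiter.api.Assertions.assertEquals;

@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "order-service", pactVersion = PactSpecVersion.V3)
class OrderServiceConsumerPactTest {

    @Pact(consumer = "web-checkout")
    RequestResponsePact orderById(PactDslWithProvider builder) {
        // The consumer states exactly what it relies on, nothing more.
        return builder
                .given("order 42 exists")
                .uponReceiving("a request for order 42")
                .path("/orders/42")
                .method("GET")
                .willRespondWith()
                .status(200)
                .body(new PactDslJsonBody()
                        .integerType("id", 42)
                        .stringType("status", "PAID"))
                .toPact();
    }

    @Test
    void fetchesOrder(MockServer mockServer) throws Exception {
        // The mock server is generated from the pact above; the produced contract
        // file is later verified against the real provider in CI.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/orders/42")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        assertEquals(200, response.statusCode());
    }
}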
Q: How do you handle an outage in an upstream service your team doesn't own?
Why interviewers ask this: Tests maturity, communication skills, system thinking, and ownership mindset.
Answer
Immediate actions (first 15 minutes):
- Detect — alert fires, or user report. Open incident channel immediately.
- Assess blast radius — which of our user journeys are affected?
- Verify it's not us — check our own error rates and any deployments in the last 2 hours.
- Activate fallback — circuit breaker should already be tripping. Confirm it is.
- Contact their on-call — check their status page, open a P1 with their team.
- Communicate to stakeholders — brief, factual update every 15 minutes.
Technical response:
graph LR
Alert["Alert fires\n· High error rate"]
Diagnose["Diagnose\n· Our deploy? Their issue?"]
OurIssue["Our issue\n→ Rollback"]
TheirIssue["Upstream issue\n→ Activate fallback"]
Circuit["Circuit Breaker\n· Already open?"]
Degraded["Degraded mode\n· Stale data / partial results"]
Queue["Queue requests\n· Retry when upstream recovers"]
Comm["Communicate\n· Status page · Stakeholders"]
Alert --> Diagnose
Diagnose --> OurIssue
Diagnose --> TheirIssue
TheirIssue --> Circuit
Circuit -->|Yes| Degraded
Circuit -->|Not yet, manually open| Degraded
Degraded --> Queue
Degraded --> Comm
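For the "not yet, manually open" branch in the diagram, Resilience4j lets you force a breaker open so callers fail fast into their fallbacks instead of waiting on a dead upstream. A minimal sketch, assuming a Spring-managed CircuitBreakerRegistry and an internal admin endpoint (both illustrative, and something you would put behind authentication in practice):

// Hedged sketch: manually forcing a Resilience4j circuit breaker open during an
// incident. The breaker name handling and the admin endpoint are assumptions,
// not part of the original runbook.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
class CircuitBreakerAdminController {

    private final CircuitBreakerRegistry registry;

    CircuitBreakerAdminController(CircuitBreakerRegistry registry) {
        this.registry = registry;
    }

    @PostMapping("/internal/circuit-breakers/{name}/open")
    void forceOpen(@PathVariable String name) {
        CircuitBreaker breaker = registry.circuitBreaker(name);
        if (breaker.getState() != CircuitBreaker.State.OPEN) {
            breaker.transitionToOpenState();   // fail fast until the upstream recovers
        }
    }

    @PostMapping("/internal/circuit-breakers/{name}/close")
    void close(@PathVariable String name) {
        registry.circuitBreaker(name).transitionToClosedState();
    }
}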
Degradation strategies by service type:
| Upstream service | Fallback strategy |
|---|---|
| Product catalog | Serve stale cached data (Redis, CDN) |
| Payment service | Queue orders, notify user of delayed processing |
| Recommendation engine | Return empty/default recommendations |
| User profile | Use cached profile; disable personalized features |
| Search service | Fall back to simple DB query |
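As a concrete example of the first row (serve stale cached data), here is a hedged sketch using Resilience4j's @CircuitBreaker with a fallback method in a Spring Boot service. The client interface, breaker name, and in-memory cache are assumptions; a real service would more likely back the fallback with Redis or Caffeine and a TTL.

// Hedged sketch of the "serve stale cached data" fallback from the table above.
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

@Service
public class ProductCatalogService {

    private final ProductCatalogClient client;                                  // hypothetical HTTP client
    private final Map<String, Product> staleCache = new ConcurrentHashMap<>();  // last known-good data

    public ProductCatalogService(ProductCatalogClient client) {
        this.client = client;
    }

    @CircuitBreaker(name = "productCatalog", fallbackMethod = "cachedProduct")
    public Product getProduct(String id) {
        Product fresh = client.fetchProduct(id);
        staleCache.put(id, fresh);   // refresh the fallback copy on every successful call
        return fresh;
    }

    // Invoked by Resilience4j when the breaker is open or the call fails:
    // degrade gracefully to the last cached version rather than erroring out.
    Product cachedProduct(String id, Throwable cause) {
        Product cached = staleCache.get(id);
        if (cached == null) {
            throw new ProductUnavailableException(id, cause);   // hypothetical exception
        }
        return cached;
    }

    public interface ProductCatalogClient {
        Product fetchProduct(String id);
    }

    public record Product(String id, String name, long priceCents) {}

    public static class ProductUnavailableException extends RuntimeException {
        ProductUnavailableException(String id, Throwable cause) {
            super("Product " + id + " unavailable", cause);
        }
    }
}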
Post-incident:
- Root cause analysis within 48 hours
- Add new observability: can we detect their degradation faster?
- Review SLA: do we have contractual protection?
- Consider adding a local cache/replica to reduce dependency
Architect Insight
The best engineers design systems that are resilient to upstream failures before they happen. If your system goes down whenever an upstream service does, the dependency is too tight. Ask: "If this service is down for 4 hours, what can we still do?" Design and implement that answer proactively.
Q: How do you onboard new engineers to a complex microservices system?
Answer
The problem: New engineers face an overwhelming number of services, repositories, conventions, and tools. A bad onboarding experience can delay full productivity by months.
A structured 30-60-90 day approach:
Week 1 — Read the docs, run the system locally:
- Maintain an ARCHITECTURE.md in the root repo: service map, team ownership, runbooks
- Provide a docker-compose.yml that spins up the full system locally
- Document the "golden path" — the simplest feature end-to-end
Week 2-3 — Guided contribution:
- Assign a simple bug fix or small feature in a well-understood service
- Pair programming sessions to explain conventions
- Code review every PR with detailed explanations (not just "LGTM")
30 days — Unguided contribution:
- Own a small feature from design to deployment
- On-call shadow (observe, don't handle)
60 days — Production ownership:
- Primary on-call for one service
- Conduct a postmortem and present findings
Key artifacts to maintain:
| Document | Contents |
|---|---|
| ARCHITECTURE.md | Service map, team ownership, key tech choices |
| Service README | What it does, how to run, how to test, common issues |
| Runbook | How to diagnose and fix common alerts |
| ADR (Architecture Decision Records) | Why key decisions were made |
| API contracts | OpenAPI specs, event schemas |
Q: How do you prioritize fixing tech debt in a live microservices ecosystem?
Answer
Not all tech debt is equal. Categorize before prioritizing:
| Debt Type | Risk | Priority | Examples |
|---|---|---|---|
| Security vulnerabilities | 🔴 Critical | Fix immediately | CVEs, hardcoded secrets |
| Reliability debt | 🔴 High | Fix next sprint | No circuit breakers, no health checks |
| Scalability debt | 🟡 Medium | Plan this quarter | Synchronous chains, shared DB |
| Code quality debt | 🟢 Low | Continuous improvement | Duplication, poor naming |
| Documentation debt | 🟢 Low | Continuous | Missing runbooks, stale docs |
Framework — Tech Debt Backlog Scoring:
Priority Score = Impact × Likelihood × Effort_Inverse
- Impact (1-5): How much does this hurt users or on-call?
- Likelihood (1-5): How often does it cause problems?
- Effort_Inverse (1-5): 5 = quick fix, 1 = months of work
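A minimal sketch of how the scoring plays out in practice; the debt items and their scores below are invented purely to show how the ranking falls out.

// Toy example of the Priority Score formula above. All items and scores are hypothetical.
import java.util.Comparator;
import java.util.List;

public class TechDebtBacklog {

    record DebtItem(String name, int impact, int likelihood, int effortInverse) {
        int priority() {
            return impact * likelihood * effortInverse;   // Impact × Likelihood × Effort_Inverse
        }
    }

    public static void main(String[] args) {
        List<DebtItem> backlog = List.of(
                new DebtItem("Hardcoded secret in config repo", 5, 5, 5),     // -> 125, fix immediately
                new DebtItem("No circuit breaker on payment calls", 4, 3, 4), // -> 48
                new DebtItem("Duplicated mapping code", 2, 2, 3));            // -> 12

        backlog.stream()
                .sorted(Comparator.comparingInt(DebtItem::priority).reversed())
                .forEach(item -> System.out.printf("%-40s score=%d%n", item.name(), item.priority()));
    }
}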
Structural approach:
- Make it visible — add tech debt to the backlog, not a separate "debt board" that gets ignored
- 20% rule — reserve 20% of every sprint for tech debt (non-negotiable)
- Boy Scout Rule — leave code cleaner than you found it (refactor during feature work)
- Seize migration opportunities — when touching a module for a feature, refactor the surrounding debt
- Track debt metrics — SonarQube quality gate, test coverage trend, build time trend
What NOT to do:
- ❌ "Debt sprint" — batching all debt into one sprint that never happens
- ❌ Letting security CVEs sit in the backlog with low priority
- ❌ Measuring debt by number of TODO comments (meaningless)
Q: How do you balance team autonomy with architectural consistency?
Answer
This is one of the core tensions in microservices organizations.
Too much autonomy:
- Every service uses a different tech stack
- No shared observability → debugging becomes a nightmare
- Duplicated solutions to the same problems
- New engineers lost — nothing is familiar
Too much control:
- Slows down teams, creates bottlenecks at architecture review board
- One-size-fits-all decisions that don't fit some teams
- "Platform team" becomes a blocker
The solution — Paved Roads + Golden Paths:
Define what teams must do (non-negotiable standards) vs what teams choose (autonomous decisions):
| Must Do (Non-negotiable) | Can Choose (Autonomous) |
|---|---|
| Emit structured logs to centralized system | Log framework (Logback, Log4j2) |
| Expose /actuator/health endpoints | Framework (Spring Boot, Quarkus, Micronaut) |
| Use approved container base images | Internal framework version |
| Follow API contract versioning rules | API design style |
| Pass security scanning in CI | Test framework |
| Use approved message broker (Kafka) | Message serialization format |
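As an illustration of the health-endpoint must-do, a custom Actuator HealthIndicator is usually all a team adds beyond the starter dependency. This sketch assumes a hypothetical Kafka connectivity probe; the bean name and details are illustrative.

// Hedged sketch: a custom health indicator contributing to the mandatory
// /actuator/health endpoint. The probe interface is an assumption.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component("kafka")   // shows up as the "kafka" component in /actuator/health
public class KafkaHealthIndicator implements HealthIndicator {

    private final KafkaConnectivityProbe probe;   // hypothetical wrapper around an AdminClient call

    public KafkaHealthIndicator(KafkaConnectivityProbe probe) {
        this.probe = probe;
    }

    @Override
    public Health health() {
        try {
            probe.ping();
            return Health.up().withDetail("broker", probe.bootstrapServers()).build();
        } catch (Exception e) {
            // DOWN flips the aggregate /actuator/health status, which the Helm
            // readiness probe and the pipeline smoke tests rely on.
            return Health.down(e).build();
        }
    }

    public interface KafkaConnectivityProbe {
        void ping();
        String bootstrapServers();
    }
}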
Architecture Decision Records (ADRs): Document every architectural decision with context, decision, and consequences. Stored in Git. Teams can propose ADRs — the architecture guild approves.
# ADR-012: All services must expose Prometheus metrics via Micrometer
## Status: Accepted
## Context
We cannot consistently monitor 30+ services without a standard metrics format.
## Decision
All services must include `micrometer-registry-prometheus` and expose
`/actuator/prometheus`. Teams choose their own dashboards.
## Consequences
+ Unified metrics in Grafana across all services
- Teams using non-Spring frameworks need equivalent setup
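What complying with ADR-012 looks like in service code: a hedged sketch of Micrometer instrumentation that, with micrometer-registry-prometheus on the classpath, is exposed automatically at /actuator/prometheus. The metric and tag names are assumptions.

// Hedged sketch of service-level instrumentation under ADR-012.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.function.Supplier;

@Component
public class OrderMetrics {

    private final Counter ordersPlaced;
    private final Timer checkoutLatency;

    public OrderMetrics(MeterRegistry registry) {
        this.ordersPlaced = Counter.builder("orders.placed")
                .description("Orders accepted by this instance")
                .tag("service", "order-service")
                .register(registry);
        this.checkoutLatency = Timer.builder("checkout.latency")
                .description("End-to-end checkout processing time")
                .publishPercentileHistogram()        // enables Prometheus histogram buckets
                .register(registry);
    }

    public void recordOrderPlaced() {
        ordersPlaced.increment();
    }

    public <T> T timeCheckout(Supplier<T> work) {
        return checkoutLatency.record(work);         // wraps the call and records its duration
    }
}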
Architect Insight
The Platform Team's job is to make the right thing easy and the wrong thing hard. Build internal libraries, templates, and Helm charts that embed the standards. If the opinionated path is also the easiest path, teams will follow it naturally without enforcement.