CI/CD & Team Leadership — Architect-Level Interview Guide
Target: Senior Engineer · Engineering Lead · Pre-Architect
Focus: CI/CD pipelines, monolith migration, team dynamics, production incidents
Q: Design a CI/CD pipeline for a Spring Boot microservice. Walk me through the stages.
Why interviewers ask this: Tests real-world delivery experience — not just "do you know CI/CD" but whether you can design a production-grade pipeline.
Answer
Pipeline stages:
graph LR
Commit["Git Push\n· Feature branch"]
Build["Build\n· mvn compile\n· mvn test"]
Quality["Quality Gate\n· SonarQube\n· Coverage > 80%"]
Docker["Container Build\n· docker build\n· Trivy scan"]
IntTest["Integration Tests\n· Testcontainers\n· API contract tests"]
Staging["Deploy Staging\n· Helm upgrade\n· Smoke tests"]
Approval["Manual Approval\n· Team lead\n· Or auto if < N%"]
Prod["Deploy Prod\n· Canary 10%\n· Promote to 100%"]
Rollback["Rollback\n· Auto on error rate"]
Commit --> Build
Build -->|Tests pass| Quality
Quality -->|Gate pass| Docker
Docker -->|No CVEs| IntTest
IntTest -->|Pass| Staging
Staging -->|Smoke pass| Approval
Approval -->|Approved| Prod
Prod -->|Error rate high| Rollback
style Rollback fill:#ff6b6b
style Prod fill:#51cf66
Jenkins pipeline (Jenkinsfile):
pipeline {
    agent any

    environment {
        IMAGE = "myrepo/order-service"
        VERSION = "${env.GIT_COMMIT[0..7]}"
    }

    stages {
        stage('Build & Unit Test') {
            steps {
                sh 'mvn clean verify -Dskip.integration=true'
                junit 'target/surefire-reports/*.xml'
                jacoco minimumLineCoverage: '80'
            }
        }

        stage('Code Quality') {
            steps {
                withSonarQubeEnv('SonarCloud') {
                    sh 'mvn sonar:sonar'
                }
                timeout(time: 5, unit: 'MINUTES') {
                    waitForQualityGate abortPipeline: true
                }
            }
        }

        stage('Build & Scan Image') {
            steps {
                sh "docker build -t ${IMAGE}:${VERSION} ."
                sh "trivy image --exit-code 1 --severity HIGH,CRITICAL ${IMAGE}:${VERSION}"
                sh "docker push ${IMAGE}:${VERSION}"
            }
        }

        stage('Integration Tests') {
            steps {
                // Testcontainers spins up real DB, Kafka, etc.
                sh 'mvn verify -P integration-tests'
            }
        }

        stage('Deploy to Staging') {
            steps {
                sh """
                    helm upgrade --install order-service ./helm/order-service \
                        --namespace staging \
                        --set image.tag=${VERSION} \
                        --wait --timeout 5m
                """
                // Run smoke tests
                sh 'newman run api-tests/smoke.postman_collection.json'
            }
        }

        stage('Deploy to Production — Canary') {
            when { branch 'main' }
            steps {
                input message: 'Deploy to production?', ok: 'Deploy'
                sh """
                    helm upgrade order-service ./helm/order-service \
                        --namespace production \
                        --set image.tag=${VERSION} \
                        --set canary.enabled=true \
                        --set canary.weight=10
                """
                // Monitor for 10 minutes, then promote or rollback
                sh './scripts/canary-monitor.sh --threshold-error-rate=1 --duration=10m'
            }
        }
    }

    post {
        failure { slackSend channel: '#deployments', message: "❌ Pipeline failed: ${env.JOB_NAME} ${VERSION}" }
        success { slackSend channel: '#deployments', message: "✅ Deployed: ${env.JOB_NAME} ${VERSION}" }
    }
}
Key principles:
| Principle | Implementation |
|---|---|
| Fail fast | Unit tests first, before expensive stages |
| Immutable artifacts | Image tag = git commit SHA, never latest |
| Security scanning | CVE scan before pushing to registry |
| Contract testing | Verify producer/consumer compatibility (Pact) |
| Progressive delivery | Canary → monitor → promote or rollback |
| Audit trail | Every deployment linked to git commit + who approved |
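The contract-testing principle above is usually enforced with a consumer-driven contract test that runs alongside the unit or integration test stage. A minimal sketch using Pact JVM with JUnit 5 follows; the provider and consumer names, the /orders/42 interaction, and the exact package names (which vary between Pact JVM versions) are illustrative assumptions, not part of the pipeline above.

// Hypothetical consumer-side contract test with Pact JVM (JUnit 5).
// Names and packages are assumptions; check your Pact JVM version's docs.
import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.PactSpecVersion;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import static org.junit.jupiter.api.Assertions.assertEquals;

@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "order-service", pactVersion = PactSpecVersion.V3)
class OrderServiceConsumerPactTest {

    @Pact(consumer = "web-checkout")
    RequestResponsePact orderById(PactDslWithProvider builder) {
        // The consumer states exactly what it relies on, nothing more.
        return builder
                .given("order 42 exists")
                .uponReceiving("a request for order 42")
                .path("/orders/42")
                .method("GET")
                .willRespondWith()
                .status(200)
                .body(new PactDslJsonBody()
                        .integerType("id", 42)
                        .stringType("status", "PAID"))
                .toPact();
    }

    @Test
    void fetchesOrder(MockServer mockServer) throws Exception {
        // The mock server is generated from the pact above; the produced contract
        // file is later verified against the real provider in CI.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/orders/42")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        assertEquals(200, response.statusCode());
    }
}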
Q: How do you handle an outage in an upstream service your team doesn't own?
Why interviewers ask this: Tests maturity, communication skills, system thinking, and ownership mindset.
Answer
Immediate actions (first 15 minutes):
- Detect — alert fires, or user report. Open incident channel immediately.
- Assess blast radius — which of our user journeys are affected?
- Verify it's not us — check our own error rates and any deployments in the last 2 hours.
- Activate fallback — circuit breaker should already be tripping. Confirm it is.
- Contact their on-call — check their status page, open a P1 with their team.
- Communicate to stakeholders — brief, factual update every 15 minutes.
Technical response:
graph LR
Alert["Alert fires\n· High error rate"]
Diagnose["Diagnose\n· Our deploy? Their issue?"]
OurIssue["Our issue\n→ Rollback"]
TheirIssue["Upstream issue\n→ Activate fallback"]
Circuit["Circuit Breaker\n· Already open?"]
Degraded["Degraded mode\n· Stale data / partial results"]
Queue["Queue requests\n· Retry when upstream recovers"]
Comm["Communicate\n· Status page · Stakeholders"]
Alert --> Diagnose
Diagnose --> OurIssue
Diagnose --> TheirIssue
TheirIssue --> Circuit
Circuit -->|Yes| Degraded
Circuit -->|Not yet, manually open| Degraded
Degraded --> Queue
Degraded --> Comm
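For the "not yet, manually open" branch in the diagram, Resilience4j lets you force a breaker open so callers fail fast into their fallbacks instead of waiting on a dead upstream. A minimal sketch, assuming a Spring-managed CircuitBreakerRegistry and an internal admin endpoint (both illustrative, and something you would put behind authentication in practice):

// Hedged sketch: manually forcing a Resilience4j circuit breaker open during an
// incident. The breaker name handling and the admin endpoint are assumptions,
// not part of the original runbook.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
class CircuitBreakerAdminController {

    private final CircuitBreakerRegistry registry;

    CircuitBreakerAdminController(CircuitBreakerRegistry registry) {
        this.registry = registry;
    }

    @PostMapping("/internal/circuit-breakers/{name}/open")
    void forceOpen(@PathVariable String name) {
        CircuitBreaker breaker = registry.circuitBreaker(name);
        if (breaker.getState() != CircuitBreaker.State.OPEN) {
            breaker.transitionToOpenState();   // fail fast until the upstream recovers
        }
    }

    @PostMapping("/internal/circuit-breakers/{name}/close")
    void close(@PathVariable String name) {
        registry.circuitBreaker(name).transitionToClosedState();
    }
}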
Degradation strategies by service type:
| Upstream service | Fallback strategy |
|---|---|
| Product catalog | Serve stale cached data (Redis, CDN) |
| Payment service | Queue orders, notify user of delayed processing |
| Recommendation engine | Return empty/default recommendations |
| User profile | Use cached profile; disable personalized features |
| Search service | Fall back to simple DB query |
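As a concrete example of the first row (serve stale cached data), here is a hedged sketch using Resilience4j's @CircuitBreaker with a fallback method in a Spring Boot service. The client interface, breaker name, and in-memory cache are assumptions; a real service would more likely back the fallback with Redis or Caffeine and a TTL.

// Hedged sketch of the "serve stale cached data" fallback from the table above.
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

@Service
public class ProductCatalogService {

    private final ProductCatalogClient client;                                  // hypothetical HTTP client
    private final Map<String, Product> staleCache = new ConcurrentHashMap<>();  // last known-good data

    public ProductCatalogService(ProductCatalogClient client) {
        this.client = client;
    }

    @CircuitBreaker(name = "productCatalog", fallbackMethod = "cachedProduct")
    public Product getProduct(String id) {
        Product fresh = client.fetchProduct(id);
        staleCache.put(id, fresh);   // refresh the fallback copy on every successful call
        return fresh;
    }

    // Invoked by Resilience4j when the breaker is open or the call fails:
    // degrade gracefully to the last cached version rather than erroring out.
    Product cachedProduct(String id, Throwable cause) {
        Product cached = staleCache.get(id);
        if (cached == null) {
            throw new ProductUnavailableException(id, cause);   // hypothetical exception
        }
        return cached;
    }

    public interface ProductCatalogClient {
        Product fetchProduct(String id);
    }

    public record Product(String id, String name, long priceCents) {}

    public static class ProductUnavailableException extends RuntimeException {
        ProductUnavailableException(String id, Throwable cause) {
            super("Product " + id + " unavailable", cause);
        }
    }
}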
Post-incident:
- Root cause analysis within 48 hours
- Add new observability: can we detect their degradation faster?
- Review SLA: do we have contractual protection?
- Consider adding a local cache/replica to reduce dependency
Architect Insight
The best engineers design systems that are resilient to upstream failures before they happen. If your system goes down whenever an upstream service does, the dependency is too tight. Ask: "If this service is down for 4 hours, what can we still do?" Design and implement that answer proactively.
Q: How do you onboard new engineers to a complex microservices system?
Answer
The problem: New engineers face an overwhelming number of services, repositories, conventions, and tools. A bad onboarding experience can delay full productivity by months.
A structured 30-60-90 day approach:
Week 1 — Read the docs, run the system locally:
- Maintain an ARCHITECTURE.md in the root repo: service map, team ownership, runbooks
- Provide a docker-compose.yml that spins up the full system locally
- Document the "golden path" — the simplest feature end-to-end
Week 2-3 — Guided contribution:
- Assign a simple bug fix or small feature in a well-understood service
- Pair programming sessions to explain conventions
- Code review every PR with detailed explanations (not just "LGTM")
30 days — Unguided contribution:
- Own a small feature from design to deployment
- On-call shadow (observe, don't handle)
60 days — Production ownership:
- Primary on-call for one service
- Conduct a postmortem and present findings
Key artifacts to maintain:
| Document | Contents |
|---|---|
| ARCHITECTURE.md | Service map, team ownership, key tech choices |
| Service README | What it does, how to run, how to test, common issues |
| Runbook | How to diagnose and fix common alerts |
| ADR (Architecture Decision Records) | Why key decisions were made |
| API contracts | OpenAPI specs, event schemas |
Q: How do you prioritize fixing tech debt in a live microservices ecosystem?
Answer
Not all tech debt is equal. Categorize before prioritizing:
| Debt Type | Risk | Priority | Examples |
|---|---|---|---|
| Security vulnerabilities | 🔴 Critical | Fix immediately | CVEs, hardcoded secrets |
| Reliability debt | 🔴 High | Fix next sprint | No circuit breakers, no health checks |
| Scalability debt | 🟡 Medium | Plan this quarter | Synchronous chains, shared DB |
| Code quality debt | 🟢 Low | Continuous improvement | Duplication, poor naming |
| Documentation debt | 🟢 Low | Continuous | Missing runbooks, stale docs |
Framework — Tech Debt Backlog Scoring:
Priority Score = Impact × Likelihood × Effort_Inverse
- Impact (1-5): How much does this hurt users or on-call?
- Likelihood (1-5): How often does it cause problems?
- Effort_Inverse (1-5): 5 = quick fix, 1 = months of work
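A minimal sketch of how the scoring plays out in practice; the debt items and their scores below are invented purely to show how the ranking falls out.

// Toy example of the Priority Score formula above. All items and scores are hypothetical.
import java.util.Comparator;
import java.util.List;

public class TechDebtBacklog {

    record DebtItem(String name, int impact, int likelihood, int effortInverse) {
        int priority() {
            return impact * likelihood * effortInverse;   // Impact × Likelihood × Effort_Inverse
        }
    }

    public static void main(String[] args) {
        List<DebtItem> backlog = List.of(
                new DebtItem("Hardcoded secret in config repo", 5, 5, 5),     // -> 125, fix immediately
                new DebtItem("No circuit breaker on payment calls", 4, 3, 4), // -> 48
                new DebtItem("Duplicated mapping code", 2, 2, 3));            // -> 12

        backlog.stream()
                .sorted(Comparator.comparingInt(DebtItem::priority).reversed())
                .forEach(item -> System.out.printf("%-40s score=%d%n", item.name(), item.priority()));
    }
}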
Structural approach:
- Make it visible — add tech debt to the backlog, not a separate "debt board" that gets ignored
- 20% rule — reserve 20% of every sprint for tech debt (non-negotiable)
- Boy Scout Rule — leave code cleaner than you found it (refactor during feature work)
- Seize migration opportunities — when touching a module for a feature, refactor the surrounding debt
- Track debt metrics — SonarQube quality gate, test coverage trend, build time trend
What NOT to do:
- ❌ "Debt sprint" — batching all debt into one sprint that never happens
- ❌ Letting security CVEs sit in the backlog with low priority
- ❌ Measuring debt by number of TODO comments (meaningless)
Q: How do you balance team autonomy with architectural consistency?
Answer
This is one of the core tensions in microservices organizations.
Too much autonomy:
- Every service uses a different tech stack
- No shared observability → debugging becomes a nightmare
- Duplicated solutions to the same problems
- New engineers lost — nothing is familiar
Too much control:
- Slows down teams, creates bottlenecks at architecture review board
- One-size-fits-all decisions that don't fit some teams
- "Platform team" becomes a blocker
The solution — Paved Roads + Golden Paths:
Define what teams must do (non-negotiable standards) vs what teams choose (autonomous decisions):
| Must Do (Non-negotiable) | Can Choose (Autonomous) |
|---|---|
| Emit structured logs to centralized system | Log framework (Logback, Log4j2) |
| Expose /actuator/health endpoints | Framework (Spring Boot, Quarkus, Micronaut) |
| Use approved container base images | Internal framework version |
| Follow API contract versioning rules | API design style |
| Pass security scanning in CI | Test framework |
| Use approved message broker (Kafka) | Message serialization format |
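As an illustration of the health-endpoint must-do, a custom Actuator HealthIndicator is usually all a team adds beyond the starter dependency. This sketch assumes a hypothetical Kafka connectivity probe; the bean name and details are illustrative.

// Hedged sketch: a custom health indicator contributing to the mandatory
// /actuator/health endpoint. The probe interface is an assumption.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component("kafka")   // shows up as the "kafka" component in /actuator/health
public class KafkaHealthIndicator implements HealthIndicator {

    private final KafkaConnectivityProbe probe;   // hypothetical wrapper around an AdminClient call

    public KafkaHealthIndicator(KafkaConnectivityProbe probe) {
        this.probe = probe;
    }

    @Override
    public Health health() {
        try {
            probe.ping();
            return Health.up().withDetail("broker", probe.bootstrapServers()).build();
        } catch (Exception e) {
            // DOWN flips the aggregate /actuator/health status, which the Helm
            // readiness probe and the pipeline smoke tests rely on.
            return Health.down(e).build();
        }
    }

    public interface KafkaConnectivityProbe {
        void ping();
        String bootstrapServers();
    }
}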
Architecture Decision Records (ADRs): Document every architectural decision with context, decision, and consequences. Stored in Git. Teams can propose ADRs — the architecture guild approves.
# ADR-012: All services must expose Prometheus metrics via Micrometer
## Status: Accepted
## Context
We cannot consistently monitor 30+ services without a standard metrics format.
## Decision
All services must include `micrometer-registry-prometheus` and expose
`/actuator/prometheus`. Teams choose their own dashboards.
## Consequences
+ Unified metrics in Grafana across all services
- Teams using non-Spring frameworks need equivalent setup
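What complying with ADR-012 looks like in service code: a hedged sketch of Micrometer instrumentation that, with micrometer-registry-prometheus on the classpath, is exposed automatically at /actuator/prometheus. The metric and tag names are assumptions.

// Hedged sketch of service-level instrumentation under ADR-012.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.function.Supplier;

@Component
public class OrderMetrics {

    private final Counter ordersPlaced;
    private final Timer checkoutLatency;

    public OrderMetrics(MeterRegistry registry) {
        this.ordersPlaced = Counter.builder("orders.placed")
                .description("Orders accepted by this instance")
                .tag("service", "order-service")
                .register(registry);
        this.checkoutLatency = Timer.builder("checkout.latency")
                .description("End-to-end checkout processing time")
                .publishPercentileHistogram()        // enables Prometheus histogram buckets
                .register(registry);
    }

    public void recordOrderPlaced() {
        ordersPlaced.increment();
    }

    public <T> T timeCheckout(Supplier<T> work) {
        return checkoutLatency.record(work);         // wraps the call and records its duration
    }
}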
Architect Insight
The Platform Team's job is to make the right thing easy and the wrong thing hard. Build internal libraries, templates, and Helm charts that embed the standards. If the opinionated path is also the easiest path, teams will follow it naturally without enforcement.