Design a CI/CD Deployment Pipeline
Interview Time: 60 min | Difficulty: Medium
Key Focus: Build automation, testing, deployment orchestration, rollback
Step 1: Functional & Non-Functional Requirements
Functional Requirements
- Developer pushes code to git repo (GitHub/GitLab)
- Automatic build trigger (compile, unit tests, linting)
- Run integration tests and code coverage checks
- Build Docker images, push to container registry
- Deploy to staging environment, run smoke tests
- Deploy to production with canary/blue-green strategy
- Monitor metrics, auto-rollback on errors
- Support feature flags for gradual rollout
- Store build artifacts, logs, and deployment history
- Notify the team of build/deployment status
- Support manual approvals before prod deployment
Non-Functional Requirements
| Requirement | Target | Notes |
|---|---|---|
| Scale | 1,000s of builds/day across teams | Parallel job execution |
| Latency | Build <10 min, deploy <5 min | Critical for developer velocity |
| Availability | 99.9% pipeline uptime | No lost commits |
| Consistency | Immutable, reproducible builds | Same code = same artifact |
| Scalability | 100 parallel job agents | Auto-scale on queue depth |
| Rollback Time | <2 min to revert to previous version | Emergency requirement |
Step 2: API Design, Data Model & High-Level Design
Core API Endpoints
```
# Builds
POST /builds
  {repo_url, branch, commit_sha, trigger_type: push|manual}
  → {build_id, status: queued}

GET /builds/{build_id}
  → {status, logs_url, artifact_urls, test_results}

GET /builds/{build_id}/logs?lines=100
  → {logs: [{timestamp, level, message}]}

# Deployments
POST /deployments
  {build_id, target_env: staging|production, strategy: canary|blue-green}
  → {deployment_id, status: in_progress}

GET /deployments/{deployment_id}
  → {status, current_version, target_version, progress: 0-100}

PUT /deployments/{deployment_id}/approve
  {approver_id}
  → {approved: true, proceed_to_prod: true}

PUT /deployments/{deployment_id}/rollback
  {reason}
  → {rolled_back_to_version}

# Artifacts
GET /artifacts/{build_id}
  → {docker_image_uri, build_artifacts: [url, checksum]}

# Feature Flags
POST /feature-flags
  {name, rollout_percentage: 0-100, target_users: []}
  → {flag_id}

GET /feature-flags/{flag_id}/status
  → {enabled: true, rollout_percentage, active_users}
```
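A brief client sketch against the endpoints above (the base URL is a hypothetical placeholder; `requests` handles the HTTP):

```python
import time
import requests  # pip install requests

CI_HOST = "https://ci.internal"  # hypothetical base URL

# Trigger a build for a specific commit.
resp = requests.post(f"{CI_HOST}/builds", json={
    "repo_url": "https://github.com/acme/checkout-service",
    "branch": "main",
    "commit_sha": "a1b2c3d",
    "trigger_type": "manual",
})
build_id = resp.json()["build_id"]

# Poll until the build reaches a terminal status.
while True:
    build = requests.get(f"{CI_HOST}/builds/{build_id}").json()
    if build["status"] in ("passed", "failed", "cancelled"):
        break
    time.sleep(10)

print(build["status"], build.get("test_results"))
```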
Entity Data Model

```
REPOSITORIES
├─ repo_id (PK)
├─ repo_url, branch (default main)
├─ webhook_secret
├─ created_at

BUILDS
├─ build_id (ULID, PK, sortable)
├─ repo_id (FK)
├─ commit_sha (full 40-char SHA, indexed)
├─ commit_message
├─ author (committer info)
├─ status (queued, building, passed, failed, cancelled)
├─ trigger_type (push, manual, scheduled)
├─ started_at, ended_at
├─ duration (seconds)
├─ log_url (cloud storage path)
├─ docker_image_uri
├─ test_coverage (%)
├─ errors (array of error messages)
├─ created_at

BUILD_STAGES
├─ stage_id (PK)
├─ build_id (FK)
├─ stage_name (compile, unit_test, lint, integration_test, docker_build)
├─ status (queued, running, passed, failed)
├─ started_at, ended_at
├─ duration (seconds)

DEPLOYMENTS
├─ deployment_id (ULID, PK)
├─ build_id (FK)
├─ initiator_id (FK -> users)
├─ source_version, target_version
├─ target_environment (staging, production)
├─ strategy (canary, blue-green, rolling)
├─ status (in_progress, completed, failed, rolled_back)
├─ canary_percentage (% traffic for canary)
├─ approval_status (pending, approved, rejected)
├─ approver_id (FK, nullable)
├─ approved_at
├─ started_at, ended_at
├─ created_at

DEPLOYMENT_HEALTH
├─ health_id (PK)
├─ deployment_id (FK)
├─ metric_name (error_rate, latency_p99, pod_restart_count)
├─ baseline_value (previous version)
├─ current_value
├─ threshold_alert
├─ checked_at

FEATURE_FLAGS
├─ flag_id (ULID, PK)
├─ name (unique)
├─ enabled (boolean)
├─ rollout_percentage (0-100)
├─ target_users [user_ids] (array, for whitelist)
├─ created_at, updated_at

ARTIFACTS
├─ artifact_id (PK)
├─ build_id (FK)
├─ artifact_type (docker_image, jar, zip, etc.)
├─ uri (S3/registry path)
├─ file_size
├─ checksum (SHA256)
├─ created_at
```
High-Level Architecture

```mermaid
graph TB
    DEV["👨‍💻 Developer<br/>(git push)"]
    GITHUB["GitHub/GitLab<br/>(repository)"]
    WEBHOOK["Webhook Server<br/>(event listener)"]
    QUEUE["Job Queue<br/>(build tasks)"]
    EXECUTOR["Build Executor<br/>(agents, K8s)"]
    TESTS["Test Runner<br/>(unit, integration,<br/>coverage)"]
    REGISTRY["Container Registry<br/>(Docker Hub,<br/>ECR, GCR)"]
    ARTIFACT["Artifact Storage<br/>(S3, GCS)"]
    CACHE["Build Cache<br/>(Redis, layer cache)"]
    APPROVAL["Approval Gate<br/>(manual approval)"]
    K8S["Kubernetes Cluster<br/>(staging, prod)"]
    MONITORING["Monitoring<br/>(Prometheus, logs)"]
    ROLLBACK["Rollback Service<br/>(revert to prev)"]

    DEV --> GITHUB
    GITHUB -->|webhook trigger| WEBHOOK
    WEBHOOK --> QUEUE
    QUEUE --> EXECUTOR
    EXECUTOR --> TESTS
    TESTS --> REGISTRY
    EXECUTOR --> CACHE
    EXECUTOR --> ARTIFACT
    REGISTRY --> APPROVAL
    APPROVAL --> K8S
    K8S --> MONITORING
    MONITORING --> ROLLBACK
```
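Before anything reaches the job queue, the webhook server should verify the event signature. A minimal sketch of GitHub-style verification: GitHub signs the raw request body with the repo's `webhook_secret` and sends the HMAC-SHA256 digest in the `X-Hub-Signature-256` header. The Flask wiring and the `enqueue_build` producer are illustrative assumptions:

```python
import hmac
import hashlib
from flask import Flask, request, abort  # pip install flask

app = Flask(__name__)
WEBHOOK_SECRET = b"stored-per-repo-secret"  # from REPOSITORIES.webhook_secret

@app.route("/webhooks/github", methods=["POST"])
def github_webhook():
    # GitHub sends "sha256=<hexdigest>" computed over the raw request body.
    signature = request.headers.get("X-Hub-Signature-256", "")
    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET, request.get_data(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)  # reject forged events

    event = request.get_json()
    enqueue_build(  # hypothetical producer onto the job queue
        repo_url=event["repository"]["clone_url"],
        branch=event.get("ref", "").removeprefix("refs/heads/"),
        commit_sha=event["after"],
    )
    return {"status": "queued"}, 202
```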
Step 3: Concurrency, Consistency & Scalability
🔴 Problem: Preventing Concurrent Deployments to the Same Service
Scenario: Two teams deploy different features to the same service simultaneously. The second deployment overwrites the first halfway through, leaving the service in a mixed-version state.
Solution: Distributed Lock per Service

```
Deployment Lock Mechanism:
1. Request lock before deployment:
   LOCK deploy:lock:{service_name}
        owner=deployment_id_123
        TTL=10 minutes
   IF already locked:
       → WAIT in queue, poll every 30 seconds
       → Timeout after 30 min, notify ops
   ELSE:
       → Acquire lock, proceed with deployment

2. During deployment:
   Periodically verify we still own the lock:
       GET deploy:lock:{service_name} == our_deployment_id
   IF lock lost (e.g., TTL expired):
       → ABORT deployment (another deployment may have taken over)
   ELSE:
       → Proceed with new version rollout

3. After deployment completes:
   Release the lock only if we still own it
   (atomic compare-and-delete, not a blind DELETE)
```

Lock acquisition (using Redis):

```
SET deploy:lock:{service_name} deployment_id_123
    NX      -- only if not exists
    EX 600  -- expire after 10 min
→ Returns OK if lock acquired
→ Returns nil if already locked
```
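A minimal sketch of this lock with redis-py, assuming the key layout above; the Lua compare-and-delete prevents releasing a lock that expired and was re-acquired by another deployment:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

# Release only if the stored owner matches: atomic compare-and-delete.
RELEASE_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""
release_script = r.register_script(RELEASE_LUA)

def acquire_deploy_lock(service: str, deployment_id: str, ttl_s: int = 600) -> bool:
    # SET key value NX EX ttl — succeeds only if the key does not exist.
    return bool(r.set(f"deploy:lock:{service}", deployment_id, nx=True, ex=ttl_s))

def release_deploy_lock(service: str, deployment_id: str) -> bool:
    return release_script(keys=[f"deploy:lock:{service}"], args=[deployment_id]) == 1

# Usage
if acquire_deploy_lock("checkout-service", "deploy_123"):
    try:
        pass  # ... roll out the new version ...
    finally:
        release_deploy_lock("checkout-service", "deploy_123")
```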
🟡 Problem: Blue-Green Deployment with Zero Downtime
Scenario: Production is running v123 and we deploy v124. If v124 has bugs, customers see errors, and a rolling redeploy can't revert instantly.
Solution: Blue-Green Deployment Strategy

```
Blue-Green Strategy:

Setup:
  BLUE environment:  [Pod1-v123, Pod2-v123, Pod3-v123] (current, live traffic)
  GREEN environment: [Pod1-v124, Pod2-v124, Pod3-v124] (idle, warming up)
  Load Balancer: routes 100% of traffic → BLUE

Deployment Steps:
1. Deploy to GREEN (no traffic):
   kubectl deploy service:v124 → GREEN cluster
   Verify GREEN health:
     - Run smoke tests
     - Check app startup
     - Verify database migrations (if needed)
   Result: GREEN ready, BLUE still serving all traffic

2. Health check GREEN:
   GET /health  → 200 OK
   GET /metrics → latency < 100ms
   If a health check fails:
     → Keep BLUE running, abort the swap
     → Notify team

3. Instant traffic switch (atomic):
   Load Balancer switch:
     BLUE:  100% → 0%
     GREEN: 0% → 100%
   (single rule update, milliseconds)
   Result: users routed to GREEN instantly

4. Monitor the new version (5 min):
   Track error_rate, latency, throughput
   IF error_rate > 1%:
     → Switch back to BLUE (rollback in <30 sec)
   ELSE:
     → Keep GREEN, declare success

Rollback (instant):
  If v124 has bugs:
    Load Balancer switch:
      GREEN: 100% → 0%
      BLUE:  0% → 100%
  Rollback time: <30 seconds
  Downtime: zero (BLUE pods stayed running and warm)
```
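One common way to make step 3's switch atomic on Kubernetes is to repoint the Service's label selector from the blue Deployment to the green one. A sketch with the official Python client; the `myapp` service name and the `color` label scheme are assumptions:

```python
from kubernetes import client, config  # pip install kubernetes

def switch_traffic(service_name: str, namespace: str, color: str) -> None:
    """Atomically repoint the Service selector to the given color."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    # Strategic-merge patch: one API call flips 100% of traffic.
    patch = {"spec": {"selector": {"app": service_name, "color": color}}}
    v1.patch_namespaced_service(name=service_name, namespace=namespace, body=patch)

# Cut over to green; calling with "blue" again is the instant rollback.
switch_traffic("myapp", "production", "green")
```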
🔷 Problem: Canary Rollout with Progressive Traffic Shift
Scenario: Deploy v124 to 1% of users. If errors spike, auto-rollback; otherwise shift to 10%, then 50%, then 100%.
Solution: Traffic-Based Canary

```
Canary Progression:

Phase 1: 1% traffic (5 min)
  Canary pods: [Pod1-v124] (1 replica)
  Stable pods: [Pod1-v123, Pod2-v123, Pod3-v123]
  LB distributes (weighted routing, independent of replica count):
    - 99% of requests → v123 (stable)
    - 1% of requests  → v124 (canary)
  Metrics collected:
    canary_error_rate = errors_v124 / requests_v124
    stable_error_rate = errors_v123 / requests_v123
    baseline_p99_latency = current p99 of v123

Decision gate at the 5-min mark:
  IF canary_error_rate > stable_error_rate × 2:
    → ABORT (rollback to v123)
    → Alert: "Canary failed, error spike detected"
  ELSE IF canary_error_rate ≤ stable_error_rate:
    → PROCEED to Phase 2 (10% traffic)
  ELSE:
    → HOLD at current phase, extend observation window

Phase 2: 10% traffic (5 min)
  Canary pods: [Pod1-v124, Pod2-v124]
  (same gate as Phase 1)

Phase 3: 50% traffic (10 min)
Phase 4: 100% traffic (complete)
```

Auto-Rollback Triggers:
- Error rate > 2× baseline
- P99 latency > baseline + 50%
- Pod restart rate > normal
- Memory usage > 80% threshold
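The decision gate condenses to a small pure function; the thresholds mirror the triggers above, and the metric inputs are assumed to come from a system like Prometheus:

```python
from enum import Enum

class Verdict(Enum):
    ABORT = "abort"      # roll back to stable
    PROCEED = "proceed"  # advance to the next traffic phase
    HOLD = "hold"        # stay at current phase, keep observing

def canary_gate(
    canary_errors: int, canary_requests: int,
    stable_errors: int, stable_requests: int,
    canary_p99_ms: float, baseline_p99_ms: float,
) -> Verdict:
    # Guard against divide-by-zero early in a phase with little traffic.
    if canary_requests == 0:
        return Verdict.HOLD
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / max(stable_requests, 1)

    # Auto-rollback triggers: 2× error baseline or +50% p99 latency.
    if canary_rate > 2 * stable_rate or canary_p99_ms > baseline_p99_ms * 1.5:
        return Verdict.ABORT
    if canary_rate <= stable_rate:
        return Verdict.PROCEED
    return Verdict.HOLD

# Example: 3 errors in 120 canary requests vs 40 in 12,000 stable requests.
print(canary_gate(3, 120, 40, 12_000, canary_p99_ms=180, baseline_p99_ms=150))
# → Verdict.ABORT (2.5% vs 0.33% error rate)
```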
Step 4: Persistence Layer, Caching & Monitoring
Database Design
```sql
CREATE TABLE repositories (
    repo_id        BIGSERIAL PRIMARY KEY,
    repo_url       VARCHAR(255) UNIQUE,
    branch         VARCHAR(100) DEFAULT 'main',
    webhook_secret VARCHAR(255),
    created_at     TIMESTAMP DEFAULT NOW()
);

-- ULIDs (26-char Crockford base32) stay sortable by creation time.
CREATE TABLE builds (
    build_id         CHAR(26) PRIMARY KEY,
    repo_id          BIGINT NOT NULL REFERENCES repositories(repo_id),
    commit_sha       VARCHAR(40),
    commit_message   TEXT,
    author           VARCHAR(100),
    status           VARCHAR(50),
    trigger_type     VARCHAR(50),
    started_at       TIMESTAMP,
    ended_at         TIMESTAMP,
    docker_image_uri TEXT,
    test_coverage    DECIMAL(5,2),
    created_at       TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_builds_repo_status ON builds(repo_id, status);
CREATE INDEX idx_builds_commit_sha ON builds(commit_sha);

CREATE TABLE deployments (
    deployment_id      CHAR(26) PRIMARY KEY,
    build_id           CHAR(26) NOT NULL REFERENCES builds(build_id),
    initiator_id       BIGINT,
    source_version     VARCHAR(50),
    target_version     VARCHAR(50),
    target_environment VARCHAR(50),
    strategy           VARCHAR(50),
    status             VARCHAR(50),
    approval_status    VARCHAR(50),
    started_at         TIMESTAMP,
    ended_at           TIMESTAMP,
    created_at         TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_deployments_status_created
    ON deployments(status, created_at DESC);

CREATE TABLE feature_flags (
    flag_id            CHAR(26) PRIMARY KEY,
    name               VARCHAR(100) UNIQUE,
    enabled            BOOLEAN DEFAULT FALSE,
    rollout_percentage INT CHECK (rollout_percentage BETWEEN 0 AND 100),
    target_users       BIGINT[],
    created_at         TIMESTAMP DEFAULT NOW(),
    updated_at         TIMESTAMP DEFAULT NOW()
);
```
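A brief sketch of how the `idx_deployments_status_created` index gets used, e.g. listing recent failed deployments (the DSN is a placeholder):

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=cicd user=pipeline")  # placeholder DSN

# The (status, created_at DESC) index serves this filter-and-sort directly.
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT deployment_id, build_id, target_environment, created_at
        FROM deployments
        WHERE status = %s
        ORDER BY created_at DESC
        LIMIT 20
        """,
        ("failed",),
    )
    for deployment_id, build_id, env, created_at in cur.fetchall():
        print(f"{created_at} {env}: deployment {deployment_id} (build {build_id})")
```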
Caching & Monitoring

Redis Caching:

```
1. Build cache (Docker layer cache index)
   Key: "build:cache:{repo_id}:{branch}"
   Value: layer digests / cache metadata (layer blobs live in the registry or S3)
   TTL: 7 days

2. Feature flag status
   Key: "feature_flag:{flag_name}"
   Value: {enabled, rollout_percentage}
   TTL: 1 minute

3. Deployment locks
   Key: "deploy:lock:{service_name}"
   Value: {deployment_id, started_at}
   TTL: 10 minutes (auto-release)
```
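Percentage rollout needs a deterministic bucket per user so the same user sees the same variant on every request. A common sketch: hash `flag_name:user_id` into 100 buckets, with the cached flag payload above supplying `rollout_percentage`:

```python
import hashlib

def flag_enabled(flag_name: str, user_id: int,
                 rollout_percentage: int, target_users: set[int]) -> bool:
    # Whitelisted users always get the feature.
    if user_id in target_users:
        return True
    # Deterministic bucket 0-99: stable across requests and servers.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Users enabled at 10% stay enabled as rollout grows to 50% and 100%.
    return bucket < rollout_percentage

print(flag_enabled("new-checkout", 42, rollout_percentage=10, target_users=set()))
```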
Monitoring Alerts:
- Build time increasing (>15 min average)
- Test failure rate spike
- Deployment rollback rate > 1%
- Container registry push latency > 30s
- Canary error rate > 2× baseline
⚡ Quick Reference Cheat Sheet
Tech Stack
- CI/CD Platform: Jenkins, GitLab CI, GitHub Actions
- Job Executor: Kubernetes, container-based runners
- Container Registry: Docker Hub, ECR, GCR, Harbor
- Artifact Storage: S3, GCS, Artifactory
- Orchestration: Kubernetes (Helm for configs)
- Monitoring: Prometheus, Datadog
Critical Decisions
- Per-service deployment lock — Only one rollout per service at a time
- Blue-green deployment — Instant rollback (<30 sec)
- Canary with auto-abort — 1% → 10% → 50% → 100%
- Immutable artifacts — Same code = same Docker image
- Feature flags — Separate feature rollout from code deployment
🎯 Interview Summary (5 Minutes)
- Distributed lock — Prevents concurrent deployments via Redis NX
- Blue-green strategy — Two environments, instant switch (<30s), zero downtime
- Canary rollout — Progressive traffic 1% → 100%, auto-abort on errors
- Immutable builds — Same commit SHA = reproducible artifact
- Automated tests — Unit, integration, smoke tests catch regressions early
- Feature flags — Decouple deployment from rollout
- Monitoring gates — Auto-rollback on error rate / latency spike