System Design Interview Cheatsheet
Quick reference for system design interviews - executive summaries only.
Core Concepts Quick Reference
| Concept | One-Liner | When to Use |
|---|---|---|
| Horizontal Scaling | Add more machines | When vertical limits reached |
| Vertical Scaling | Add more resources to machine | Simple cases, quick fix |
| Load Balancer | Distribute traffic across servers | Multiple servers |
| CDN | Cache content at edge locations | Static content, global users |
| Cache | Store frequently accessed data in memory | Read-heavy workloads |
| Message Queue | Async communication buffer | Decouple services |
| Database Replication | Copy data across nodes | High availability, read scaling |
| Database Sharding | Split data across nodes | Write scaling, large datasets |
Database Selection Guide
| Use Case | Recommended | Why |
|---|---|---|
| Transactions, ACID | PostgreSQL, MySQL | ACID compliance |
| Document storage | MongoDB | Flexible schema |
| Wide-column, time-series | Cassandra, HBase | Write-heavy, scale |
| Key-Value cache | Redis, Memcached | Fast lookups |
| Graph relationships | Neo4j | Complex relationships |
| Search | Elasticsearch | Full-text search |
| Analytics (OLAP) | ClickHouse, BigQuery | Columnar, aggregations |
| Blob storage | S3, GCS | Large files |
Consistency vs Availability
| Choose Consistency When | Choose Availability When |
|---|---|
| Financial transactions | Social media feeds |
| Inventory management | User preferences |
| Booking systems | Analytics dashboards |
| Leader election | Metrics collection |
| Distributed locks | Shopping cart |
Common Patterns Quick Reference
Caching Patterns
| Pattern | Write | Read | Best For |
|---|---|---|---|
| Cache-Aside | App writes to DB | Check cache, fallback to DB | General purpose |
| Read-Through | N/A | Cache fetches from DB | Simple reads |
| Write-Through | Write cache + DB sync | From cache | Consistency needed |
| Write-Behind | Write cache, async DB | From cache | Write-heavy |
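The cache-aside row can be sketched as follows. This is a minimal illustration, not a production client: the `CacheAside` class, its in-memory dict standing in for Redis or Memcached, and the `load_from_db` / `write_to_db` callbacks are all hypothetical names introduced here. It also assumes invalidate-on-write, which is one common choice rather than the only one.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside: the application, not the cache, talks to the database."""

    def __init__(self,
                 load_from_db: Callable[[str], Optional[str]],
                 write_to_db: Callable[[str, str], None]) -> None:
        self._cache: Dict[str, str] = {}    # assumption: stand-in for Redis/Memcached
        self._load_from_db = load_from_db
        self._write_to_db = write_to_db

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:              # cache hit
            return self._cache[key]
        value = self._load_from_db(key)     # cache miss: fall back to the DB
        if value is not None:
            self._cache[key] = value        # populate the cache for later reads
        return value

    def put(self, key: str, value: str) -> None:
        self._write_to_db(key, value)       # the write goes to the DB
        self._cache.pop(key, None)          # invalidate so the next read reloads
```

Invalidating instead of updating the cached entry on write keeps the sketch simple; a real deployment would usually also set a TTL on each entry.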
Rate Limiting Algorithms
| Algorithm | Description | Pros/Cons |
|---|---|---|
| Token Bucket | Tokens refill at rate, requests consume tokens | Allows bursts |
| Leaky Bucket | Fixed outflow rate | Smooth output |
| Fixed Window | Counter per time window | Simple, edge spikes |
| Sliding Window | Moving window counter | Accurate, more memory |
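As an illustration of the token bucket row, here is a minimal single-process sketch. The `TokenBucket` class and its `rate`, `capacity`, and `clock` parameters are names chosen for this example; a distributed limiter would keep the counters in a shared store instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; requests spend tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock                 # injectable so tests can control time
        self._tokens = capacity             # start full, which permits an initial burst
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill lazily based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `rate=1.0` and `capacity=2.0`, two back-to-back requests pass, a third is rejected, and one more is admitted after a second has elapsed. This is the burst tolerance the table refers to.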
ID Generation
| Method | Size | Sortable | Coordination |
|---|---|---|---|
| UUID v4 | 128 bit | No | None |
| Snowflake | 64 bit | Yes | Machine ID |
| ULID | 128 bit | Yes | None |
| Auto-increment | Variable | Yes | Required |
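A Snowflake-style generator can be sketched as below. The bit layout (41-bit millisecond timestamp, 10-bit machine ID, 12-bit sequence) follows the commonly described scheme; the `EPOCH_MS` value and the `SnowflakeGenerator` class name are assumptions made for this example, and the clock-rollback handling is deliberately simplistic.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000   # assumption: custom epoch of 2020-01-01 UTC
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """64-bit, roughly time-sortable IDs; only the machine ID needs coordination."""

    def __init__(self, machine_id: int) -> None:
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                now = self._last_ms          # clock went backwards: stay on last ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:      # 4096 IDs used up in this millisecond
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return ((timestamp << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self._machine_id << SEQUENCE_BITS)
                    | self._sequence)
```

IDs from one generator are strictly increasing; IDs from different machines sort by timestamp first, so ordering across machines is only as good as clock synchronization.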
Data Structures for Scale
| Structure | Purpose | Use Case |
|---|---|---|
| Bloom Filter | Probabilistic set membership (false positives possible, no false negatives) | Cache lookup, spam filter |
| HyperLogLog | Approximate count of unique items | Unique visitors |
| Count-Min Sketch | Frequency estimation | Top-K, heavy hitters |
| Consistent Hash | Distribute keys across nodes with minimal remapping | Distributed cache |
| Merkle Tree | Detect data differences | Anti-entropy sync |
| LSM Tree | Write-optimized storage | Write-heavy DBs |
| B+ Tree | Read-optimized index | Database indexes |
| Skip List | Sorted data structure | Redis sorted sets |
| Trie | Prefix lookups | Autocomplete |
| Geohash/S2 | Location encoding | Proximity search |
| Inverted Index | Text to documents | Full-text search |
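To make the consistent hash row concrete, here is a small ring with virtual nodes. It is a sketch under assumptions: the `ConsistentHashRing` class, the choice of SHA-256 truncated to 64 bits, and the default of 100 virtual nodes per server are all illustrative choices, not a reference implementation.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    # Assumption: any stable, well-mixed hash works; SHA-256 truncated to 64 bits here.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, vnodes: int = 100) -> None:
        self._vnodes = vnodes               # virtual nodes smooth out the distribution
        self._ring: List[int] = []          # sorted hash points on the ring
        self._owner: Dict[int, str] = {}    # hash point -> physical node

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._ring, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        self._ring = [p for p in self._ring if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]
```

The property that matters in an interview: when a node joins, the only keys that move are the ones the new node takes over, whereas `hash(key) % N` remaps most keys whenever `N` changes.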
Communication Patterns
| Protocol | Use When | Latency | Complexity |
|---|---|---|---|
| REST/HTTP | Standard APIs | Medium | Low |
| gRPC | Microservices | Low | Medium |
| WebSocket | Real-time bidirectional | Low | Medium |
| SSE | Server push | Low | Low |
| Message Queue | Async, decoupling | Variable | Medium |
| GraphQL | Flexible queries | Medium | High |
Scaling Numbers
QPS Guidelines
| Service Type | Typical QPS/Server |
|---|---|
| Web server | 1K-10K |
| Database (simple) | 1K-5K |
| Cache (Redis) | 100K+ |
| Load Balancer | 100K+ |
Storage Rules of Thumb
- Text tweet (140 chars): ~280 bytes
- User record: ~1 KB
- Image thumbnail: ~20 KB
- HD Image: ~2 MB
- 1 minute video (compressed): ~10 MB
Time Estimates
- Seconds in day: 86,400 ≈ 100K
- Seconds in month: 2.6M ≈ 3M
- Seconds in year: 31.5M ≈ 30M
Reliability Patterns
| Pattern | Purpose | Implementation |
|---|---|---|
| Circuit Breaker | Fail fast, prevent cascade | Hystrix, Resilience4j |
| Retry with Backoff | Handle transient failures | Exponential backoff |
| Bulkhead | Isolate failures | Thread/connection pools |
| Timeout | Don’t wait forever | Set aggressive timeouts |
| Fallback | Graceful degradation | Default/cached values |
| Health Check | Detect failures | /health endpoint |
| Idempotency | Safe retries | Idempotency keys |
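The retry-with-backoff row might look like the sketch below. The function name `retry_with_backoff`, its parameters, and the use of full jitter (a random delay between zero and the exponential cap) are choices made for this example. Real code would normally catch only errors known to be transient, and the wrapped operation should be idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(op: Callable[[], T],
                       max_attempts: int = 5,
                       base_delay: float = 0.1,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op`, retrying on failure with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:                   # assumption: every error is retryable
            if attempt == max_attempts - 1:
                raise                       # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))   # full jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

The jitter matters as much as the exponent: without it, clients that failed together retry together and hit the recovering service in waves.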
Distributed Transactions
| Pattern | Consistency | Complexity | Use Case |
|---|---|---|---|
| 2PC | Strong | High | Short transactions |
| Saga | Eventual | Medium | Long transactions |
| Outbox | Eventual (at-least-once delivery) | Low | DB + Events |
| TCC | Strong | High | Reservations |
Consensus & Coordination
| Tool | Use For |
|---|---|
| Zookeeper | Leader election, config, locks |
| etcd | Kubernetes, config, service discovery |
| Consul | Service mesh, discovery, config |
| Redis | Distributed locks (Redlock) |
Observability Stack
| Layer | Tools |
|---|---|
| Metrics | Prometheus, Grafana, Datadog |
| Logging | ELK Stack, Loki, Splunk |
| Tracing | Jaeger, Zipkin, X-Ray |
| Alerting | PagerDuty, OpsGenie |
Security Checklist
Interview Questions Mapped to Concepts
| Question | Key Concepts |
|---|---|
| Design URL Shortener | Hash, Base62, Cache, Database |
| Design Twitter | Feed fanout, Timeline, Sharding |
| Design Chat System | WebSocket, Presence, Message Queue |
| Design YouTube | CDN, Transcoding, Adaptive Streaming |
| Design Instagram | Object Storage, CDN, Feed |
| Design Uber | Geospatial Index, Matching, Real-time |
| Design Rate Limiter | Token Bucket, Distributed Counter |
| Design Notification | Push, Queue, Priority |
| Design Search | Inverted Index, Ranking, Sharding |
| Design File Storage | Chunks, Replication, Metadata |
High-Level Design Template
┌─────────────────────────────────────────────────────────────┐
│                           Clients                           │
│                     (Web, Mobile, API)                      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                             CDN                             │
│                      (Static content)                       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Load Balancer                        │
│                     (L7, Health checks)                     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         API Gateway                         │
│        (Auth, Rate limit, Routing, SSL termination)         │
└─────────────────────────────────────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         ▼                     ▼                     ▼
 ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
 │   Service A   │     │   Service B   │     │   Service C   │
 └───────────────┘     └───────────────┘     └───────────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         ▼                     ▼                     ▼
 ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
 │     Cache     │     │   Database    │     │ Message Queue │
 │    (Redis)    │     │   (Primary)   │     │    (Kafka)    │
 └───────────────┘     └───────────────┘     └───────────────┘
                               │
                               ▼
                       ┌───────────────┐
                       │   Replicas    │
                       └───────────────┘
QPS
QPS = (DAU × actions_per_user) / seconds_in_day
Peak QPS ≈ 2-3 × average QPS
Storage
Storage = users × data_per_user × retention_period
Growth = new_users_per_day × data_per_user
Bandwidth
Bandwidth = QPS × request_size (inbound)
Bandwidth = QPS × response_size (outbound)
Cache Size
Cache ≈ QPS × cache_entry_size × cache_duration (upper bound: assumes every request caches a distinct entry)
Cache hit ratio target: 80-95%
Servers Needed
Servers = Peak_QPS / QPS_per_server
Add 20-30% for headroom
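The QPS, bandwidth, and server formulas above can be chained into one back-of-envelope helper. This is a rough sketch: the `estimate` function and its parameter names are made up for this example, and the defaults (3× peak factor, 30% headroom) are simply the upper ends of the ranges quoted above.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate(dau: int, actions_per_user: float,
             request_bytes: int, response_bytes: int,
             qps_per_server: int,
             peak_factor: float = 3.0, headroom: float = 0.3) -> dict:
    """Apply the cheatsheet formulas: average QPS, peak QPS, bandwidth, servers."""
    avg_qps = dau * actions_per_user / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    # Size the fleet for peak traffic, then add headroom on top.
    servers = math.ceil(peak_qps / qps_per_server * (1 + headroom))
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "inbound_bytes_per_s": avg_qps * request_bytes,
        "outbound_bytes_per_s": avg_qps * response_bytes,
        "servers": servers,
    }


if __name__ == "__main__":
    result = estimate(dau=10_000_000, actions_per_user=10,
                      request_bytes=1_000, response_bytes=10_000,
                      qps_per_server=1_000)
    for name, value in result.items():
        print(f"{name}: {value:,.0f}")
```

For 10M DAU at 10 actions each, this gives roughly 1.2K average QPS, about 3.5K peak QPS, and 5 servers at 1K QPS per server. In an interview, rounding 86,400 to 100K gives the same order of magnitude with less arithmetic.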
Tradeoff Discussions
| Decision | Option A | Option B |
|---|---|---|
| Consistency vs Availability | CP (consistency) | AP (availability) |
| SQL vs NoSQL | ACID, joins, schema | Scale, flexibility |
| Sync vs Async | Immediate feedback | Decoupling, scale |
| Push vs Pull | Real-time, more connections | Polling, simpler |
| Cache vs Fresh | Speed | Accuracy |
| Monolith vs Microservices | Simple, fast | Scale, teams |
| Embedded vs Managed | Control, cost | Ease, reliability |
Red Flags to Avoid
❌ Single point of failure
❌ No caching strategy
❌ Ignoring data growth
❌ No backup/recovery plan
❌ Tight coupling between services
❌ No rate limiting
❌ Synchronous calls for everything
❌ Not considering peak load
❌ Ignoring security
❌ No monitoring/alerting