06 · Engineering at Scale

Level: Advanced — Organisational & Systems
Pre-reading: 03 · Architect Thinking · 04 · Team Effectiveness

"Scale" in a principal engineering context means two things simultaneously: technical scale (systems handling millions of users and petabytes of data) and organisational scale (hundreds of engineers working on interdependent systems). Both must be reasoned about together.


The Two Dimensions of Scale

```mermaid
graph TD
    Scale --> TS[Technical Scale\nSystems, data, throughput]
    Scale --> OS[Organisational Scale\nEngineers, teams, decisions]
    TS --> TS1[Distributed systems\nCAP, BASE, consistency models]
    TS --> TS2[Data architecture\nStorage, processing, streaming]
    TS --> TS3[Infrastructure\nKubernetes, cloud, networking]
    OS --> OS1[Conway's Law\nOrg structure → system structure]
    OS --> OS2[Decision distribution\nWhere decisions are made]
    OS --> OS3[Knowledge management\nDocumentation, ADRs, RFCs]
```

Distributed Systems Fundamentals

The CAP Theorem

Eric Brewer's CAP Theorem (2000) states that a distributed data system can guarantee at most two of three properties simultaneously:

| Property | Definition |
| --- | --- |
| Consistency (C) | Every read receives the most recent write or an error |
| Availability (A) | Every request receives a (non-error) response — but it might not be the most recent data |
| Partition Tolerance (P) | The system continues operating despite network partitions |

The critical nuance

In any real distributed system, network partitions are not optional — they happen. Therefore, every distributed system must choose P, and the real trade-off is between C and A. The question is: when a network partition occurs, do you return an error (C) or a potentially stale result (A)?

| System Type | CAP Choice | Example |
| --- | --- | --- |
| Traditional RDBMS | CA (single node) | PostgreSQL, MySQL (no partition tolerance) |
| Distributed relational | CP | Google Spanner, CockroachDB |
| Key-value store | AP | Cassandra, DynamoDB (default), Riak |
| Document store | CP or AP (configurable) | MongoDB |
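To make the trade-off concrete, here is a minimal toy sketch of the choice a single replica faces during a partition: a CP system refuses reads it cannot verify, while an AP system serves its possibly stale local copy. All names (`Replica`, `Mode`, `PartitionedException`) are illustrative, not a real API.

```java
// Toy model of the C-vs-A choice a replica makes during a network partition.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

enum Mode { CONSISTENT, AVAILABLE }

class PartitionedException extends RuntimeException {
    PartitionedException(String message) { super(message); }
}

class Replica {
    private final Map<String, String> localCopy = new ConcurrentHashMap<>();
    private volatile boolean partitioned = false;   // flipped by a failure detector
    private final Mode mode;

    Replica(Mode mode) { this.mode = mode; }

    void onPartitionChange(boolean detected) { partitioned = detected; }

    String read(String key) {
        if (partitioned && mode == Mode.CONSISTENT) {
            // CP: refuse to answer rather than risk serving a stale value.
            throw new PartitionedException("cannot confirm latest value for " + key);
        }
        // AP (or a healthy network): serve the local copy, which may be stale.
        return localCopy.get(key);
    }
}
```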

BASE vs. ACID

| ACID (Traditional) | BASE (Distributed) |
| --- | --- |
| Atomic | Basically Available |
| Consistent | Soft state |
| Isolated | Eventually consistent |
| Durable | — |

(The acronyms do not align letter-for-letter: BASE expands to Basically Available, Soft state, Eventually consistent.)

Eventual consistency means: given no new updates, all replicas will eventually converge to the same value. The question architects must answer is: how long is "eventually" acceptable for this use case?

| Use case | Consistency model required |
| --- | --- |
| Bank account balance | Strong consistency (ACID) |
| Social media "like" count | Eventual consistency |
| Shopping cart | Session consistency (user sees their own writes) |
| Recommendation engine | Eventual consistency |
| Inventory reservation | Strong consistency (or optimistic locking) |
| Distributed configuration | Eventual consistency with low-latency convergence |
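Session consistency from the table above is often implemented at the routing layer. A minimal sketch, assuming asynchronous primary-to-replica replication and a known worst-case lag budget; all names and the 2-second budget are illustrative assumptions:

```java
// Read-your-writes ("session") consistency: after a user writes, route that user's
// reads to the primary for a grace period longer than the worst expected lag.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SessionConsistentRouter {
    private static final long REPLICATION_LAG_BUDGET_MS = 2_000;
    private final Map<String, Long> lastWriteAt = new ConcurrentHashMap<>();

    void recordWrite(String userId) {
        lastWriteAt.put(userId, System.currentTimeMillis());
    }

    /** Returns "primary" while the user's own write may not have replicated yet. */
    String chooseReadTarget(String userId) {
        Long t = lastWriteAt.get(userId);
        boolean recentWrite = t != null
                && System.currentTimeMillis() - t < REPLICATION_LAG_BUDGET_MS;
        return recentWrite ? "primary" : "replica";
    }
}
```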

The Eight Fallacies of Distributed Computing

L. Peter Deutsch (Sun Microsystems, 1994) catalogued the false assumptions engineers routinely make when building distributed systems (James Gosling later added the eighth):

| # | Fallacy | Reality |
| --- | --- | --- |
| 1 | The network is reliable | Networks fail: packets drop, connections reset, routing changes |
| 2 | Latency is zero | Every network hop adds latency; cross-region round trips are 100 ms+ |
| 3 | Bandwidth is infinite | Serialisation, payload size, and network throughput all cost |
| 4 | The network is secure | Assume a hostile network; encrypt in transit, authenticate everything |
| 5 | Topology doesn't change | Services move, IPs change, Kubernetes restarts pods |
| 6 | There is one administrator | In the cloud, many teams own different parts of the network |
| 7 | Transport cost is zero | Serialisation, compression, and network calls all have CPU and time cost |
| 8 | The network is homogeneous | Mixed protocols, versions, and data centres are normal |
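Designing against fallacies 1 and 2 in practice means bounded timeouts plus capped, jittered retries. A minimal sketch using the JDK's built-in HttpClient (Java 11+); the timeout values and error handling are deliberately simplified assumptions:

```java
// A call that assumes the network WILL fail and latency is NOT zero.
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

class ResilientCall {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(500))    // fallacy 2: latency is not zero
            .build();

    static String fetchWithRetry(URI uri, int maxAttempts) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(2))         // per-attempt deadline
                .GET()
                .build();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (IOException e) {                   // fallacy 1: the network fails
                last = e;
                long backoff = Math.min(100L << attempt, 2_000);   // capped exponential
                // Jitter spreads out retries so clients don't stampede in lockstep.
                Thread.sleep(backoff / 2 + ThreadLocalRandom.current().nextLong(backoff / 2));
            }
        }
        throw last;                                     // surface the failure; don't hang
    }
}
```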

Interview application

When designing a distributed system in an interview, explicitly call out which fallacies you are designing against. This signals architectural maturity.


Scalability Patterns

Horizontal vs. Vertical Scaling

| Approach | Mechanism | Trade-offs |
| --- | --- | --- |
| Vertical (Scale-Up) | Add more CPU/RAM to existing machines | Simple; no code change; hard ceiling; single point of failure |
| Horizontal (Scale-Out) | Add more machines; distribute load | No hard ceiling; requires stateless services; more operational complexity |

A principal engineer designs for horizontal scalability by default — vertical scaling is a stopgap, not an architecture.

Stateless vs. Stateful Services

| | Stateless | Stateful |
| --- | --- | --- |
| Definition | Each request is fully self-contained | Server maintains data between requests |
| Scalability | Trivially horizontal — any instance handles any request | Hard — requires affinity or shared external state |
| Resilience | Instance failure is transparent | Instance failure requires recovery/handover |
| Design principle | Move state to external store (database, cache, queue) | Necessary for streams, leader election, sessions |

The key principle: push state down to dedicated state stores (databases, caches, message queues) and keep application services stateless.
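A minimal sketch of that principle: the handler below keeps no per-user state on the instance itself and delegates session data to an external store, so any instance can serve any request and instances can be added or killed freely. The `SessionStore` interface is an assumption (in practice it might be backed by Redis or a database).

```java
// Stateless service instance: all per-user state lives in an external store.
interface SessionStore {                        // assumed; e.g. backed by Redis
    String get(String sessionId, String key);
    void put(String sessionId, String key, String value);
}

class CartHandler {
    private final SessionStore sessions;        // injected; shared across instances

    CartHandler(SessionStore sessions) { this.sessions = sessions; }

    void addItem(String sessionId, String itemId) {
        String cart = sessions.get(sessionId, "cart");
        String updated = (cart == null || cart.isEmpty()) ? itemId : cart + "," + itemId;
        sessions.put(sessionId, "cart", updated);   // no field on this object changes
    }
}
```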

Caching — The Architect's Primary Scaling Tool

Caching is the most impactful single scaling tool available. The art is in knowing what to cache, where, and for how long.

| Cache Level | Technology | Latency | Best for |
| --- | --- | --- | --- |
| CPU L1/L2/L3 | Hardware | Nanoseconds | Handled by the JVM / OS |
| In-process | ConcurrentHashMap, Guava Cache | Microseconds | Computed values, short-lived config |
| Distributed | Redis, Memcached | < 1 ms | Session data, hot query results, rate limiting |
| CDN | CloudFront, Fastly | < 10 ms (edge) | Static assets, cacheable API responses |
| Database query cache | PostgreSQL, MySQL | Varies | Repeated identical queries |

Cache Invalidation Strategies

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

| Strategy | How | Trade-off |
| --- | --- | --- |
| TTL-based expiry | Cache entries expire after N seconds | Simple; allows brief staleness |
| Write-through | Write to cache and DB simultaneously | Always consistent; write latency doubles |
| Write-behind (write-back) | Write to cache; async flush to DB | Lower write latency; risk of data loss |
| Cache-aside (lazy loading) | On miss, application loads from DB and populates cache (sketched below) | Simple; initial miss is slow; thundering-herd risk |
| Event-driven invalidation | DB write event triggers cache eviction | Consistent; more complex infrastructure |
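Cache-aside is the most common of these strategies to hand-roll. A minimal single-node sketch with TTL expiry, using only the JDK; the loader function and TTL value are assumptions standing in for a real data source:

```java
// Cache-aside with TTL: serve hits from memory, load misses from the source of truth.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class CacheAside<K, V> {
    private record Entry<T>(T value, long expiresAtMs) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;    // assumed: e.g. a database query
    private final long ttlMs;

    CacheAside(Function<K, V> loader, long ttlMs) {
        this.loader = loader;
        this.ttlMs = ttlMs;
    }

    V get(K key) {
        Entry<V> e = cache.get(key);
        if (e != null && System.currentTimeMillis() < e.expiresAtMs()) {
            return e.value();                       // hit: serve from cache
        }
        V fresh = loader.apply(key);                // miss or expired: load from source
        cache.put(key, new Entry<>(fresh, System.currentTimeMillis() + ttlMs));
        return fresh;
    }

    void invalidate(K key) { cache.remove(key); }   // e.g. on a DB write event
}
```

Note the thundering-herd risk from the table: production code would collapse concurrent misses for the same key (for example with a per-key lock, or a library such as Caffeine) rather than let every caller hit the database at once.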

Database Scaling Patterns

| Pattern | Mechanism | Trade-off |
| --- | --- | --- |
| Read replicas | Primary handles writes; replicas serve reads | Replication lag — replicas may be slightly stale |
| Sharding (horizontal partitioning) | Partition rows across multiple DB instances by shard key | Cross-shard queries are complex; shard-key design is critical |
| Vertical partitioning | Split columns across tables/databases | Reduces row width; joins become cross-database |
| CQRS | Separate read model from write model | Eventual consistency between models; powerful for complex queries |
| Connection pooling (PgBouncer, HikariCP) | Pool DB connections to amortise connection cost | Pool sizing must be tuned; correctly configured, it prevents DB overload |
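Shard routing is usually a small, deterministic function of the shard key. A sketch using a stable non-cryptographic hash and a fixed shard count; production systems often prefer consistent hashing or a directory service so shards can be added without mass re-mapping:

```java
// Deterministic shard-key routing over N fixed shards.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) { this.shardCount = shardCount; }

    /** Maps a shard key (e.g. a customer ID) to a shard index deterministically. */
    int shardFor(String shardKey) {
        CRC32 crc = new CRC32();                          // stable, non-crypto hash
        crc.update(shardKey.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount);
    }
}
// Usage: new ShardRouter(16).shardFor("customer-42") → index of the owning DB instance.
```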

Rate Limiting and Throttling

| Pattern | Description | Characteristics |
| --- | --- | --- |
| Token Bucket | N tokens added per second; each request consumes one | Allows bursts up to bucket size (sketched below) |
| Leaky Bucket | Requests enter a bucket; processed at constant rate | Smooth output rate; excess discarded or queued |
| Fixed Window Counter | Count requests per time window | Simple; susceptible to bursts at window boundaries |
| Sliding Window Log | Track timestamps of all requests in window | Accurate; memory-intensive at high scale |
| Sliding Window Counter | Weighted combination of two fixed windows | Good approximation; memory-efficient |
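The token bucket is small enough to sketch in full. A single-node version with illustrative capacity and refill values; a distributed limiter would keep this state in a shared store such as Redis instead:

```java
// Single-node token bucket: refill continuously, spend one token per request.
class TokenBucket {
    private final double capacity;         // max burst size
    private final double refillPerSecond;  // sustained rate
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;                 // one token per request
            return true;
        }
        return false;                      // caller should reject or queue (e.g. HTTP 429)
    }
}
```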

API Design at Scale

Versioning Strategy

| Strategy | Example | Trade-off |
| --- | --- | --- |
| URI versioning | /api/v1/orders | Simple; cacheable; pollutes URL |
| Header versioning | Accept: application/vnd.company.v2+json | Clean URIs; harder to test in browser |
| Query param | /orders?version=2 | Simple; caching complications |
| Content negotiation | Accept header (standard) | RESTful; requires client sophistication |
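Header versioning from the table above typically reduces to extracting the version from the vendor media type. A sketch; the media-type shape follows the example above, while the regex and the default-to-v1 fallback are assumptions:

```java
// Resolve the requested API version from an Accept header.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApiVersionResolver {
    private static final Pattern VENDOR_VERSION =
            Pattern.compile("application/vnd\\.company\\.v(\\d+)\\+json");

    /** Returns the requested API version, defaulting to 1 when unspecified. */
    static int resolve(String acceptHeader) {
        if (acceptHeader == null) return 1;
        Matcher m = VENDOR_VERSION.matcher(acceptHeader);
        return m.find() ? Integer.parseInt(m.group(1)) : 1;
    }
}
// Usage: resolve("application/vnd.company.v2+json") → 2
```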

Backward Compatibility Rules (Postel's Law)

"Be conservative in what you send, liberal in what you accept."

  • Safe changes: adding new optional fields, new endpoints, new enum values (provided clients ignore unknown values)
  • Breaking changes: changing a field's type, removing a field, adding a required field, changing the URL structure
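On the "liberal in what you accept" side, a tolerant reader ignores unknown fields, so the server can add optional fields without breaking existing clients. A sketch using Jackson (assuming 2.12+, which supports records); the DTO fields and sample payload are illustrative:

```java
// Tolerant reader: unknown JSON fields are ignored rather than treated as errors.
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

@JsonIgnoreProperties(ignoreUnknown = true)   // new server fields won't break this client
record OrderDto(String id, String status) {}

class OrderClient {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static OrderDto parse(String json) throws Exception {
        // {"id":"o-1","status":"SHIPPED","newOptionalField":42} still parses fine.
        return MAPPER.readValue(json, OrderDto.class);
    }
}
```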

Organisational Scale: Governance Without Bureaucracy

The Decision Distribution Problem

As organisations grow, the question becomes: who makes which decisions?

| Decision type | Who decides | Governance mechanism |
| --- | --- | --- |
| Reversible, local (e.g., internal algorithm) | Team | None needed — team autonomy |
| Reversible, cross-team (e.g., shared library version) | Teams + Principal | RFC with short review window |
| Irreversible, local (e.g., data schema) | Team + Architect | ADR required; review optional |
| Irreversible, cross-team (e.g., API contract change) | Architect + Principal | ADR + RFC + formal approval |
| Irreversible, org-wide (e.g., platform migration) | CTO + Principal + Architects | Architecture Board decision |

Jeff Bezos's Type 1 / Type 2 decision framework:

  • Type 1: Irreversible — "walk through a one-way door". Requires careful deliberation.
  • Type 2: Reversible — "walk through a two-way door". Make it fast; adjust if wrong.

The governance anti-pattern

The most common governance failure is treating Type 2 decisions like Type 1 — creating heavy process for reversible decisions and grinding velocity to a halt. Principal Engineers must identify and protect Type 2 decision-making speed.


Twelve-Factor App Principles at Scale

The 12-Factor App methodology (Heroku, 2011) defines 12 principles for building cloud-native, scalable applications. Principal Engineers are expected to have mastered and applied all 12.

| Factor | Principle | Why it matters at scale |
| --- | --- | --- |
| I. Codebase | One codebase, tracked in version control, many deploys | Enables a consistent deployment pipeline |
| II. Dependencies | Explicitly declare and isolate dependencies | No implicit system-level dependencies |
| III. Config | Store config in environment variables | Secrets never in code; environment-aware apps |
| IV. Backing services | Treat backing services as attached resources | Swap DB, queue, cache without code changes |
| V. Build/Release/Run | Strictly separate build, release, and run stages | Enables blue-green, canary deploys |
| VI. Processes | Execute the app as stateless processes | Horizontal scalability |
| VII. Port binding | Export services via port binding | Service-to-service calls, container networking |
| VIII. Concurrency | Scale out via the process model | Horizontal processes, not threads |
| IX. Disposability | Fast startup, graceful shutdown | Pod eviction, rolling deploys |
| X. Dev/Prod parity | Keep dev, staging, and prod as similar as possible | Reduces configuration-drift bugs |
| XI. Logs | Treat logs as event streams | Centralised log aggregation |
| XII. Admin processes | Run admin tasks as one-off processes | Database migrations, data fixes |
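As one illustration, Factor III in code: the same build artefact runs in any environment because connection details come from the environment, not from the codebase. A minimal sketch; the variable names (DATABASE_URL, CACHE_URL) are conventional examples, not a standard:

```java
// Factor III: read config from the environment; fail fast and loudly if it's missing.
class AppConfig {
    final String databaseUrl;
    final String cacheUrl;

    AppConfig() {
        this.databaseUrl = require("DATABASE_URL");
        this.cacheUrl = require("CACHE_URL");
    }

    private static String require(String name) {
        String value = System.getenv(name);
        if (value == null || value.isBlank()) {
            // Failing at startup complements Factor IX: quick, visible crashes beat
            // half-configured processes limping along in production.
            throw new IllegalStateException("Missing required env var: " + name);
        }
        return value;
    }
}
```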