09 · Sharding & Partitioning — Split Data Across Multiple Machines
Scaling Writes · Topic 9 of 10
The Problem
A single node has finite storage and write throughput. Sharding (horizontal partitioning) splits data across multiple nodes so each node owns a subset.
Partitioning Strategies
Range Partitioning
Data is divided into contiguous key ranges, with each range assigned to a shard.
- ✅ Range queries efficient
- ❌ Hotspots if data is skewed (e.g., sequential IDs all land on one shard)
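A minimal sketch of range-based routing, assuming three hypothetical shards and two split points; Python's `bisect` finds which contiguous range a key falls into:

```python
import bisect

# Hypothetical boundaries: keys < "g" -> shard-0, "g" <= key < "n" -> shard-1, rest -> shard-2
SPLIT_POINTS = ["g", "n"]
SHARDS = ["shard-0", "shard-1", "shard-2"]

def route_by_range(key: str) -> str:
    """Return the shard that owns the contiguous key range containing `key`."""
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, key)]

print(route_by_range("alice"))    # shard-0
print(route_by_range("mallory"))  # shard-1
print(route_by_range("zoe"))      # shard-2
```

Because adjacent keys live on the same shard, a range scan (say, all keys between "a" and "f") touches only one shard; but sequential keys such as auto-increment IDs all land on the last range, creating the hotspot noted above.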
Hash Partitioning
The key is hashed and the hash is mapped to a shard, giving a roughly uniform distribution.
- ✅ Even distribution
- ❌ Range queries require scatter-gather across all shards
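A sketch of modulo-based hash routing, assuming four shards; a stable hash such as CRC32 is used rather than Python's built-in `hash`, which is salted per process:

```python
import zlib

NUM_SHARDS = 4  # assumption for illustration

def route_by_hash(key: str) -> int:
    """Map a key to a shard via a stable hash; keys spread roughly evenly."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

print(route_by_hash("user:42"))
# Adjacent keys (user:42, user:43, ...) land on unrelated shards, so a range
# query must scatter-gather across all of them.
```

Note that `% NUM_SHARDS` remaps most keys whenever the shard count changes, which is the motivation for the consistent hashing approach referenced below.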
Directory-Based Partitioning
A lookup table maps keys to shards.
- ✅ Flexible — can move data without rehashing
- ❌ Lookup table is a single point of failure / bottleneck
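A sketch of directory-based routing with a hypothetical in-memory lookup table; in practice the directory would live in a replicated store, since a single copy is the bottleneck noted above:

```python
# Hypothetical directory: an explicit key -> shard mapping.
DIRECTORY = {
    "tenant-a": "shard-1",
    "tenant-b": "shard-1",
    "tenant-c": "shard-2",
}

def route_by_directory(key: str) -> str:
    """Look the key up in the directory; no hashing or range math involved."""
    return DIRECTORY[key]

# Rebalancing is just a directory update, no rehashing of other keys:
DIRECTORY["tenant-b"] = "shard-3"
print(route_by_directory("tenant-b"))  # shard-3
```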
Consistent Hashing
See Consistent Hashing — the preferred approach for dynamic shard counts.
Cloud Implementations
PostgreSQL
- Native declarative partitioning: PARTITION BY RANGE / LIST / HASH
- Each partition is a separate table internally
- Logical sharding: Citus extension for distributed PostgreSQL
Cloud Spanner
- Auto-sharding: data is split automatically based on load and data size
- Key-range based; interleaved tables keep parent and child rows on the same split
- Choose keys to avoid hotspots: avoid monotonically increasing primary keys
DynamoDB
- Partition key determines the shard
- Each partition: up to 10 GB, 3,000 RCU, 1,000 WCU
- Hot partition problem: use a random suffix or write sharding for high-throughput keys (see the sketch below)
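A sketch of the write-sharding idea: a hypothetical hot partition key gets a random suffix so writes spread over several partitions, and reads fan out over the same suffixes and merge. The suffix count and key names are assumptions for illustration.

```python
import random

WRITE_SHARDS = 10  # assumption: spread one hot key across 10 partitions

def sharded_partition_key(hot_key: str) -> str:
    """Append a random suffix so writes for a hot key hit different partitions."""
    return f"{hot_key}#{random.randrange(WRITE_SHARDS)}"

def all_partition_keys(hot_key: str) -> list:
    """Reads must query every suffix and merge the results."""
    return [f"{hot_key}#{i}" for i in range(WRITE_SHARDS)]

# e.g. item = {"pk": sharded_partition_key("game-123"), "sk": "<timestamp>", ...}
print(all_partition_keys("game-123"))
```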
Cassandra
- Partition key is hashed onto the ring (consistent hashing)
- RF (replication factor) copies of each partition are spread across nodes
- Choose the partition key so partitions stay under ~100 MB
MongoDB
- Sharding via mongos router + config servers
- Shard key: {user_id: 1} (ranged) or {user_id: "hashed"}
- Chunk-based: data split into 64 MB chunks, balanced across shards
Cross-Shard Queries
Scatter-gather
Queries that do not include the shard key must be broadcast to every shard and the partial results merged, which is expensive. Design query patterns so they always include the shard key. A minimal sketch of the fan-out follows.
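The sketch below assumes hypothetical shard endpoints and a placeholder per-shard query function; the point is the shape of scatter-gather: query every shard in parallel, then merge.

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical shard endpoints

def query_shard(shard: str, predicate: str) -> list:
    """Placeholder: a real client would send the query to this shard's endpoint."""
    return []

def scatter_gather(predicate: str) -> list:
    """Without the shard key, every shard is queried and the results merged."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: query_shard(s, predicate), SHARDS)
    return [row for part in partials for row in part]

print(scatter_gather("age > 30"))
```

With the shard key present, the router can instead compute the owning shard (as in the hash and range sketches above) and send a single targeted query.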