Wide Column Database¶
Interview Time: 60-90 min | Difficulty: Hard
Key Focus: Column families, range queries, compression, time-series optimization
Step 1: Functional & Non-Functional Requirements¶
Functional Requirements¶
- Wide rows keyed by row key, organized into multiple column families
- Put/Get/Delete operations on full rows
- Range queries (scan key range)
- Column-level TTL (auto-delete old data)
- Atomic row updates (all-or-nothing)
- Secondary indices on columns
- Batch operations (multi-row)
- Compression of column families
- Support sparse columns (lots of NULLs)
- Versioning (multiple versions of cell values)
Non-Functional Requirements¶
| Requirement | Target | Notes |
|---|---|---|
| Scale | 1 Petabyte data, 1B rows, 100TB/month growth | Compressed from 5PB raw |
| Throughput | Write: 1M QPS, Read: 100K QPS | Write-heavy workload |
| Latency | p99 <100ms (local), <500ms (remote) | Acceptable latency for analytics |
| Consistency | Strong (within single row) | Distributed across regions |
| Availability | 99.99% uptime | Automatic replication |
| Storage Efficiency | Compression 5-10x (raw → stored) | Column families compress well |
| Fault Tolerance | Survive 2 node failures (replication factor 3) | Rack-aware placement |
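A quick sanity check of these targets (a back-of-envelope sketch; the 200-byte average write size is an assumption, not stated in the table):

    # Back-of-envelope sizing for the targets above (assumed values are marked)
    raw_pb = 5                          # 5 PB raw, from the Scale row
    compression_ratio = 5               # low end of the 5-10x target
    stored_pb = raw_pb / compression_ratio               # ~1 PB stored
    physical_pb = stored_pb * 3                           # replication factor 3

    write_qps = 1_000_000               # write-heavy workload
    avg_write_bytes = 200               # assumption: average cell/row fragment
    ingest_mb_s = write_qps * avg_write_bytes / 1e6       # ~200 MB/s cluster-wide

    print(f"stored ≈ {stored_pb:.1f} PB, on disk ≈ {physical_pb:.1f} PB (x3 replicas)")
    print(f"sustained ingest ≈ {ingest_mb_s:.0f} MB/s before compaction write amplification")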
Step 2: API Design, Data Model & High-Level Design¶
Core API Endpoints¶
PUT /table/{row_key}
{column_families: {
personal: {name: "John", age: 30},
contact: {email: "john@example.com", phone: "555-1234"}
}, timestamp_ms: 1234567890}
→ {success: true}
GET /table/{row_key}
→ {columns: {personal: {name, age}, contact: {email, phone}}, timestamp: 1234567890}
GET /table?start_key=user:1000&end_key=user:2000&max_results=1000
→ {rows: [{row_key, columns}], next_token}
DELETE /table/{row_key}/columns/{column_family}:{column_name}
→ {success: true}
POST /table/scan
{row_range: {start: "user:1000", end: "user:2000"},
filters: [{column: "age", operator: ">", value: 18}]}
→ {rows: [...], scan_token}
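A minimal client sketch against the endpoints above (the base URL is hypothetical and the server is assumed to speak JSON; any HTTP client works, `requests` is used here for brevity):

    import requests  # assumed HTTP client

    BASE = "https://wide-column.example.com"  # hypothetical endpoint

    # Write a row spanning two column families
    requests.put(f"{BASE}/table/user:12345", json={
        "column_families": {
            "personal": {"name": "John", "age": 30},
            "contact": {"email": "john@example.com"},
        },
        "timestamp_ms": 1234567890,
    })

    # Point read of the full row
    row = requests.get(f"{BASE}/table/user:12345").json()

    # Range scan with a server-side filter
    scan = requests.post(f"{BASE}/table/scan", json={
        "row_range": {"start": "user:1000", "end": "user:2000"},
        "filters": [{"column": "age", "operator": ">", "value": 18}],
    }).json()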
Entity Data Model¶
ROW_STORAGE (HBase/Cassandra model)
├─ row_key (PK, often composite: user_id:timestamp)
├─ column_family (e.g., personal, contact, timeline)
│ ├─ column_name (e.g., age, email, post_id)
│ ├─ value (BLOB)
│ └─ timestamp_ms (version history)
├─ ttl_seconds (auto-delete if set)
├─ tags (metadata)
COLUMN_FAMILY (logical grouping)
├─ family_name (e.g., "profile" for user data)
├─ compression (SNAPPY, ZSTD, LZ4)
├─ bloom_filter (for fast non-existence checks)
├─ block_cache_enabled (keep frequently accessed in memory)
├─ ttl_seconds (default for cells in family)
MEMTABLE (in-memory write buffer)
├─ row_key → sorted list of cells
├─ size_mb (configurable, e.g., 64MB)
├─ comparator (how to sort cells: lexicographic, etc.)
SSTABLE (on-disk immutable file)
├─ key_range (min/max keys in file)
├─ index (sparse index for binary search)
├─ bloom_filter (check if key exists)
├─ data_blocks (compressed column data)
├─ metadata (compression type, created_at)
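The same model expressed as Python dataclasses (a sketch; field names mirror the tree above, the concrete types are assumptions):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Cell:
        """One versioned value: (row_key, family, column, timestamp) -> value."""
        row_key: str
        column_family: str
        column_name: str
        value: bytes
        timestamp_ms: int
        ttl_seconds: Optional[int] = None

    @dataclass
    class ColumnFamilyConfig:
        family_name: str
        compression: str = "ZSTD"          # SNAPPY | ZSTD | LZ4
        bloom_filter: bool = True
        block_cache_enabled: bool = True
        default_ttl_seconds: Optional[int] = None

    @dataclass
    class SSTableMeta:
        min_key: str
        max_key: str
        level: int
        compression: str
        created_at_ms: int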
High-Level Architecture¶
graph TB
Client["Client<br/>(application)"]
LB["Load Balancer"]
Server1["Server 1<br/>(region leader)"]
Server2["Server 2<br/>(replica)"]
ServerN["Server N<br/>(replica)"]
Memtable1["Memtable<br/>(write buffer)"]
WAL["Write-Ahead Log<br/>(durability)"]
Compactor["Compactor<br/>(merge SSTables)"]
SSTable1["SSTable 1"]
SSTable2["SSTable 2"]
SSTables["..."]
BlockCache["Block Cache<br/>(L1 row cache)"]
BloomFilter["Bloom Filter<br/>(fast non-existence)"]
Replication["Replication<br/>(commit log)"]
SecondaryIndex["Secondary Index<br/>(column search)"]
Monitoring["Monitoring<br/>(compaction · latency)"]
Client --> LB
LB --> Server1
LB --> Server2
LB --> ServerN
Server1 --> Memtable1
Memtable1 --> WAL
Memtable1 --> BlockCache
Compactor --> SSTable1
Compactor --> SSTable2
Compactor --> SSTables
Server1 --> SSTable1
Server1 --> BloomFilter
Server1 --> SecondaryIndex
Server1 -->|async| Replication
Replication --> Server2
Replication --> ServerN
Server1 --> Monitoring
Step 3: Concurrency, Consistency & Scalability¶
🔴 Problem: Write Amplification (Compaction Overhead)¶
Scenario: 1M writes/sec pile up in memtables. When a memtable fills (64 MB), it is flushed to a new SSTable. If flushing outpaces compaction and a row's data ends up scattered across 100 SSTables, a single read can require up to 100 disk seeks.
Solutions:
| Approach | Implementation | Pros | Cons |
|---|---|---|---|
| Leveled Compaction | L0 files may overlap; L1+ hold non-overlapping SSTables, each level ~10x larger | Few files to check per read, fast reads | More write I/O during compaction |
| Tiered Compaction | Wait until N SSTables accumulate, compact all together | Fewer compactions, smoother writes | Spiky compaction, reads slower |
| Size-Tiered | Compact when N files of same size appear | Balanced read/write | Medium overhead |
| Bloom Filter Optimization | Skip reading SSTables if key not in bloom | Reduce unnecessary reads | Probabilistic (false positives) |
Recommended: Leveled Compaction + Bloom Filters + Block Cache
import itertools
from heapq import merge  # sorted merge used during compaction

class LSMTree:
    """Sketch of the write path; Memtable, SSTable, BlockCache and WriteAheadLog
    are assumed helper classes, as elsewhere in this section."""

    def __init__(self):
        # Levels 0..N of SSTables; each level is ~10x larger than the previous
        self.levels = {0: [], 1: [], 2: []}
        self.memtable = Memtable(max_size_mb=64)
        self.wal = WriteAheadLog()
        self.block_cache = BlockCache(max_size_mb=512)

    def write(self, key: str, value: str):
        """Append to the WAL for durability, then buffer in the memtable."""
        self.wal.write(f"PUT {key}={value}")   # durable before acknowledging
        self.memtable.put(key, value)
        # When the memtable is full, flush it to an L0 SSTable
        if self.memtable.size_mb > 64:
            self.flush_memtable_to_l0()

    def flush_memtable_to_l0(self):
        """Flush the in-memory memtable to an immutable L0 SSTable."""
        data = self.memtable.read_all()        # cells, sorted by key
        sstable = SSTable(data=data, level=0)
        sstable.bloom_filter = self.build_bloom_filter([cell.key for cell in data])
        self.levels[0].append(sstable)
        self.memtable = Memtable(max_size_mb=64)   # fresh, empty memtable
        # Trigger compaction once L0 has accumulated more than 4 SSTables
        if len(self.levels[0]) > 4:
            self.compact_l0_to_l1()

    def compact_l0_to_l1(self):
        """Leveled compaction: merge L0 into the overlapping L1 files (background)."""
        # Read all L0 SSTables
        l0_data = []
        for sstable in self.levels[0]:
            l0_data.extend(sstable.read_all())
        # Sort by key, newest version first, then keep only the latest version per key
        l0_data.sort(key=lambda cell: (cell.key, -cell.version))
        l0_data = [next(group) for _, group in itertools.groupby(l0_data, key=lambda c: c.key)]
        # Merge with the overlapping L1 data (a real compactor would also drop
        # L1 cells shadowed by newer L0 versions)
        l1_data = self.read_overlapping_l1_sstables(l0_data)
        merged_data = list(merge(l0_data, l1_data, key=lambda cell: cell.key))
        # Write back as a new L1 SSTable (split into several files if too large)
        self.levels[1].append(SSTable(data=merged_data, level=1))
        # Drop the now-redundant L0 SSTables
        self.levels[0] = []
        # Cascade: compact L1 into L2 if L1 has grown past its budget
        if len(self.levels[1]) > 10:
            self.compact_l1_to_l2()
Write Amplification Calculation:
Leveled Compaction:
To get 1 MB of user writes onto disk:
- 1 write to the WAL (durability), plus the in-memory memtable update
- 1 write when the memtable flushes to an L0 SSTable
- Each time data moves down a level it is merged with ~10 overlapping SSTables in the next level, so it is rewritten roughly 10x per level
- With 4-6 levels, total write amplification lands in the tens (see the sketch below); that is the price paid for low read amplification
Tiered Compaction:
- Each byte is rewritten roughly once per tier merge, so write amplification is lower
- Compaction events are fewer but larger (spiky I/O); better for write-heavy ingest, worse for read latency
Size-Tiered (Cassandra STCS, similar in spirit to HBase's default policy):
- More balanced; typically quoted at ~5-10x write amplification
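A rough worked version of that accounting, under assumed sizes (10 GB L1 target, fan-out of 10, ~1 PB stored); exact numbers depend on level sizing and overlap:

    import math

    # Rough leveled-compaction write amplification (all sizes are assumptions)
    fanout = 10                 # each level is ~10x larger than the previous
    l1_gb = 10                  # assumed L1 target size
    data_gb = 1_000_000         # ~1 PB stored

    levels = round(math.log(data_gb / l1_gb, fanout))    # -> 5 levels (L1..L5)
    # A byte is rewritten ~fanout times for each level it descends through,
    # plus one WAL write and the initial L0 flush.
    leveled_wa = 2 + levels * fanout
    print(f"{levels} levels -> leveled write amplification ≈ {leveled_wa}x")
    # Tiered / size-tiered compaction rewrites each byte far fewer times
    # (roughly once per tier merge), at the cost of more files to check per read.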
🟡 Problem: Sparse Columns & Storage Waste¶
Scenario: Table has 1000 columns, but each row uses only 10. Storing empty columns wastes space.
Solution: Columnar storage + lazy column deserialization
import time
from collections import defaultdict

class ColumnFamily:
    def __init__(self, name: str, compression: str = 'ZSTD'):
        self.name = name
        self.compression = compression      # compression applied per column block
        self.columns = defaultdict(list)    # column_name -> list of cells

    def write_sparse_row(self, row_key: str, columns_dict: dict):
        """Write only the columns that are present; NULL columns are simply not stored."""
        for column_name, value in columns_dict.items():
            if value is not None:           # only write non-NULL columns
                self.columns[column_name].append({
                    'timestamp': int(time.time() * 1000),
                    'value': value,
                    'row_key': row_key,
                })

    def read_sparse_row(self, row_key: str, requested_columns=None):
        """Read only the requested columns (lazy, column-at-a-time loading)."""
        result = {}
        # If specific columns were requested, read only those
        columns_to_read = requested_columns or list(self.columns.keys())
        for column_name in columns_to_read:
            if column_name in self.columns:
                # On disk this step would decompress just this column's data block;
                # here the in-memory list stands in for the decompressed block
                column_data = self.columns[column_name]
                row_values = [cell for cell in column_data if cell['row_key'] == row_key]
                if row_values:
                    result[column_name] = row_values[-1]['value']   # latest version
        return result
def estimate_storage_savings():
"""Compare dense vs sparse storage"""
# Dense schema: all columns stored even if NULL
num_rows = 1e9
num_columns = 1000
bytes_per_column = 50 # Average column value size
columns_per_row = 10 # Average columns filled per row
# Dense storage: store all columns (waste for empty)
dense_storage = num_rows * num_columns * bytes_per_column
print(f"Dense storage: {dense_storage / 1e12:.1f} TB")
# Sparse storage: only store filled columns + metadata
sparse_storage = num_rows * columns_per_row * bytes_per_column
sparse_storage += num_rows * columns_per_row * 20 # Metadata per cell
print(f"Sparse storage: {sparse_storage / 1e12:.1f} TB")
# Compression on sparse
compressed_sparse = sparse_storage * 0.2 # ZSTD at 80% reduction
print(f"Sparse + compression: {compressed_sparse / 1e12:.1f} TB")
# Savings
savings_pct = (1 - compressed_sparse / dense_storage) * 100
print(f"Savings: {savings_pct:.1f}%")
Storage Comparison (matching the numbers above):
Dense (all 1,000 columns stored): 50 TB
Sparse (only ~10 filled columns + cell metadata): 0.7 TB (~70x smaller)
Sparse + ZSTD compression: 0.14 TB (~350x smaller)
💾 Problem: Range Query Performance (Scan Key Range)¶
Scenario: Query "get all user data for user_ids in range [10000, 20000]". The scan may touch many SSTables, making the query slow.
Solution: Prune SSTables with min/max key metadata, seek with the sparse block index, and use secondary indices for column predicates
from collections import defaultdict

class RangeScanner:
    def __init__(self, all_sstables):
        self.all_sstables = all_sstables   # SSTables across all levels
        self.secondary_indices = {}        # column_name -> {value -> {row_keys}}

    def scan_range(self, start_key: str, end_key: str, limit: int = 10000):
        """Efficiently scan a contiguous range of row keys."""
        results = []
        # Only open SSTables whose [min_key, max_key] overlaps the requested range
        candidate_sstables = self.find_candidate_sstables(start_key, end_key)
        print(f"Scanning {len(candidate_sstables)} SSTables "
              f"(out of {len(self.all_sstables)} total)")
        for sstable in candidate_sstables:
            # Use the sparse block index to seek directly to start_key
            block_index = sstable.find_block_containing_key(start_key)
            past_range = False
            for block in sstable.blocks[block_index:]:
                for key, value in block.read():
                    if key > end_key:
                        past_range = True      # keys are sorted: nothing more in this SSTable
                        break
                    if key >= start_key:
                        results.append((key, value))
                        if len(results) >= limit:
                            return results
                if past_range:
                    break
        return results

    def find_candidate_sstables(self, start_key: str, end_key: str):
        """Skip SSTables whose key range cannot overlap [start_key, end_key].

        Note: per-SSTable bloom filters answer point lookups ("is this exact key
        in this file?"); they cannot prune an arbitrary range scan, so range
        pruning relies on the min/max key metadata instead.
        """
        candidates = []
        for sstable in self.all_sstables:
            if start_key > sstable.max_key or end_key < sstable.min_key:
                continue                       # this SSTable doesn't overlap our range
            candidates.append(sstable)
        return candidates

    def create_secondary_index(self, column_name: str):
        """Create a secondary index on a column for faster equality queries."""
        # Secondary index: column_value -> {row_keys}
        index = defaultdict(set)
        # Scan all rows once, build the index
        for sstable in self.all_sstables:
            for row in sstable.read_all():
                column_value = row.get(column_name)
                if column_value is not None:
                    index[column_value].add(row['row_key'])
        # In a real system the index is persisted as its own table / SSTables
        self.secondary_indices[column_name] = index
        # Queries like "where status = ACTIVE" now avoid a full table scan
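To show how the index is consulted at query time, here is a sketch of an equality query that could sit alongside the scanner above (get_row is a hypothetical point-read helper, not defined in this section):

    def query_by_column(scanner, column_name: str, value):
        """Equality query via the secondary index: value -> row_keys -> rows."""
        index = scanner.secondary_indices.get(column_name)
        if index is None:
            raise ValueError(f"no secondary index on {column_name!r}")
        row_keys = index.get(value, set())
        # Each hit becomes a normal point read (memtable -> block cache -> SSTables)
        return [scanner.get_row(rk) for rk in sorted(row_keys)]

    # Usage: all rows where status == "ACTIVE", without a full table scan
    # active_rows = query_by_column(scanner, "status", "ACTIVE")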
Step 4: Persistence Layer, Caching & Monitoring¶
Database Design (HBase/Cassandra Physical Layout)¶
-- Row storage (HBase/Cassandra table)
CREATE TABLE users (
row_key TEXT PRIMARY KEY, -- e.g., "user:12345"
-- Column Family: Personal Information
personal:name TEXT,
personal:email TEXT,
personal:age INT,
personal:phone TEXT,
-- Column Family: Profile Details
profile:bio TEXT,
profile:avatar_url TEXT,
profile:created_at TIMESTAMP,
profile:updated_at TIMESTAMP,
-- Column Family: Timeline (wide, sparse)
timeline:post_1234 TEXT,
timeline:post_5678 TEXT,
timeline:[many more posts, only recent kept]
) WITH
COMPRESSION = 'ZSTD',
BLOOM_FILTER = true,
BLOCK_CACHE_SIZE = 512MB,
TTL = 63072000; -- 2 years, old posts auto-deleted
-- Each cell has: (row_key, column_family, column_name, timestamp) → value
-- Separate SSTable files per column family (allows column-level compression)
-- personal CF compressed at 70%
-- timeline CF compressed at 85% (repeated structure)
-- Write-Ahead Log (durable recovery)
CREATE TABLE wal (
server_id TEXT,
sequence_number BIGINT PRIMARY KEY,
operation TEXT, -- PUT, DELETE
row_key TEXT,
column_family TEXT,
column TEXT,
value BLOB,
timestamp_ms BIGINT
);
-- Memtable index (in-memory)
CREATE INDEX idx_memtable ON users(row_key, column_family, column_name);
-- Secondary indices
CREATE INDEX idx_users_email ON users(personal:email);
CREATE INDEX idx_users_created_at ON users(profile:created_at);
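The cell addressing noted above ((row_key, column_family, column_name, timestamp) → value) can be sketched as a sortable byte key. This is illustrative only: the \x00 separator and inverted timestamp are assumptions, not the exact HBase/Cassandra encoding.

    import struct

    def encode_cell_key(row_key: str, family: str, column: str, timestamp_ms: int) -> bytes:
        """Sortable on-disk key: versions of one cell sort newest-first.

        Assumes '\x00' never appears in user keys; the timestamp is inverted so a
        plain lexicographic scan returns the latest version first.
        """
        inverted_ts = (2**63 - 1) - timestamp_ms
        return b"\x00".join([
            row_key.encode(),
            family.encode(),
            column.encode(),
            struct.pack(">Q", inverted_ts),   # big-endian so bytes sort numerically
        ])

    # e.g. encode_cell_key("user:12345", "personal", "email", 1234567890)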
Caching Strategy¶
TIER 1: Block Cache (in-memory, per SSTable block)
├─ LRU cache of 512 MB per server (sketched below)
├─ Caches decompressed column data blocks
└─ Survives server restart (can be dumped to disk)
TIER 2: Row Cache (optional, application-level)
├─ Cache frequently accessed rows (e.g., user profiles)
├─ TTL-based invalidation
└─ Useful for hot rows (celebrities, VIPs)
TIER 3: Bloom Filter (one per SSTable)
├─ O(1) negative lookups (definitely not in this file)
├─ ~1% false positive rate
└─ Skip disk seeks for missing keys (sketched below)
TIER 4: Disk (SSTables, compressed)
├─ Write-optimized (append-only)
├─ Compacted periodically (background)
└─ Tiered by temperature
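A minimal LRU sketch of the Tier 1 block cache (illustrative; keying blocks by (sstable_id, block_offset) is an assumption):

    from collections import OrderedDict

    class BlockCache:
        """LRU cache of decompressed data blocks, bounded by total bytes."""
        def __init__(self, max_size_mb: int = 512):
            self.max_bytes = max_size_mb * 1024 * 1024
            self.current_bytes = 0
            self.blocks = OrderedDict()          # (sstable_id, block_offset) -> bytes

        def get(self, key):
            block = self.blocks.get(key)
            if block is not None:
                self.blocks.move_to_end(key)     # mark as most recently used
            return block

        def put(self, key, block: bytes):
            if key in self.blocks:
                self.current_bytes -= len(self.blocks.pop(key))
            self.blocks[key] = block
            self.current_bytes += len(block)
            # Evict least recently used blocks until under the memory budget
            while self.current_bytes > self.max_bytes:
                _, evicted = self.blocks.popitem(last=False)
                self.current_bytes -= len(evicted)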
Invalidation:
- Write to row → update memtable immediately
- Memtable flush → new SSTable (no cache invalidation needed)
- TTL expiry → automatic deletion via background scan
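And a minimal sketch of the Tier 3 bloom filter, sized for roughly 1% false positives (illustrative only; production systems use tuned, cache-friendly implementations):

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, expected_keys: int, fp_rate: float = 0.01):
            # Standard sizing formulas for m bits and k hash functions
            self.m = max(8, int(-expected_keys * math.log(fp_rate) / (math.log(2) ** 2)))
            self.k = max(1, int((self.m / expected_keys) * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, key: str):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key: str):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key: str) -> bool:
            # False positives possible (~1%), false negatives never
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))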
Monitoring & Alerts¶
Key Metrics:
- Write Latency — Time to persist write (target: <10ms p99)
- Read Latency — Time to serve read (target: <100ms p99)
- Compaction Impact — % of time spent in compaction (target: <20%)
- Bloom Filter Hit Rate — % of neg lookups avoided (target: >90%)
- SSTable Count — How many files in LSM tree (target: <50 per level)
- alert: WriteLatencyHigh
expr: histogram_quantile(0.99, rate(write_latency_seconds_bucket[5m])) > 0.01
annotations: "Write p99: {{$value}}s (target: <10ms)"
- alert: ReadLatencyHigh
expr: histogram_quantile(0.99, rate(read_latency_seconds_bucket[5m])) > 0.1
annotations: "Read p99: {{$value}}s (target: <100ms)"
- alert: CompactionOverhead
expr: rate(compaction_time_seconds[5m]) / rate(total_time_seconds[5m]) > 0.30
annotations: "Compaction {{$value | humanizePercentage}} of time (excess?)"
- alert: SSTablesTooMany
expr: count(sstable_files) > 100
for: 10m
annotations: "{{$value}} SSTables (compaction falling behind?)"
- alert: BlockCachePressure
expr: block_cache_eviction_rate > 1000
annotations: "Block cache evicting {{$value}}/sec items (increase cache size?)"
- alert: MemtablePressure
expr: rate(memtable_flush_total[5m]) > 100
annotations: "Memtables flushing {{$value}}/sec (write load too high?)"
⚡ Quick Reference Cheat Sheet¶
When to Use What¶
| Need | Technology | Why |
|---|---|---|
| Write-heavy | LSM Tree (HBase/Cassandra) | Append-only memtable, fast writes |
| Compression | ZSTD at column-family level | 80-90% reduction on repetitive data |
| Range queries | Sparse block index + min/max key metadata | Seek to start key, skip non-overlapping SSTables |
| Sparse columns | Columnar storage (per-column) | Only store present data |
| Read amplification | Leveled compaction | Limit SSTable count (fewer seeks) |
| Hot row caching | Row cache (optional) | Avoid repeated disk access |
Critical Design Decisions¶
- Leveled Compaction: Fewer files (10-20), fast range queries; moderate write overhead
- Bloom Filters: Skip SSTables where the key is definitely absent (~1% false positives, never false negatives)
- Column Family: Compress each family separately; families evolve independently
- Sparse storage: Only persist populated columns; ~10-50x space savings
- Block cache: Keep decompressed blocks in RAM; trade memory for CPU (decompression)
- TTL per column: Auto-delete old data (time-series use case) without blocking the application (see the sketch below)
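A sketch of lazy TTL enforcement, assuming cells carry timestamp_ms and ttl_seconds as in the data model above: expired cells are ignored on read and dropped for good during compaction.

    import time

    def is_expired(cell, now_ms=None) -> bool:
        """A cell is dead once its TTL has elapsed; no synchronous delete is needed."""
        if cell.ttl_seconds is None:
            return False
        now_ms = now_ms or int(time.time() * 1000)
        return now_ms - cell.timestamp_ms > cell.ttl_seconds * 1000

    def compaction_filter(cells):
        """During compaction, rewrite only the cells that are still live."""
        return [c for c in cells if not is_expired(c)]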
Tech Stack Summary¶
Data Structure: LSM Tree (WAL + memtable → SSTables, L0…Ln)
Compaction: Leveled compaction (HBase/Cassandra style)
Compression: ZSTD per column family (80% reduction)
Indexing: Sparse block index + bloom filter
Replication: Quorum (N=3, W=2, R=2) by row_key hash (sketched below)
Monitoring: Latency, compaction overhead, SSTable count
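A sketch of the quorum path implied by N=3, W=2, R=2 (the replica objects and their put/get methods are hypothetical); because W + R > N, every read quorum overlaps the latest write quorum.

    def quorum_write(replicas, key, value, timestamp_ms, w=2):
        """Succeeds once w of n replicas acknowledge; the rest catch up
        asynchronously (hinted handoff / read repair in real systems)."""
        acks = 0
        for replica in replicas:                 # n = len(replicas) = 3
            try:
                replica.put(key, value, timestamp_ms)
                acks += 1
                if acks >= w:
                    return True
            except ConnectionError:
                continue
        return acks >= w

    def quorum_read(replicas, key, r=2):
        """Reads r replicas and returns the newest version (last-write-wins)."""
        versions = []
        for replica in replicas:
            try:
                versions.append(replica.get(key))   # -> (value, timestamp_ms)
                if len(versions) >= r:
                    break
            except ConnectionError:
                continue
        if len(versions) < r:
            raise RuntimeError("quorum not reached")
        return max(versions, key=lambda v: v[1])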
🎯 Interview Summary (5 Minutes)¶
- LSM Tree storage: Writes go to memtable (RAM → WAL for durability), flush to SSTables when full
- Leveled compaction: Merge SSTables in background; keep files per level = 10x previous (fewer seeks)
- Columnar storage: Compress each column family independently; sparse columns save 10-50x space
- Bloom filters: Quick negative lookups; skip nearly all disk reads for keys that don't exist (~1% false positives)
- Range queries: Use the sparse block index to jump to the start key; min/max key metadata skips non-overlapping SSTables
- Block cache: Keep decompressed blocks in RAM; trade memory for CPU (avoid decompression overhead)
- TTL support: Auto-delete old cells per column family; useful for time-series (keep recent, purge old)