8.03 · Deep Dive: Apache Cassandra for High-Scale Analytics

Level: Advanced
Time to read: 15 min
Pre-reading: 08 · Tools Ecosystem
After reading: You'll understand Cassandra's distributed architecture, when to use it for analytics, and the trade-offs vs. traditional data warehouses.


Cassandra: Distributed NoSQL for Scale

Architecture

Cassandra Cluster (Multi-node)
┌──────────────┬──────────────┬──────────────┐
│   Node 1     │   Node 2     │   Node 3     │
├──────────────┼──────────────┼──────────────┤
│ Partition:   │ Partition:   │ Partition:   │
│ Token Range  │ Token Range  │ Token Range  │
│ 0-100        │ 100-200      │ 200-300      │
│              │              │              │
│ Data:        │ Data:        │ Data:        │
│ Customer1-5  │ Customer6-10 │ Customer11-15│
└──────────────┴──────────────┴──────────────┘
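The routing in the diagram above can be sketched as a hash ring: each partition key hashes to a token, and the node owning that token range stores the row. This is a simplified illustration only; real Cassandra uses the Murmur3 partitioner, a 64-bit token space, and virtual nodes, not md5 and a 0-300 range.

```python
import hashlib

# Three nodes own contiguous token ranges, mirroring the diagram above.
# md5 and the 0-300 token space are illustrative assumptions.
NODES = {
    "node1": range(0, 100),
    "node2": range(100, 200),
    "node3": range(200, 300),
}

def token_for(partition_key: str) -> int:
    """Hash a partition key onto the 0-299 token space."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % 300

def node_for(partition_key: str) -> str:
    """Route a partition key to the node owning its token range."""
    token = token_for(partition_key)
    for node, token_range in NODES.items():
        if token in token_range:
            return node
    raise ValueError("token outside ring")  # unreachable with % 300
```

Because the hash is deterministic, every client routes a given key to the same node with no central coordinator; adding a node just re-splits the token ranges.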

Key properties:
✅ Horizontal scaling (add nodes, data auto-distributes)
✅ No single point of failure (replication)
✅ Tunable consistency (per-query trade-off along the AP/CP spectrum)
❌ Not optimized for aggregations
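"Tunable consistency" means each read and write chooses how many replicas must acknowledge it. The standard rule of thumb: with replication factor N, reads touching R replicas and writes touching W replicas are strongly consistent when R + W > N, because the read set must overlap the latest write. A minimal sketch of that check:

```python
def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """R + W > N guarantees every read overlaps the latest write."""
    return read_replicas + write_replicas > replication_factor

# RF=3: QUORUM writes (2) + QUORUM reads (2) overlap → consistent.
# RF=3: ONE write (1) + ONE read (1) may miss the latest value → eventual.
```

This is why QUORUM/QUORUM is the common choice when read-after-write matters, and ONE/ONE is used when availability and latency win.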

Strengths

  • ✅ Massive scale (millions of writes/sec, PB+ storage)
  • ✅ Horizontal scaling (add nodes to scale near-linearly)
  • ✅ High availability (automatic replication, no downtime on node failure)
  • ✅ Fast writes (log-structured, write-optimized storage)
  • ✅ Decentralized (peer-to-peer, no single point of failure)


Weaknesses for Analytics

  • ❌ No JOINs (data is denormalized, hard to analyze relationally)
  • ❌ Limited aggregations (cross-partition rollups must be computed in the application)
  • ❌ Complex queries are hard (designed for simple key-based lookups)
  • ❌ Eventual consistency (read-after-write anomalies at weak consistency levels)
  • ❌ Complex to operate (requires cluster expertise)
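Because cross-partition aggregates are so limited, rollups typically happen in the application after fetching rows. A hedged sketch (the dicts stand in for driver result rows; field names mirror the metrics table later in this section):

```python
from statistics import mean

# Rows as dicts stand in for rows returned by a Cassandra driver.
rows = [
    {"sensor_id": "server1", "value": 0.5},
    {"sensor_id": "server1", "value": 1.5},
    {"sensor_id": "server2", "value": 0.75},
]

def avg_by_sensor(rows):
    """Group-and-average client-side, since CQL can't JOIN or
    GROUP BY across partitions."""
    by_sensor = {}
    for row in rows:
        by_sensor.setdefault(row["sensor_id"], []).append(row["value"])
    return {sensor: mean(values) for sensor, values in by_sensor.items()}

# avg_by_sensor(rows) → {'server1': 1.0, 'server2': 0.75}
```

This works for one partition's worth of data; for cluster-wide analytics it quickly becomes the bottleneck, which motivates the warehouse pattern below.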


When to Use Cassandra

Good Fit

  • ✅ Time-series data (IoT sensors, metrics, logs)
  • ✅ Session management (billions of user sessions)
  • ✅ Write-heavy workloads (millions of events/sec)
  • ✅ Mobile apps (offline-first, sync later)

Poor Fit

  • ❌ Complex analytics
  • ❌ Ad-hoc queries
  • ❌ JOIN-heavy workloads
  • ❌ Small data (< 100GB)

Cassandra for Analytics Pattern

Cassandra as hot storage + separate analytics warehouse:

Operational Cassandra (fast writes, high volume)
        ↓
ETL Pipeline (export to Parquet)
        ↓
Cloud Data Warehouse (e.g. BigQuery)
        ↓
Reports & Dashboards
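The core of the export step is pivoting Cassandra's row-oriented results into the column-oriented layout that Parquet stores and warehouses scan. In practice a Spark job or pyarrow would write the actual Parquet files; this standard-library sketch shows only that row-to-columnar pivot:

```python
def rows_to_columns(rows):
    """Pivot row-oriented results into a column-oriented layout,
    i.e. one list per column, as columnar formats like Parquet store data."""
    if not rows:
        return {}
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

rows = [
    {"metric_name": "cpu_usage", "sensor_id": "server1", "value": 0.5},
    {"metric_name": "cpu_usage", "sensor_id": "server2", "value": 0.75},
]
# rows_to_columns(rows)["value"] → [0.5, 0.75]
```

Columnar layout is what makes warehouse aggregations fast: a `SELECT avg(value)` only has to scan the `value` list, not every row.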

Example: Cassandra Table Design

-- Time-series table (optimized for reads by timestamp)
CREATE TABLE metrics (
  metric_name TEXT,
  timestamp TIMESTAMP,
  sensor_id TEXT,
  value DOUBLE,
  PRIMARY KEY ((metric_name, sensor_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Partition key: (metric_name, sensor_id)
--   → All data for one metric-sensor combo on same partition
-- Clustering key: timestamp (DESC)
--   → Sorted by time within partition
--   → Fast range queries (recent 1 hour)

-- Query: get the last 1 hour of metrics for one sensor (efficient)
-- Note: CQL has no INTERVAL syntax; the cutoff timestamp is computed
-- in the application and bound as a parameter.
SELECT * FROM metrics
WHERE metric_name = 'cpu_usage'
  AND sensor_id = 'server1'
  AND timestamp >= ?;  -- e.g. now - 1 hour, bound by the client
-- Reads a single partition slice → fast!

-- Anti-query (SLOW! Cassandra rejects it unless you add ALLOW FILTERING)
SELECT * FROM metrics
WHERE timestamp >= ?;  -- same cutoff, but no partition key
-- Must scan ALL partitions (full cluster scan!)
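Queries like the efficient one above usually receive their time cutoff from the application, computed in UTC to match Cassandra's timestamp type. A minimal sketch (the driver call is shown only as an illustrative comment, not an exact API):

```python
from datetime import datetime, timedelta, timezone

def last_hour_window(now=None):
    """Return the inclusive lower bound for a 'last 1 hour' slice query.

    Computed client-side in UTC and bound as a query parameter.
    """
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=1)

# With a driver, the bound query would look roughly like (illustrative):
# session.execute(prepared_select,
#                 ("cpu_usage", "server1", last_hour_window()))
```

Binding the cutoff keeps the query a single-partition slice: partition key first, then a range on the clustering column.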

Key Takeaways

  1. Cassandra is for scale, not analytics
  2. Use separate DWH for analytical queries
  3. Denormalize aggressively in Cassandra
  4. Design for access patterns, not flexibility
  5. Export to cloud DWH for complex analytics