8.03 · Deep Dive: Apache Cassandra for High-Scale Analytics
Level: Advanced | Time to read: 15 min | Pre-reading: 08 · Tools Ecosystem
After reading: You'll understand Cassandra's distributed architecture, when to use it for analytics, and the trade-offs vs. traditional data warehouses.
Cassandra: Distributed NoSQL for Scale
Architecture
Cassandra Cluster (Multi-node)
┌──────────────┬──────────────┬──────────────┐
│ Node 1       │ Node 2       │ Node 3       │
├──────────────┼──────────────┼──────────────┤
│ Partition:   │ Partition:   │ Partition:   │
│ Token Range  │ Token Range  │ Token Range  │
│ 0-100        │ 100-200      │ 200-300      │
│              │              │              │
│ Data:        │ Data:        │ Data:        │
│ Customer1-5  │ Customer6-10 │ Customer11-15│
└──────────────┴──────────────┴──────────────┘
Key properties:
✅ Horizontal scaling (add nodes, data auto-distributes)
✅ No single point of failure (replication)
✅ Tunable consistency (per-query trade-off between availability and consistency)
❌ Not optimized for aggregations
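The token-range routing in the diagram can be sketched in Python. This is a toy three-node ring with the hypothetical ranges 0-300 from the figure; real Cassandra uses Murmur3 tokens in a 64-bit space plus virtual nodes, so the hash and ranges here are illustrative stand-ins only:

```python
import hashlib

# Toy ring: each node owns a contiguous token range (hypothetical
# ranges from the diagram, not Cassandra's real Murmur3 token space).
RING = [
    (0, 100, "node1"),
    (100, 200, "node2"),
    (200, 300, "node3"),
]

def token(partition_key: str) -> int:
    # Hash the partition key to a token in [0, 300) (toy hash, not Murmur3).
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 300

def owner(partition_key: str) -> str:
    # Route the key to whichever node owns its token range.
    t = token(partition_key)
    for lo, hi, node in RING:
        if lo <= t < hi:
            return node
    raise AssertionError("unreachable: ring covers [0, 300)")

for key in ["Customer1", "Customer7", "Customer12"]:
    print(key, "->", owner(key))
```

Because routing is a pure function of the partition key, any coordinator node can locate the data without a central directory, which is what makes adding nodes a matter of re-dividing the token ranges.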
Strengths
✅ Massive scale (millions of writes/sec, PB+ storage)
✅ Horizontal scaling (add nodes to scale linearly)
✅ High availability (no downtime, automatic replication)
✅ Fast writes (write-optimized storage)
✅ Decentralized (no single point of failure)
Weaknesses for Analytics
❌ No JOINs (data must be denormalized, which is hard to analyze)
❌ Limited aggregations (most must be computed in the application)
❌ Complex queries are hard (designed for simple key-based lookups)
❌ Eventual consistency (read-after-write anomalies at low consistency levels)
❌ Complex to operate (requires cluster expertise)
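The "tunable consistency" and "eventual consistency" points are two sides of the same arithmetic: with replication factor RF, a read at consistency level R and a write at level W are guaranteed to overlap in at least one replica whenever R + W > RF. A minimal sketch of that rule:

```python
def quorum(rf: int) -> int:
    # Cassandra's QUORUM level: a majority of replicas.
    return rf // 2 + 1

def read_sees_latest_write(rf: int, w: int, r: int) -> bool:
    # Strong consistency iff the read and write replica
    # sets must intersect in at least one replica.
    return r + w > rf

rf = 3
w = quorum(rf)  # 2 replicas must acknowledge the write
r = quorum(rf)  # 2 replicas must answer the read

print(read_sees_latest_write(rf, w, r))  # QUORUM + QUORUM: True (2+2 > 3)
print(read_sees_latest_write(rf, 1, 1))  # ONE + ONE: False (may read stale data)
```

Writing and reading at ONE maximizes throughput and availability but accepts stale reads; QUORUM on both sides buys read-after-write consistency at the cost of latency and reduced fault tolerance.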
When to Use Cassandra
Good Fit
- ✅ Time-series data (IoT sensors, metrics, logs)
- ✅ Session management (billions of user sessions)
- ✅ Write-heavy workloads (millions of events/sec)
- ✅ Mobile apps (offline-first, sync later)
Poor Fit
- ❌ Complex analytics
- ❌ Ad-hoc queries
- ❌ JOIN-heavy workloads
- ❌ Small data (< 100GB)
Cassandra for Analytics Pattern
Cassandra as hot storage + separate analytics warehouse:
Operational Cassandra
(fast writes, high volume)
↓
ETL Pipeline (export to Parquet)
↓
Cloud Data Warehouse (BigQuery)
(analytics queries)
↓
Reports & Dashboards
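The export step of this pattern pages through Cassandra rows and pivots them into columnar batches, which is the shape a Parquet writer consumes. The sketch below fakes the row source with an in-memory generator; a real job would page via the Cassandra driver and hand each batch to a Parquet library such as pyarrow (both assumptions, not shown here):

```python
# Sketch of the ETL export: page through rows, pivot each page into
# {column: [values]} batches as a columnar (Parquet-style) writer expects.

def fetch_pages(page_size=2):
    # Stand-in for driver paging; hypothetical sample rows.
    rows = [
        {"metric_name": "cpu_usage", "sensor_id": "server1", "value": 0.71},
        {"metric_name": "cpu_usage", "sensor_id": "server2", "value": 0.42},
        {"metric_name": "mem_usage", "sensor_id": "server1", "value": 0.55},
    ]
    for i in range(0, len(rows), page_size):
        yield rows[i:i + page_size]

def to_columnar(rows):
    # Pivot row dicts into column arrays for a columnar writer.
    cols = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            cols[key].append(value)
    return cols

batches = [to_columnar(page) for page in fetch_pages()]
print(batches[0]["metric_name"])  # ['cpu_usage', 'cpu_usage']
```

Paging keeps memory bounded on large tables, and writing one Parquet file per batch (or per time window) gives the downstream warehouse cheap, partitioned loads.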
Example: Cassandra Table Design
-- Time-series table (optimized for reads by timestamp)
CREATE TABLE metrics (
    metric_name TEXT,
    timestamp   TIMESTAMP,
    sensor_id   TEXT,
    value       DOUBLE,
    PRIMARY KEY ((metric_name, sensor_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Partition key: (metric_name, sensor_id)
-- → All data for one metric-sensor combo on same partition
-- Clustering key: timestamp (DESC)
-- → Sorted by time within partition
-- → Fast range queries (recent 1 hour)
-- Query: get the last hour for one metric-sensor combo (efficient)
SELECT * FROM metrics
WHERE metric_name = 'cpu_usage'
  AND sensor_id = 'server1'
  AND timestamp >= toTimestamp(now()) - 1h;
-- (Timestamp/duration arithmetic requires Cassandra 4.0+;
--  on older versions, compute the cutoff in the application.)
-- Reads a single partition slice → fast!

-- Anti-query (SLOW!)
SELECT * FROM metrics
WHERE timestamp >= toTimestamp(now()) - 1h
ALLOW FILTERING;
-- Scans ALL partitions (full cluster scan!); without ALLOW FILTERING,
-- Cassandra rejects a query that omits the partition key.
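Why the first query is fast and the second is not can be seen with a toy in-memory model of the storage layout: one entry per partition, rows kept sorted by the clustering column. The data here is hypothetical and the dict is a Python stand-in for Cassandra's on-disk partitions:

```python
import bisect

# Toy storage model: one dict entry per partition key,
# rows sorted ascending by timestamp (the clustering column).
partitions = {
    ("cpu_usage", "server1"): [(t, 0.5 + t / 100) for t in range(100)],
    ("cpu_usage", "server2"): [(t, 0.3 + t / 100) for t in range(100)],
    ("mem_usage", "server1"): [(t, 0.6) for t in range(100)],
}

def fast_query(metric, sensor, since):
    # With the partition key: jump straight to one partition,
    # then binary-search the sorted clustering column.
    rows = partitions[(metric, sensor)]
    i = bisect.bisect_left(rows, (since,))
    return rows[i:]

def slow_query(since):
    # Without the partition key: every partition (on every node,
    # in a real cluster) must be scanned row by row.
    out = []
    for rows in partitions.values():
        out.extend(row for row in rows if row[0] >= since)
    return out

print(len(fast_query("cpu_usage", "server1", 90)))  # 10 rows, one partition touched
print(len(slow_query(90)))                          # 30 rows, every partition scanned
```

The fast path does O(log n) work inside a single partition; the slow path's cost grows with the total number of partitions, which on a real cluster means fanning the query out to every node.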
Key Takeaways
- Cassandra is for scale, not analytics
- Use separate DWH for analytical queries
- Denormalize aggressively in Cassandra
- Design for access patterns, not flexibility
- Export to cloud DWH for complex analytics