Advanced Topics
Advanced Cassandra features for large-scale, secure, and reliable deployments. These topics are essential for production-grade systems.
Definitions & Core Concepts
Hot Partitions
Definition: A hot partition is a partition (or set of partitions) that receives a disproportionate amount of traffic compared to other partitions in the same table.
What it means:
- One node becomes a bottleneck while others sit idle
- Throughput is capped by the hot partition's node capacity
- Causes latency spikes and timeouts
- The most common Cassandra performance issue
- Usually the result of poor partition key design or skewed data distribution
Example of Hot Partition:
CREATE TABLE user_activity (
user_id UUID,
activity_date DATE,
action TEXT,
PRIMARY KEY (user_id, activity_date, action)
);
-- Celebrity user: 1M actions per second
-- Regular user: 100 actions per second
-- Partition key = user_id
// Celebrity's partition is on Node1
// Node1 handles orders of magnitude more traffic than Node2, Node3
// Result: Node1 CPU/network saturated
// Celebrity actions slow, timeouts
// Other nodes idle
Identifying Hot Partitions:
- Monitor node CPU and network traffic (should be balanced across nodes)
- Use nodetool or JMX metrics
- Watch for timeouts on specific queries
- Check whether a few partition keys account for most of the traffic
Preventing Hot Partitions
Strategy 1: Composite Partition Key (Salting)
-- Distribute hot data across multiple partitions
CREATE TABLE celebrity_activity (
user_id UUID,
shard INT, -- NEW: 0-9
activity_date DATE,
action TEXT,
PRIMARY KEY ((user_id, shard), activity_date, action)
);
-- When inserting, randomly pick a shard (0-9) in the application:
-- CQL itself has no random() function, so the shard is chosen client-side
INSERT INTO celebrity_activity (user_id, shard, activity_date, action)
VALUES (celebrity_uuid, 3, toDate(now()), 'liked_video'); -- shard = app-chosen 0-9
-- Traffic now spreads across 10 partitions instead of 1
// Single partition: 100K writes/sec
// Spread across 10: ~10K writes/sec per partition
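Since the shard is picked in the application, the client-side logic can be sketched in a few lines of Python (the shard count of 10 matches the schema above; names are illustrative):

```python
import random

SHARD_COUNT = 10  # must match the 0-9 shard range used in the schema

def write_shard() -> int:
    """Pick a random shard for each write so load spreads evenly."""
    return random.randrange(SHARD_COUNT)

def read_shards() -> list[int]:
    """Reads must fan out to every shard and merge results client-side."""
    return list(range(SHARD_COUNT))
```

The trade-off: writes scale out by roughly the shard count, but every read of the full activity history becomes SHARD_COUNT queries merged in the application.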
Strategy 2: Time-Bucket Distribution
-- Already using in time-series
CREATE TABLE metrics (
sensor_id UUID,
date_bucket DATE,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, date_bucket), timestamp)
);
// If sensor sends constant stream of data:
// Partition rotates daily (sensor_id, 2024-01-01) → (sensor_id, 2024-01-02)
// Load spreads over time buckets
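The bucket computation the application would do before each write can be sketched as (daily buckets, matching the DATE column above):

```python
from datetime import datetime, date

def date_bucket(ts: datetime) -> date:
    """Daily bucket: every reading from one calendar day lands in the
    same partition, and the partition rotates at midnight."""
    return ts.date()

# Partition key for a reading taken at 13:45 on 2024-01-01:
# (sensor_id, date(2024, 1, 1))
bucket = date_bucket(datetime(2024, 1, 1, 13, 45))
```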
Strategy 3: Better Partition Key
-- BAD: Partition by low-cardinality field
CREATE TABLE status_tasks (
status TEXT, -- Only "pending", "done", "failed"
task_id UUID,
PRIMARY KEY (status, task_id) -- status is the partition key
);
// All "pending" tasks → one partition → hot spot
-- BETTER: Partition by high-cardinality field
CREATE TABLE tasks (
task_id UUID, -- Millions of unique values
status TEXT,
PRIMARY KEY (task_id)
);
// Each task is own partition → distributed evenly
Multi-Datacenter Replication
Definition
Definition: Multi-DC replication is maintaining copies of data across geographically distributed data centers for disaster recovery, local reads, and high availability.
What it means:
- Data replicated across multiple DCs automatically
- Reads can go to the nearest DC (low latency)
- Writes can complete against the nearest DC (local quorum)
- The cluster survives the failure of an entire DC
- Consistency across DCs is eventual
NetworkTopologyStrategy
Definition: A replication strategy that sets the number of replicas independently for each data center.
Example:
CREATE KEYSPACE app_data WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC_US_EAST': 3, -- 3 replicas in US East
'DC_EU_WEST': 3, -- 3 replicas in EU West
'DC_ASIA': 2 -- 2 replicas in Asia
};
-- Data replicated automatically:
-- A write in any DC is forwarded to replicas in every other DC in parallel
-- Remote DCs catch up within milliseconds to seconds, depending on the network
Multi-DC Query Patterns
Local Quorum Writes
-- Write to nearest DC first (fast)
-- Consistency is a client setting, not CQL statement syntax; in cqlsh:
CONSISTENCY LOCAL_QUORUM;
INSERT INTO users (user_id, name)
VALUES (user123, 'Alice');
// What happens:
// 1. Coordinator in nearest DC waits for quorum in that DC only
// 2. Returns success immediately (no remote DC latency)
// 3. Remote DCs get data asynchronously
Local Quorum Reads
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM users WHERE user_id = user123;
// Reads from replicas in local DC only
// Fast because no remote DC latency
// Might be slightly stale (remote DC writes pending)
Global Consistency (Across DCs)
-- Ensure a majority of all replicas, across every DC, acknowledge
CONSISTENCY QUORUM; -- Global QUORUM across all DCs
INSERT INTO critical_data (id, value)
VALUES (data123, 'important');
// Much slower, but guarantees consistency across all regions
// Use only for critical data
Multi-DC Challenges
Network Latency:
DC_WEST (US): Coordinator at DC_WEST, replicate to:
- DC_EAST (US): 20ms away
- DC_EU: 100ms away
- DC_ASIA: 200ms away
Write with QUORUM (RF 3+3+2 = 8 replicas, quorum = 5): must wait for remote replicas = ~100-200ms latency
Write with LOCAL_QUORUM (2 of 3 local replicas): ~5-20ms latency
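The latency difference falls straight out of the quorum arithmetic. A sketch (DC names and replication factors taken from the keyspace example above):

```python
def quorum(replicas: int) -> int:
    """A quorum is a majority of replicas: floor(n / 2) + 1."""
    return replicas // 2 + 1

rf = {"DC_US_EAST": 3, "DC_EU_WEST": 3, "DC_ASIA": 2}

# Global QUORUM counts replicas in every DC: 8 total -> 5 acks needed,
# so at least one remote DC must respond (cross-region round trip).
global_quorum = quorum(sum(rf.values()))  # 5

# LOCAL_QUORUM counts only the coordinator's DC: 3 replicas -> 2 acks,
# all within the local network.
local_quorum = quorum(rf["DC_US_EAST"])  # 2
```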
Conflict Resolution:
Scenario: Concurrent writes from different DCs
DC_WEST: Update user name to 'Alice' at T=1000
DC_EAST: Update user name to 'Bob' at T=1001
Cassandra uses timestamp tiebreak:
- T=1001 > T=1000
- 'Bob' wins
- Both DCs eventually converge to 'Bob'
But: Race conditions possible, last-write-wins may not be correct
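Last-write-wins can be sketched in a few lines (the timestamps are the write timestamps Cassandra attaches to every cell; on an exact tie, Cassandra compares the values themselves so all replicas still converge deterministically):

```python
def last_write_wins(a: tuple[int, str], b: tuple[int, str]) -> str:
    """Resolve two versions of a cell: higher write timestamp wins;
    an exact tie falls back to comparing the values themselves."""
    (ts_a, val_a), (ts_b, val_b) = a, b
    if ts_a != ts_b:
        return val_a if ts_a > ts_b else val_b
    return max(val_a, val_b)

# The scenario above: 'Bob' (T=1001) beats 'Alice' (T=1000)
winner = last_write_wins((1000, "Alice"), (1001, "Bob"))
```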
Security
Authentication
Definition: Mechanism to verify user identity before allowing access.
Example:
Cassandra native authentication:
- Default: No auth (disabled)
- Enable: require username/password
- Connect: cqlsh -u cassandra -p cassandra
Authorization
Definition: Mechanism to control what authenticated users can do (permissions).
Example:
-- Create user (CREATE ROLE is the modern equivalent)
CREATE USER alice WITH PASSWORD 'secret123';
-- Grant permissions
GRANT SELECT ON app_data.users TO alice;
GRANT MODIFY ON app_data.users TO alice;
GRANT ALL ON app_data TO admin_user;
-- Alice can SELECT, INSERT, UPDATE, and DELETE on app_data.users
-- (MODIFY covers INSERT, UPDATE, DELETE, and TRUNCATE)
// Alice cannot read or write other keyspaces
// Alice cannot change the schema
Encryption
Encryption in Transit
Client → Cassandra: SSL/TLS encryption
Prevents eavesdropping on network
Default: Disabled (enable for security)
Performance: ~5-10% overhead
Encryption at Rest
Data stored on disk: encrypted
Requires key management (external KMS)
Protects against stolen disks
Default: Disabled
Performance: ~10-20% overhead
Best Practices
✅ Enable authentication in production
✅ Use strong passwords (20+ chars, mixed case/numbers/symbols)
✅ Enable encryption in transit for sensitive data
✅ Enable encryption at rest for compliance
✅ Apply security updates and patches regularly
✅ Monitor audit logs
✅ Least privilege: grant minimal permissions
Monitoring & Repair
nodetool
Definition: A command-line utility for monitoring and managing Cassandra clusters.
Common Commands:
# Check node status
nodetool status
# Check data repair status
nodetool repair keyspace_name
# View table stats (cfstats on older versions)
nodetool tablestats keyspace_name.table_name
# Flush memtable to disk
nodetool flush
# Check compaction progress
nodetool compactionstats
# View GC info
nodetool gcstats
# Check ring/token distribution
nodetool ring
Metrics to Monitor
| Metric | What it tells | Alert threshold |
|---|---|---|
| Read latency | Query response time | > 100ms |
| Write latency | Insert response time | > 50ms |
| GC pause time | JVM garbage collection | > 1s |
| Compaction pending | SSTable merging queue | > 10 SSTables |
| Tombstone ratio | Tombstone count / total rows | > 50% |
| Disk usage | Data + overhead | > 80% capacity |
| CPU usage | Node CPU load | > 80% |
| Network I/O | Replication traffic | Imbalanced across nodes |
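The table above translates directly into alerting rules. A trivial sketch (threshold values copied from the table; metric names are made up for illustration):

```python
# Alert thresholds from the table above (tune per cluster)
THRESHOLDS = {
    "read_latency_ms": 100,
    "write_latency_ms": 50,
    "gc_pause_ms": 1000,
    "disk_usage_pct": 80,
    "cpu_usage_pct": 80,
}

def breached(metrics: dict[str, float]) -> list[str]:
    """Names of metrics that exceed their alert threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

alerts = breached({"read_latency_ms": 250, "cpu_usage_pct": 40})
```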
Anti-Entropy Repair
Definition: Repair is a background process that synchronizes replicas and fixes data inconsistencies.
How repair works:
1. Merkle trees: Each replica builds a tree of hashes over its data ranges
2. Identify differences: Replicas exchange trees and compare hashes to find mismatched ranges
3. Stream repairs: Out-of-sync data is streamed between replicas
4. Verify: All replicas converge on the same data
Use case: After node failure/recovery, sync all replicas
Triggers: Can be manual (nodetool repair) or auto-schedule
Frequency: Weekly or monthly (before/after maintenance)
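A toy version of the Merkle comparison (real repair hashes token ranges of the ring; here dict keys stand in for rows and buckets stand in for ranges):

```python
import hashlib

def _bucket(key: str, num_ranges: int) -> int:
    # Stable stand-in for "which token range owns this key"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_ranges

def range_hashes(data: dict[str, str], num_ranges: int = 4) -> list[str]:
    """One hash per range -- the leaf level of a Merkle tree."""
    hashes = [hashlib.sha256() for _ in range(num_ranges)]
    for key in sorted(data):
        hashes[_bucket(key, num_ranges)].update(f"{key}={data[key]};".encode())
    return [h.hexdigest() for h in hashes]

def diff_ranges(a: dict[str, str], b: dict[str, str]) -> list[int]:
    """Ranges whose hashes disagree: only these need to be streamed."""
    return [i for i, (ha, hb) in enumerate(zip(range_hashes(a), range_hashes(b)))
            if ha != hb]

replica_a = {"user:1": "Alice", "user:2": "Bob"}
replica_b = {"user:1": "Alice", "user:2": "STALE"}
# diff_ranges(replica_a, replica_b) pinpoints the single out-of-sync range
```

This is why repair is cheap when replicas mostly agree: comparing hashes is fast, and only the mismatched ranges are streamed.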
Types of Repair:
| Type | Scope | Time | Use Case |
|---|---|---|---|
| Full | All partitions | 1-2 hours | After node failure |
| Incremental | Since last repair | 10-30 min | Weekly routine |
| Parallel | Any (execution mode, not a scope) | Faster | Clusters that can absorb the extra load |
Performance Tuning
Key Tuning Parameters
JVM Heap Size
# Default: 1/4 of system RAM
# Cassandra typically: 8GB - 16GB
# Set in cassandra-env.sh:
#MAX_HEAP_SIZE="8G"
#HEAP_NEWSIZE="2G"
# Rule of thumb:
// < 100GB data: 4-8GB heap
// 100GB - 1TB: 8-16GB heap
// > 1TB: 16-32GB heap
// (Monitor GC pressure)
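The rule of thumb above, written as a function (a starting point only; the real answer comes from watching GC metrics):

```python
def recommended_heap_gb(data_gb: float) -> tuple[int, int]:
    """Heap range (min_gb, max_gb) for a given on-disk data size,
    per the rule of thumb above."""
    if data_gb < 100:
        return (4, 8)
    if data_gb <= 1024:  # up to 1 TB
        return (8, 16)
    return (16, 32)
```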
Memtable Size
# Controls in-memory write buffer
# Default: 1/4 of heap (if 8GB heap → 2GB memtable)
# Larger memtable:
// ✓ Fewer, larger disk flushes
// ✓ Better write throughput
// ✗ Longer GC pauses
# Smaller memtable:
// ✓ Smaller heap footprint, lower GC pause spikes
// ✗ More frequent disk flushes
// ✗ More SSTable files to compact
Bloom Filter Size
# Speeds up read lookups ("might this SSTable contain the row?")
# Smaller filter = less memory, more false positives (wasted disk reads)
# Larger filter = more memory, fewer false positives
# Tuned per table via bloom_filter_fp_chance:
// Many random reads: lower fp chance (larger filter)
// Mostly sequential reads: higher fp chance (smaller filter)
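The memory/false-positive trade-off follows the standard Bloom filter math: with the optimal number of hash functions k = (m/n)·ln 2, the false-positive rate is roughly 0.5^k. A sketch:

```python
import math

def bloom_fp_rate(bits_per_row: float) -> float:
    """Approximate Bloom filter false-positive rate with the optimal
    hash count k = bits_per_row * ln(2): p ~= 0.5 ** k."""
    k = bits_per_row * math.log(2)
    return 0.5 ** k

# Doubling the bits per row squares away most false positives:
p_small = bloom_fp_rate(5)   # ~9% of lookups hit disk needlessly
p_large = bloom_fp_rate(10)  # ~0.8%
```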
Common Issues & Solutions
Issue 1: High Read Latency
Symptoms:
- Queries taking 500ms+ (should be 10-50ms)
- Intermittent slowness (GC spikes?)
Causes:
1. Hot partitions (uneven data distribution)
2. High tombstone ratio (many deleted rows)
3. Bloom filter misses → disk seeks
4. GC pauses (JVM garbage collection)
Solutions:
1. Shard hot partitions (add composite key)
2. Reduce gc_grace_seconds or compact more aggressively
3. Tune bloom filter, cache settings
4. Increase heap, reduce GC pauses
Issue 2: Out of Memory / High GC
Symptoms:
- "GC overhead limit exceeded" errors
- Frequent Full GC pauses (5-30 seconds)
- Node becomes unresponsive during GC
Causes:
1. Heap too small for the workload
2. Bloom filters/caches too large
3. Memory leaks in the application
4. Compaction building large SSTables
Solutions:
1. Increase heap size (8GB → 16GB)
2. Reduce cache sizes
3. Split compaction into smaller tasks
4. Monitor and profile memory usage
Issue 3: Repair Timeouts
Symptoms:
- nodetool repair hangs or fails
- Repairs take hours for small clusters
Causes:
1. Network issues between nodes
2. Slow disk I/O (repair plus compaction saturates the disk)
3. Cassandra version bugs
Solutions:
1. Run repair during low-traffic window
2. Use -pr (primary range) flag: each node repairs only the token ranges it owns
3. Increase timeout settings
4. Consider incremental repair (faster)
Example Diagram
graph TD
    A["Advanced Topics"] --> B["Hot Partitions"]
    B --> C["Composite Partition Key"]
    B --> D["Time Buckets"]
    B --> E["Better PK Selection"]
    A --> F["Multi-DC Setup"]
    F --> G["NetworkTopologyStrategy"]
    G --> H["Local Quorum Writes"]
    G --> I["Global Quorum Critical Data"]
    A --> J["Security & Monitoring"]
    J --> K["Auth + Encryption"]
    J --> L["nodetool repair"]
    J --> M["Metrics & Alerts"]
    N["Performance"] --> O["Tune JVM Heap"]
    N --> P["Monitor Latency"]
    N --> Q["Compaction Strategy"]
Production Checklist
Before deploying to production:
- [ ] Capacity planning: Estimate data growth, disk needs, and replication factor (RF)
- [ ] Consistency levels: Chosen based on durability/latency requirements
- [ ] Backup strategy: Regular backups, test restoration
- [ ] Monitoring: Set up alerts for latency, GC, disk, tombstones
- [ ] Authentication: Enabled, strong passwords
- [ ] Encryption: In transit for sensitive data
- [ ] Repair schedule: Weekly or monthly incremental repair
- [ ] Compaction strategy: Selected based on workload (size-tiered, leveled, etc.)
- [ ] TTL policies: Defined for temporary data
- [ ] Testing: Load testing, failure scenario testing
- [ ] Documentation: Runbooks, emergency procedures
- [ ] Training: Team familiar with nodetool, troubleshooting
Summary
- Hot Partitions: Design composite partition keys or use sharding; monitor node balance.
- Multi-DC: Use NetworkTopologyStrategy and LOCAL_QUORUM for fast local operations.
- Security: Enable auth, enforce strong passwords, encrypt transit and rest.
- Monitoring: Track latency, GC, compaction, tombstones; use nodetool for cluster health.
- Repair: Run weekly/monthly anti-entropy repair to maintain data consistency.
- Tuning: Adjust heap, memtable, cache settings based on workload and monitoring data.