08 · Tools Ecosystem: BigQuery, Snowflake, PostgreSQL, Cassandra, MongoDB, Databricks

Level: Intermediate to Advanced · Time to read: 18 min
After reading: You'll understand the trade-offs, use cases, and architectural fit of major data warehouse and database platforms.


Platform Comparison Matrix

| Feature       | BigQuery  | Snowflake     | Databricks | PostgreSQL   | Cassandra           | MongoDB          |
|---------------|-----------|---------------|------------|--------------|---------------------|------------------|
| Type          | OLAP/DWH  | OLAP/DWH      | OLAP + ML  | OLTP         | NoSQL (wide-column) | NoSQL (document) |
| Scaling       | Auto      | Manual        | Auto       | Vertical     | Horizontal          | Horizontal       |
| Storage       | GCS (Colossus) | S3/Azure/GCS | Delta Lake on object storage | Local disk | Distributed | Local/sharded |
| Best for      | Analytics | Enterprise DWH | ML + analytics | Transactional | Wide rows, heavy writes | Flexible schema |
| Starting cost | Low       | High          | Medium     | Free         | Free                | Free             |
| Latency       | Seconds   | Seconds       | Seconds to minutes | Milliseconds | Milliseconds | Milliseconds |

BigQuery (Google Cloud)

Architecture

BigQuery = Columnar Storage + Massive Parallelism

┌─────────────────────────────────────────┐
│  Query (SQL via Web UI / API)           │
├─────────────────────────────────────────┤
│  Dremel (Query Engine)                  │
│  └─ Parallel execution across 10K nodes │
├─────────────────────────────────────────┤
│  Colossus (Distributed Storage)         │
│  └─ Columnar format, 10-100x compression │
└─────────────────────────────────────────┘

Pricing: $6.25 per TB scanned (on-demand)
Cost control: set a maximum-bytes-billed limit per query and custom quotas per project
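
Beyond quotas, the main SQL-level cost lever is table layout: partitioning and clustering limit the bytes a query scans. A minimal sketch (the schema is illustrative):

-- Partitioned + clustered table: date filters prune partitions,
-- so queries are billed only for the data they actually touch.
CREATE TABLE dataset.fact_sales (
  sale_date         TIMESTAMP,
  product_category  STRING,
  customer_id       STRING,
  amount            NUMERIC
)
PARTITION BY DATE(sale_date)   -- one partition per day
CLUSTER BY product_category;   -- co-locate rows within each partition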

Strengths

Auto-scaling: No cluster management (serverless)
Cheap at scale: $6.25/TB on-demand is competitive
Nested/JSON support: Can store and query nested objects (STRUCT, ARRAY, JSON)
Built-in ML (BigQuery ML): Train and run models with SQL (see the sketch after this list)
Compliance: SOC 2 and HIPAA certified; supports GDPR compliance
No cold starts: Serverless slots are available without cluster spin-up
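
A hedged BigQuery ML sketch, reusing the illustrative fact_sales schema above: one statement trains a model, a second scores with it.

-- Train a linear regression in SQL; 'amount' is the label column.
CREATE OR REPLACE MODEL dataset.revenue_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['amount']) AS
SELECT product_category, customer_id, amount
FROM dataset.fact_sales;

-- Score rows with ML.PREDICT.
SELECT *
FROM ML.PREDICT(
  MODEL dataset.revenue_model,
  (SELECT product_category, customer_id FROM dataset.fact_sales)
);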

Weaknesses

Geospatial: GIS functions exist but trail PostGIS in depth
Real-time ingest: Streaming inserts cost extra, and streamed rows can lag before becoming queryable
Cross-region: Data replication across regions must be configured manually

Use Cases

  • ✅ Fast-growing analytics (auto-scales)
  • ✅ Cost-conscious (pay-per-query, with a 10 MB minimum billed per table scanned)
  • ✅ GCP-native (integrates with Dataflow, Cloud Storage)
  • ❌ Real-time operational queries

Example: Loading & Querying

-- Load from GCS Parquet
LOAD DATA INTO dataset.fact_sales
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/sales/*.parquet']
);

-- Query with auto-scaling (work is parallelized across available slots)
SELECT 
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as revenue,
  COUNT(DISTINCT customer_id) as unique_customers
FROM dataset.fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100;

Snowflake

Architecture

Snowflake = Shared Cloud Storage + Separate Compute

┌─────────────────────────────┐
│  Compute Clusters           │
│  ├─ Cluster 1 (10 nodes)    │
│  ├─ Cluster 2 (5 nodes)     │
│  └─ Cluster N (as needed)   │
├─────────────────────────────┤
│  Cloud Storage              │
│  (S3 / Azure Blob / GCS)    │
│  └─ Shared across clusters  │
└─────────────────────────────┘

Pricing: $1-4 per credit (an X-Small warehouse consumes 1 credit per hour)
Scaling: Warehouses are sized manually; multi-cluster auto-scale handles concurrency

Strengths

Separation of compute & storage: Scale each independently
Time-travel: Query historical snapshots; recover with UNDROP (see the sketch after this list)
Seamless scaling: Add clusters for concurrent workloads
Zero-copy cloning: Clone tables as a metadata-only operation (no data copied)
Data sharing: Share live data across accounts (Snowflake Marketplace)
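
A minimal sketch of time travel and cloning, assuming a hypothetical orders table:

-- Time travel: query the table as it looked an hour ago.
SELECT COUNT(*) FROM orders AT (OFFSET => -3600);

-- Zero-copy clone: metadata-only, no data is copied.
CREATE TABLE orders_dev CLONE orders;

-- Recover a dropped table within the retention window.
UNDROP TABLE orders;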

Weaknesses

Higher entry cost: Roughly $1,000+/month in practice; always-on warehouses burn credits even when lightly used
Manual scaling: You must predict workloads and size warehouses accordingly
Egress costs: Moving data out of the platform or region can be expensive

Use Cases

  • ✅ Enterprise data warehouses (reliable, scalable)
  • ✅ Self-service BI (multiple clusters per department)
  • ✅ Data sharing (Snowflake marketplace)
  • ❌ Cost-conscious startups

Databricks (Lakehouse)

Architecture

Databricks = Delta Lake (Parquet) + Spark + Notebooks

┌────────────────────────────────┐
│  Workspace (Notebooks + Jobs)  │
├────────────────────────────────┤
│  Spark Clusters                │
│  ├─ Auto-scaling clusters      │
│  └─ Run SQL, Python, Scala, R  │
├────────────────────────────────┤
│  Delta Lake (S3 / Azure)       │
│  └─ ACID transactions on files │
└────────────────────────────────┘

Pricing: $0.30-0.50 per DBU (plus the underlying cloud VM cost)
Best for: Analytics + ML + Data Science
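
The "ACID on files" guarantee is easiest to see in an upsert: a MERGE against a Delta table either fully commits or fully rolls back, even though the data lives in Parquet files on object storage. A sketch with illustrative table names:

-- Atomic upsert into a Delta table.
MERGE INTO sales AS t
USING sales_updates AS s
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (sale_id, sale_date, amount)
  VALUES (s.sale_id, s.sale_date, s.amount);

-- Delta time travel: read an earlier snapshot by version number.
SELECT * FROM sales VERSION AS OF 12;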

Strengths

Unified storage: Single Delta Lake for all workloads
ACID on data lakes: Transactional guarantee on files
ML ecosystem: MLflow, feature store built-in
Flexible: SQL, Python, Scala, R in same notebook
Cost-effective: Cheaper than Snowflake at scale

Weaknesses

Operationally complex: Need to manage clusters
Multi-language: Can be harder to standardize
Data governance: Less mature than Snowflake/BigQuery

Use Cases

  • ✅ ML/Data Science workflows (built-in notebooks)
  • ✅ Multi-language teams (SQL + Python)
  • ✅ Cost-conscious at scale (cheaper DBUs)
  • ❌ Non-technical BI teams (prefer Snowflake SQL)

PostgreSQL (OLTP Reference)

Why Include PostgreSQL?

PostgreSQL is the "reference OLTP" — interviews often ask: "Why not just use PostgreSQL?"

Strengths

Transactions: Full ACID, FK constraints
Reliability: Battle-tested, used by Fortune 500
Extensibility: Custom data types, functions
JSON support: JSONB for semi-structured data (see the sketch after this list)
Free: No licensing costs
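
A minimal JSONB sketch (table and keys are illustrative): semi-structured payloads stored relationally, with a GIN index to keep containment queries fast.

-- Semi-structured payloads inside a relational table.
CREATE TABLE api_events (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL
);

-- GIN index accelerates @> containment queries.
CREATE INDEX idx_api_events_payload ON api_events USING GIN (payload);

-- ->> extracts a field as text; @> tests containment.
SELECT payload->>'user_id' AS user_id
FROM api_events
WHERE payload @> '{"event_type": "signup"}';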

Weaknesses (For Analytics)

No columnar: Row-based storage, slow aggregations
Scaling: Read replicas for reads; no native distributed writes (extensions like Citus add sharding)
Cost: Vertical scaling gets expensive
Contention: OLTP & OLAP on same system conflicts

When It Works For Analytics

-- PostgreSQL: Great for operational reporting
SELECT 
  DATE(order_date) as date,
  COUNT(*) as orders,
  SUM(total) as revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1 DESC;

-- Fast: only 30 days of data, index-friendly
-- Struggles with 5 years of data (too many rows to scan, even with indexes)
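
Whether that query stays "index-friendly" depends on the index existing. A sketch, assuming the orders table above; the BRIN variant is a common low-overhead alternative for large, append-mostly tables:

-- B-tree index: precise range scans on order_date.
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- BRIN alternative: a far smaller index when rows arrive in date order.
CREATE INDEX idx_orders_order_date_brin ON orders USING BRIN (order_date);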

Interview Takeaway

"PostgreSQL is excellent for OLTP and small analytical queries. At scale, separate an OLAP system (BigQuery/Snowflake) from your operational database."


Cassandra (Wide-Column NoSQL)

Architecture: Distributed, Eventually Consistent

Cassandra = Distributed Hash Table

Ring of Nodes:
├─ Node 1: Keys A-D
├─ Node 2: Keys E-H  (replicated from Node 1)
├─ Node 3: Keys I-L  (replicated from Node 2)
└─ Node 4: Keys M-Z  (replicated from Node 3)

Write: "Put key=customer_15 value=..."
  → Hash(customer_15) → Node 2
  → Replicate to Node 3 and 4
  → Acknowledge once the consistency level is met (e.g., ONE)

Consistency: Tunable per request; eventually consistent by default (AP in CAP terms)

Schema: Wide Rows

-- Cassandra: Wide-row design
CREATE TABLE user_events (
  user_id UUID,
  event_timestamp BIGINT,  -- epoch milliseconds; drives sort order
  event_type TEXT,
  event_details MAP<TEXT, TEXT>,
  PRIMARY KEY (user_id, event_timestamp)  -- Partition + Clustering
) WITH CLUSTERING ORDER BY (event_timestamp DESC);

-- Single-partition query (fast)
SELECT * FROM user_events
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
LIMIT 100;

-- Result: last 100 events for that user, newest first (milliseconds)

Strengths

Horizontal scale: Add nodes, cluster grows
High availability: No single point of failure
Writes are fast: Optimized for write-heavy
Time-series friendly: Clustering keys keep events time-ordered (see the sketch after this list)
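
A CQL sketch against the user_events table above (the UUID and timestamps are illustrative): TTLs expire old events without explicit deletes, and range scans stay within one partition.

-- Write with a 30-day TTL; the row expires automatically.
INSERT INTO user_events (user_id, event_timestamp, event_type)
VALUES (123e4567-e89b-12d3-a456-426614174000, 1704067200000, 'login')
USING TTL 2592000;

-- Single-partition range scan, newest first (clustering order).
SELECT event_type, event_timestamp
FROM user_events
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
  AND event_timestamp >= 1704000000000;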

Weaknesses

No complex JOINs: Data must be denormalized per query pattern
Eventual consistency: Reads at low consistency levels can return stale values
Limited aggregations: GROUP BY works only over partition/clustering keys
Not for analytics: No columnar storage or compression-friendly scans

Use Cases

  • ✅ Time-series (metrics, events)
  • ✅ High-write IoT data
  • ✅ 24/7 availability (no downtime)
  • ❌ Analytics (use BigQuery instead)
  • ❌ Complex queries

MongoDB (Document NoSQL)

Schema: Flexible Documents

// MongoDB: Flexible schema
db.customers.insertOne({
  _id: ObjectId(),
  customer_id: 12345,
  name: "Alice",
  email: "alice@example.com",
  address: {
    street: "123 Main",
    city: "New York",
    zip: "10001"
  },
  orders: [
    { order_id: 1, date: "2024-01-01", total: 99.99 },
    { order_id: 2, date: "2024-01-15", total: 149.99 }
  ]
});

// Query with aggregation pipeline
db.customers.aggregate([
  { $match: { "address.city": "New York" } },
  { $unwind: "$orders" },
  { $group: { _id: "$_id", total_spent: { $sum: "$orders.total" } } },
  { $sort: { total_spent: -1 } }
]);

Strengths

Flexible schema: No rigid table definition
Nested documents: Natural representation of hierarchies
Horizontal scale: Sharding across servers
Developer friendly: JSON-like queries

Weaknesses

Consistency: Multi-document ACID transactions exist (4.0+) but add overhead; single-document atomicity is the natural unit
Duplicate data: JOINs are limited ($lookup), so embedding and denormalization are the norm
Storage: Documents are more verbose than columnar formats (worse compression)
Analytics: Slow for large GROUP BY-style aggregations

Use Cases

  • ✅ Content management (flexible schema)
  • ✅ Real-time applications (mobile apps)
  • ✅ E-commerce (product catalogs)
  • ❌ Ledger-style financial systems (cross-document ACID is possible but PostgreSQL is simpler)
  • ❌ Analytics (use BigQuery)

Decision Tree: Which Platform?

graph TD
    A["What's your primary use case?"]

    A -->|"Analytics & BI"| B["> 1 TB data?"]
    A -->|"Real-time Ops"| C["Distributed needed?"]
    A -->|"Flexible Schema"| D["Analytics required?"]
    A -->|"Time-Series"| E["Write-heavy?"]

    B -->|"Yes"| F["BigQuery<br/>or Snowflake"]
    B -->|"No"| G["PostgreSQL+BI"]

    C -->|"Yes"| H["MongoDB<br/>or Cassandra"]
    C -->|"No"| I["PostgreSQL"]

    D -->|"Yes"| J["MongoDB<br/>(small scale)"]
    D -->|"No"| K["PostgreSQL"]

    E -->|"Yes"| L["Cassandra"]
    E -->|"No"| M["PostgreSQL<br/>or MongoDB"]

Deep-Dives

Deep Dive: BigQuery — BigQuery-specific: nested queries, ML, cost optimization.

Deep Dive: PostgreSQL — OLTP patterns, replication, analytical extensions (Citus).

Deep Dive: Cassandra — Distributed architecture, token ring, consistency models.

Deep Dive: MongoDB — Sharding, transactions, aggregation pipeline.

Deep Dive: Databricks — Delta Lake, ML workflows, cost optimization.


Key Takeaways

BigQuery for analytics with minimal ops, Snowflake for enterprise control.

PostgreSQL for OLTP, Cassandra for distributed writes.

MongoDB for flexible schemas, Databricks for ML + Analytics.

Separate OLTP from OLAP — use different systems.

Choose based on access pattern, not just data volume.


Practice Questions

  1. When would you choose Snowflake over BigQuery?
  2. Why can't you use PostgreSQL for a 10 TB analytics warehouse?
  3. What's the difference between Cassandra and MongoDB's failure models?
  4. Design a data warehouse using both PostgreSQL (OLTP) and BigQuery (OLAP).
  5. When would Databricks be better than BigQuery?