08 · Tools Ecosystem: BigQuery, Snowflake, PostgreSQL, Cassandra, MongoDB, Databricks

Level: Intermediate to Advanced · Time to read: 18 min
After reading: You'll understand the trade-offs, use cases, and architectural fit of major data warehouse and database platforms.


Platform Comparison Matrix

| Feature       | BigQuery  | Snowflake     | Databricks | PostgreSQL   | Cassandra           | MongoDB          |
|---------------|-----------|---------------|------------|--------------|---------------------|------------------|
| Type          | OLAP/DWH  | OLAP/DWH      | OLAP + ML  | OLTP         | NoSQL (wide-column) | NoSQL (document) |
| Scaling       | Auto      | Manual        | Auto       | Vertical     | Horizontal          | Horizontal       |
| Storage       | GCS (Colossus) | S3/Azure/GCS | Delta Lake on object storage | Local disk | Distributed | Local/sharded |
| Best for      | Analytics | Enterprise DWH | ML + analytics | Transactional | Wide rows, heavy writes | Flexible schema |
| Starting cost | Low       | High          | Medium     | Free         | Free                | Free             |
| Latency       | Seconds   | Seconds       | Seconds to minutes | Milliseconds | Milliseconds | Milliseconds |

BigQuery (Google Cloud)

Architecture

BigQuery = Columnar Storage + Massive Parallelism

┌─────────────────────────────────────────┐
│  Query (SQL via Web UI / API)           │
├─────────────────────────────────────────┤
│  Dremel (Query Engine)                  │
│  └─ Parallel execution across 10K nodes │
├─────────────────────────────────────────┤
│  Colossus (Distributed Storage)         │
│  └─ Columnar format, 10-100x compression │
└─────────────────────────────────────────┘

Pricing: $6.25 per TB scanned (on-demand)
Cost control: set a maximum-bytes-billed limit per query and custom quotas per project
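
Beyond quotas, the main SQL-level cost lever is table layout: partitioning and clustering limit the bytes a query scans. A minimal sketch (the schema is illustrative):

-- Partitioned + clustered table: date filters prune partitions,
-- so queries are billed only for the data they actually touch.
CREATE TABLE dataset.fact_sales (
  sale_date         TIMESTAMP,
  product_category  STRING,
  customer_id       STRING,
  amount            NUMERIC
)
PARTITION BY DATE(sale_date)   -- one partition per day
CLUSTER BY product_category;   -- co-locate rows within each partition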

Strengths

Auto-scaling: No cluster management (serverless)
Cheap at scale: $6.25/TB on-demand is competitive
Nested/JSON support: Can store and query nested objects (STRUCT, ARRAY, JSON)
Built-in ML (BigQuery ML): Train and run models with SQL (see the sketch after this list)
Compliance: SOC 2 and HIPAA certified; supports GDPR compliance
No cold starts: Serverless slots are available without cluster spin-up
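
A hedged BigQuery ML sketch, reusing the illustrative fact_sales schema above: one statement trains a model, a second scores with it.

-- Train a linear regression in SQL; 'amount' is the label column.
CREATE OR REPLACE MODEL dataset.revenue_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['amount']) AS
SELECT product_category, customer_id, amount
FROM dataset.fact_sales;

-- Score rows with ML.PREDICT.
SELECT *
FROM ML.PREDICT(
  MODEL dataset.revenue_model,
  (SELECT product_category, customer_id FROM dataset.fact_sales)
);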

Weaknesses

Geospatial: GIS functions exist but trail PostGIS in depth
Real-time ingest: Streaming inserts cost extra, and streamed rows can lag before becoming queryable
Cross-region: Data replication across regions must be configured manually

Use Cases

  • ✅ Fast-growing analytics (auto-scales)
  • ✅ Cost-conscious (pay-per-query, with a 10 MB minimum billed per table scanned)
  • ✅ GCP-native (integrates with Dataflow, Cloud Storage)
  • ❌ Real-time operational queries

Example: Loading & Querying

-- Load from GCS Parquet
LOAD DATA INTO dataset.fact_sales
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/sales/*.parquet']
);

-- Query with auto-scaling (work is parallelized across available slots)
SELECT 
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as revenue,
  COUNT(DISTINCT customer_id) as unique_customers
FROM dataset.fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100;

Snowflake

Architecture

Snowflake = Shared Cloud Storage + Separate Compute

┌─────────────────────────────┐
│  Compute Clusters           │
│  ├─ Cluster 1 (10 nodes)    │
│  ├─ Cluster 2 (5 nodes)     │
│  └─ Cluster N (as needed)   │
├─────────────────────────────┤
│  Cloud Storage              │
│  (S3 / Azure Blob / GCS)    │
│  └─ Shared across clusters  │
└─────────────────────────────┘

Pricing: $1-4 per credit (an X-Small warehouse consumes 1 credit per hour)
Scaling: Warehouses are sized manually; multi-cluster auto-scale handles concurrency

Strengths

Separation of compute & storage: Scale each independently
Time-travel: Query historical snapshots; recover with UNDROP (see the sketch after this list)
Seamless scaling: Add clusters for concurrent workloads
Zero-copy cloning: Clone tables as a metadata-only operation (no data copied)
Data sharing: Share live data across accounts (Snowflake Marketplace)
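
A minimal sketch of time travel and cloning, assuming a hypothetical orders table:

-- Time travel: query the table as it looked an hour ago.
SELECT COUNT(*) FROM orders AT (OFFSET => -3600);

-- Zero-copy clone: metadata-only, no data is copied.
CREATE TABLE orders_dev CLONE orders;

-- Recover a dropped table within the retention window.
UNDROP TABLE orders;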

Weaknesses

Higher entry cost: Roughly $1,000+/month in practice; always-on warehouses burn credits even when lightly used
Manual scaling: You must predict workloads and size warehouses accordingly
Egress costs: Moving data out of the platform or region can be expensive

Use Cases

  • ✅ Enterprise data warehouses (reliable, scalable)
  • ✅ Self-service BI (multiple clusters per department)
  • ✅ Data sharing (Snowflake marketplace)
  • ❌ Cost-conscious startups

Databricks (Lakehouse)

Architecture

Databricks = Delta Lake (Parquet) + Spark + Notebooks

┌────────────────────────────────┐
│  Workspace (Notebooks + Jobs)  │
├────────────────────────────────┤
│  Spark Clusters                │
│  ├─ Auto-scaling clusters      │
│  └─ Run SQL, Python, Scala, R  │
├────────────────────────────────┤
│  Delta Lake (S3 / Azure)       │
│  └─ ACID transactions on files │
└────────────────────────────────┘

Pricing: $0.30-0.50 per DBU (plus the underlying cloud VM cost)
Best for: Analytics + ML + Data Science
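
The "ACID on files" guarantee is easiest to see in an upsert: a MERGE against a Delta table either fully commits or fully rolls back, even though the data lives in Parquet files on object storage. A sketch with illustrative table names:

-- Atomic upsert into a Delta table.
MERGE INTO sales AS t
USING sales_updates AS s
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (sale_id, sale_date, amount)
  VALUES (s.sale_id, s.sale_date, s.amount);

-- Delta time travel: read an earlier snapshot by version number.
SELECT * FROM sales VERSION AS OF 12;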

Strengths

Unified storage: Single Delta Lake for all workloads
ACID on data lakes: Transactional guarantee on files
ML ecosystem: MLflow, feature store built-in
Flexible: SQL, Python, Scala, R in same notebook
Cost-effective: Cheaper than Snowflake at scale

Weaknesses

Operationally complex: Need to manage clusters
Multi-language: Can be harder to standardize
Data governance: Less mature than Snowflake/BigQuery

Use Cases

  • ✅ ML/Data Science workflows (built-in notebooks)
  • ✅ Multi-language teams (SQL + Python)
  • ✅ Cost-conscious at scale (cheaper DBUs)
  • ❌ Non-technical BI teams (prefer Snowflake SQL)

PostgreSQL (OLTP Reference)

Why Include PostgreSQL?

PostgreSQL is the "reference OLTP" — interviews often ask: "Why not just use PostgreSQL?"

Strengths

Transactions: Full ACID, FK constraints
Reliability: Battle-tested, used by Fortune 500
Extensibility: Custom data types, functions
JSON support: JSONB for semi-structured data (see the sketch after this list)
Free: No licensing costs
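
A minimal JSONB sketch (table and keys are illustrative): semi-structured payloads stored relationally, with a GIN index to keep containment queries fast.

-- Semi-structured payloads inside a relational table.
CREATE TABLE api_events (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL
);

-- GIN index accelerates @> containment queries.
CREATE INDEX idx_api_events_payload ON api_events USING GIN (payload);

-- ->> extracts a field as text; @> tests containment.
SELECT payload->>'user_id' AS user_id
FROM api_events
WHERE payload @> '{"event_type": "signup"}';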

Weaknesses (For Analytics)

No columnar: Row-based storage, slow aggregations
Scaling: Read replicas for reads; no native distributed writes (extensions like Citus add sharding)
Cost: Vertical scaling gets expensive
Contention: OLTP & OLAP on same system conflicts

When It Works For Analytics

-- PostgreSQL: Great for operational reporting
SELECT 
  DATE(order_date) as date,
  COUNT(*) as orders,
  SUM(total) as revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1 DESC;

-- Fast: only 30 days of data, index-friendly
-- Struggles with 5 years of data (too many rows to scan, even with indexes)
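
Whether that query stays "index-friendly" depends on the index existing. A sketch, assuming the orders table above; the BRIN variant is a common low-overhead alternative for large, append-mostly tables:

-- B-tree index: precise range scans on order_date.
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- BRIN alternative: a far smaller index when rows arrive in date order.
CREATE INDEX idx_orders_order_date_brin ON orders USING BRIN (order_date);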

Interview Takeaway

"PostgreSQL is excellent for OLTP and small analytical queries. At scale, separate an OLAP system (BigQuery/Snowflake) from your operational database."


Cassandra (Wide-Column NoSQL)

Architecture: Distributed, Eventually Consistent

Cassandra = Distributed Hash Table

Ring of Nodes:
├─ Node 1: Keys A-D
├─ Node 2: Keys E-H  (replicated from Node 1)
├─ Node 3: Keys I-L  (replicated from Node 2)
└─ Node 4: Keys M-Z  (replicated from Node 3)

Write: "Put key=customer_15 value=..."
  → Hash(customer_15) → Node 2
  → Replicate to Node 3 and 4
  → Acknowledge once the consistency level is met (e.g., ONE)

Consistency: Tunable per request; eventually consistent by default (AP in CAP terms)

Schema: Wide Rows

-- Cassandra: Wide-row design
CREATE TABLE user_events (
  user_id UUID,
  event_timestamp BIGINT,  -- epoch milliseconds; drives sort order
  event_type TEXT,
  event_details MAP<TEXT, TEXT>,
  PRIMARY KEY (user_id, event_timestamp)  -- Partition + Clustering
) WITH CLUSTERING ORDER BY (event_timestamp DESC);

-- Single-partition query (fast)
SELECT * FROM user_events
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
LIMIT 100;

-- Result: last 100 events for that user, newest first (milliseconds)

Strengths

Horizontal scale: Add nodes, cluster grows
High availability: No single point of failure
Writes are fast: Optimized for write-heavy
Time-series friendly: Clustering keys keep events time-ordered (see the sketch after this list)
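
A CQL sketch against the user_events table above (the UUID and timestamps are illustrative): TTLs expire old events without explicit deletes, and range scans stay within one partition.

-- Write with a 30-day TTL; the row expires automatically.
INSERT INTO user_events (user_id, event_timestamp, event_type)
VALUES (123e4567-e89b-12d3-a456-426614174000, 1704067200000, 'login')
USING TTL 2592000;

-- Single-partition range scan, newest first (clustering order).
SELECT event_type, event_timestamp
FROM user_events
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
  AND event_timestamp >= 1704000000000;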

Weaknesses

No complex JOINs: Data must be denormalized per query pattern
Eventual consistency: Reads at low consistency levels can return stale values
Limited aggregations: GROUP BY works only over partition/clustering keys
Not for analytics: No columnar storage or compression-friendly scans

Use Cases

  • ✅ Time-series (metrics, events)
  • ✅ High-write IoT data
  • ✅ 24/7 availability (no downtime)
  • ❌ Analytics (use BigQuery instead)
  • ❌ Complex queries

MongoDB (Document NoSQL)

Schema: Flexible Documents

// MongoDB: Flexible schema
db.customers.insertOne({
  _id: ObjectId(),
  customer_id: 12345,
  name: "Alice",
  email: "alice@example.com",
  address: {
    street: "123 Main",
    city: "New York",
    zip: "10001"
  },
  orders: [
    { order_id: 1, date: "2024-01-01", total: 99.99 },
    { order_id: 2, date: "2024-01-15", total: 149.99 }
  ]
});

// Query with aggregation pipeline
db.customers.aggregate([
  { $match: { "address.city": "New York" } },
  { $unwind: "$orders" },
  { $group: { _id: "$_id", total_spent: { $sum: "$orders.total" } } },
  { $sort: { total_spent: -1 } }
]);

Strengths

Flexible schema: No rigid table definition
Nested documents: Natural representation of hierarchies
Horizontal scale: Sharding across servers
Developer friendly: JSON-like queries

Weaknesses

Consistency: Multi-document ACID transactions exist (4.0+) but add overhead; single-document atomicity is the natural unit
Duplicate data: JOINs are limited ($lookup), so embedding and denormalization are the norm
Storage: Documents are more verbose than columnar formats (worse compression)
Analytics: Slow for large GROUP BY-style aggregations

Use Cases

  • ✅ Content management (flexible schema)
  • ✅ Real-time applications (mobile apps)
  • ✅ E-commerce (product catalogs)
  • ❌ Ledger-style financial systems (cross-document ACID is possible but PostgreSQL is simpler)
  • ❌ Analytics (use BigQuery)

Decision Tree: Which Platform?

graph TD
    A["What's your primary use case?"]

    A -->|"Analytics & BI"| B["> 1 TB data?"]
    A -->|"Real-time Ops"| C["Distributed needed?"]
    A -->|"Flexible Schema"| D["Analytics required?"]
    A -->|"Time-Series"| E["Write-heavy?"]

    B -->|"Yes"| F["BigQuery<br/>or Snowflake"]
    B -->|"No"| G["PostgreSQL+BI"]

    C -->|"Yes"| H["MongoDB<br/>or Cassandra"]
    C -->|"No"| I["PostgreSQL"]

    D -->|"Yes"| J["MongoDB<br/>(small scale)"]
    D -->|"No"| K["PostgreSQL"]

    E -->|"Yes"| L["Cassandra"]
    E -->|"No"| M["PostgreSQL<br/>or MongoDB"]

Deep-Dives

Deep Dive: BigQuery — BigQuery-specific: nested queries, ML, cost optimization.

Deep Dive: PostgreSQL — OLTP patterns, replication, analytical extensions (Citus).

Deep Dive: Cassandra — Distributed architecture, token ring, consistency models.

Deep Dive: MongoDB — Sharding, transactions, aggregation pipeline.

Deep Dive: Databricks — Delta Lake, ML workflows, cost optimization.


Key Takeaways

BigQuery for analytics with minimal ops, Snowflake for enterprise control.

PostgreSQL for OLTP, Cassandra for distributed writes.

MongoDB for flexible schemas, Databricks for ML + Analytics.

Separate OLTP from OLAP — use different systems.

Choose based on access pattern, not just data volume.


Practice Questions

  1. When would you choose Snowflake over BigQuery?
  2. Why can't you use PostgreSQL for a 10 TB analytics warehouse?
  3. What's the difference between Cassandra and MongoDB's failure models?
  4. Design a data warehouse using both PostgreSQL (OLTP) and BigQuery (OLAP).
  5. When would Databricks be better than BigQuery?