08 · Tools Ecosystem: BigQuery, PostgreSQL, Cassandra, MongoDB, Databricks
Level: Intermediate to Advanced · Time to read: 18 min
After reading: You'll understand the trade-offs, use cases, and architectural fit of major data warehouse and database platforms.
Platform Comparison Matrix
| Feature | BigQuery | Snowflake | Databricks | PostgreSQL | Cassandra | MongoDB |
|---|---|---|---|---|---|---|
| Type | OLAP/DWH | OLAP/DWH | OLAP+ML | OLTP | NoSQL | NoSQL |
| Scaling | Auto | Manual | Auto | Vertical | Horizontal | Horizontal |
| Storage | GCS | S3/GCS | Delta | Local | Distributed | Local |
| Best for | Analytics | Enterprise DWH | ML + Analytics | Transactional | Write-heavy time-series | Flexible schema |
| Starting cost | Low | High | Medium | Free | Free | Free |
| Latency | Seconds | Seconds | Seconds/Minutes | Milliseconds | Milliseconds | Milliseconds |
BigQuery (Google Cloud)
Architecture
BigQuery = Columnar Storage + Massive Parallelism
┌─────────────────────────────────────────┐
│ Query (SQL via Web UI / API) │
├─────────────────────────────────────────┤
│ Dremel (Query Engine) │
│ └─ Parallel execution across thousands of workers │
├─────────────────────────────────────────┤
│ Colossus (Distributed Storage) │
│ └─ Columnar format, 10-100x compression │
└─────────────────────────────────────────┘
Pricing: $6.25 per TB scanned
Cost Control: Set max bytes scanned per project
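One way to watch that spend from SQL itself — a sketch against INFORMATION_SCHEMA.JOBS (the `region-us` qualifier and the $6.25/TB rate are assumptions to adjust for your project):
-- Find last week's most expensive on-demand queries
SELECT
  job_id,
  user_email,
  total_bytes_billed / POW(10, 12) * 6.25 AS est_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY est_cost_usd DESC
LIMIT 10;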
Strengths
✅ Auto-scaling: No cluster management
✅ Cheap at scale: $6.25/TB very competitive
✅ Nested/JSON support: Can store and query nested objects
✅ Built-in ML (BigQuery ML): Train models with SQL (sketch after this list)
✅ Compliance: SOC 2, HIPAA, GDPR certified
✅ Zero cold starts: Instant queries
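The BigQuery ML flow from the strengths list, as a minimal sketch (dataset, table, and column names are hypothetical):
-- Train a churn classifier entirely in SQL
CREATE OR REPLACE MODEL dataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM dataset.customers;

-- Score new rows with the trained model
SELECT *
FROM ML.PREDICT(MODEL dataset.churn_model,
  (SELECT tenure_months, monthly_spend, support_tickets
   FROM dataset.new_customers));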
Weaknesses
❌ Geospatial: GIS functions exist but lag PostGIS in depth
❌ Real-time ingest: Streaming inserts add cost; not built for low-latency operational reads
❌ Cross-region: Data replication is manual
Use Cases
- ✅ Fast-growing analytics (auto-scales)
- ✅ Cost-conscious (pay-per-query, 10 MB minimum billed per query)
- ✅ GCP-native (integrates with Dataflow, Cloud Storage)
- ❌ Real-time operational queries
Example: Loading & Querying
-- Load from GCS Parquet
LOAD DATA INTO dataset.fact_sales
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/sales/*.parquet']
);
-- Query with auto-scaling (parallelized across slots from BigQuery's shared pool)
SELECT
  DATE(sale_date) AS sale_date,
  product_category,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM dataset.fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100;
Snowflake
Architecture
Snowflake = Shared Cloud Storage + Separate Compute
┌─────────────────────────────┐
│ Compute Clusters │
│ ├─ Cluster 1 (10 nodes) │
│ ├─ Cluster 2 (5 nodes) │
│ └─ Cluster N (scale as needed) │
├─────────────────────────────┤
│ Cloud Storage │
│ (S3 / Azure Blob / GCS) │
│ └─ Data shared across clusters │
└─────────────────────────────┘
Pricing: $1-4 per credit (1 credit ≈ 1 node-hour; an XS warehouse burns 1 credit/hour)
Scaling: Warehouse sizes are chosen manually (auto-suspend/resume built in; sketch below)
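What that provisioning looks like in practice — a minimal sketch (the warehouse name is hypothetical):
-- Compute is sized explicitly; size sets the credit burn rate
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM' -- 4 credits/hour while running
  AUTO_SUSPEND = 60         -- suspend after 60s idle to stop billing
  AUTO_RESUME = TRUE;       -- wakes automatically on the next query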
Strengths
✅ Separation of compute & storage: Scale independently
✅ Time-travel: Query historical snapshots (CLONE, UNDROP)
✅ Seamless scaling: Add clusters for concurrent workloads
✅ Zero-copy cloning: Clone tables instantly (metadata-only, no data copied)
✅ Account sharing: Share data across accounts (marketplace)
Weaknesses
❌ Cost floor: Idle or oversized warehouses burn credits; production spend typically starts around $1,000+/month
❌ Manual scaling: Must predict and configure clusters
❌ Shared cloud storage: Data can be expensive to move
Use Cases
- ✅ Enterprise data warehouses (reliable, scalable)
- ✅ Self-service BI (multiple clusters per department)
- ✅ Data sharing (Snowflake marketplace)
- ❌ Cost-conscious startups
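Example: Time Travel & Zero-Copy Cloning
A sketch of the two signature strengths above, assuming a fact_sales table exists:
-- Zero-copy clone: metadata-only, no data duplicated
CREATE TABLE fact_sales_dev CLONE fact_sales;

-- Time travel: query the table as it looked one hour ago
SELECT COUNT(*) FROM fact_sales AT (OFFSET => -3600);

-- Recover an accidentally dropped table
UNDROP TABLE fact_sales;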
Databricks (Lakehouse)
Architecture
Databricks = Delta Lake (Parquet) + Spark + Notebooks
┌────────────────────────────────┐
│ Workspace (Notebooks + Jobs) │
├────────────────────────────────┤
│ Spark Clusters │
│ ├─ Auto-scaling clusters │
│ └─ Run SQL, Python, Scala, R │
├────────────────────────────────┤
│ Delta Lake (S3 / Azure) │
│ └─ ACID transactions on files │
└────────────────────────────────┘
Pricing: $0.30-0.50 per DBU-hour
Best for: Analytics + ML + Data Science
Strengths
✅ Unified storage: Single Delta Lake for all workloads
✅ ACID on data lakes: Transactional guarantee on files
✅ ML ecosystem: MLflow, feature store built-in
✅ Flexible: SQL, Python, Scala, R in same notebook
✅ Cost-effective: Cheaper than Snowflake at scale
Weaknesses
❌ Operationally complex: Need to manage clusters
❌ Multi-language: Can be harder to standardize
❌ Data governance: Less mature than Snowflake/BigQuery
Use Cases
- ✅ ML/Data Science workflows (built-in notebooks)
- ✅ Multi-language teams (SQL + Python)
- ✅ Cost-conscious at scale (cheaper DBUs)
- ❌ Non-technical BI teams (prefer Snowflake SQL)
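Example: Delta Lake Table & MERGE
A minimal sketch of the ACID-on-files idea above (the sales table and updates source are hypothetical):
-- Delta table: Parquet files plus a transaction log
CREATE TABLE sales (
  sale_id BIGINT,
  amount DOUBLE,
  sale_date DATE
) USING DELTA;

-- Transactional upsert directly on data-lake files
MERGE INTO sales AS t
USING updates AS s
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: read an earlier version of the table
SELECT * FROM sales VERSION AS OF 3;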
PostgreSQL (OLTP Reference)
Why Include PostgreSQL?
PostgreSQL is the "reference OLTP" — interviews often ask: "Why not just use PostgreSQL?"
Strengths
✅ Transactions: Full ACID, FK constraints
✅ Reliability: Battle-tested, used by Fortune 500
✅ Extensibility: Custom data types, functions
✅ JSON support: JSONB for semi-structured data (sketch after this list)
✅ Free: No licensing costs
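A quick sketch of that JSONB support (the table and payload shape are hypothetical):
-- JSONB column with a GIN index for containment queries
CREATE TABLE events (
  id BIGSERIAL PRIMARY KEY,
  payload JSONB
);
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- The @> containment operator can use the GIN index
SELECT payload->>'user_id' AS user_id
FROM events
WHERE payload @> '{"type": "purchase"}';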
Weaknesses (For Analytics)
❌ No columnar: Row-based storage, slow aggregations
❌ Scaling: Replication (read replicas), not distributed
❌ Cost: Vertical scaling gets expensive
❌ Contention: OLTP & OLAP on same system conflicts
When It Works For Analytics
-- PostgreSQL: Great for operational reporting
SELECT
  DATE(order_date) AS date,
  COUNT(*) AS orders,
  SUM(total) AS revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1 DESC;
-- Fast: Only 30 days of data, index-friendly
-- Too slow over 5 years of data: row storage forces huge scans
Interview Takeaway
"PostgreSQL is excellent for OLTP and small analytical queries. At scale, separate an OLAP system (BigQuery/Snowflake) from your operational database."
Cassandra (Wide-Column NoSQL)
Architecture: Distributed, Eventually Consistent
Cassandra = Distributed Hash Table
Ring of Nodes:
│
├─ Node 1: Keys A-D
├─ Node 2: Keys E-H (replicated from Node 1)
├─ Node 3: Keys I-L (replicated from Node 2)
└─ Node 4: Keys M-Z (replicated from Node 3)
Write: "Put key=customer_15 value=..."
→ Hash(customer_15) → Node 2
→ Replicate to Node 3 and 4
→ Acknowledge once the consistency level is met (e.g., ONE)
Consistency: Eventually consistent (AP in CAP theorem)
Schema: Wide Rows
-- Cassandra: Wide-row design
CREATE TABLE user_events (
  user_id UUID,
  event_timestamp BIGINT, -- epoch millis; sorts rows within a partition
  event_type TEXT,
  event_details MAP<TEXT, TEXT>,
  PRIMARY KEY (user_id, event_timestamp) -- Partition + Clustering
) WITH CLUSTERING ORDER BY (event_timestamp DESC);
-- Single-partition query (fast); CQL UUID literals are unquoted
SELECT * FROM user_events
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
LIMIT 100;
-- Result: that user's last 100 events, returned in milliseconds
Strengths
✅ Horizontal scale: Add nodes, cluster grows
✅ High availability: No single point of failure
✅ Writes are fast: Optimized for write-heavy
✅ Time-series friendly: Clustering keys are time-ordered
Weaknesses
❌ No JOINs: CQL has none; denormalize into one table per query (sketch after this list)
❌ Eventual consistency: Reads can return stale values
❌ Limited aggregations: GROUP BY only works within a single partition
❌ Not for analytics: Row-oriented storage, no columnar scans
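What "one table per query" means in practice — a hypothetical second table reusing the user_events columns above:
-- Duplicate the same events, keyed for a different query pattern
CREATE TABLE user_events_by_type (
  event_type TEXT,
  event_timestamp BIGINT,
  user_id UUID,
  event_details MAP<TEXT, TEXT>,
  PRIMARY KEY ((event_type), event_timestamp, user_id)
) WITH CLUSTERING ORDER BY (event_timestamp DESC);
-- The application writes every event to BOTH tables;
-- there is no JOIN to combine them at read time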
Use Cases
- ✅ Time-series (metrics, events)
- ✅ High-write IoT data
- ✅ 24/7 availability (no downtime)
- ❌ Analytics (use BigQuery instead)
- ❌ Complex queries
MongoDB (Document NoSQL)
Schema: Flexible Documents
// MongoDB: Flexible schema
db.customers.insertOne({
  _id: ObjectId(),
  customer_id: 12345,
  name: "Alice",
  email: "alice@example.com",
  address: {
    street: "123 Main",
    city: "New York",
    zip: "10001"
  },
  orders: [
    { order_id: 1, date: "2024-01-01", total: 99.99 },
    { order_id: 2, date: "2024-01-15", total: 149.99 }
  ]
});

// Query with aggregation pipeline
db.customers.aggregate([
  { $match: { "address.city": "New York" } },
  { $unwind: "$orders" },
  { $group: { _id: "$_id", total_spent: { $sum: "$orders.total" } } },
  { $sort: { total_spent: -1 } }
]);
Strengths
✅ Flexible schema: No rigid table definition
✅ Nested documents: Natural representation of hierarchies
✅ Horizontal scale: Sharding across servers
✅ Developer friendly: JSON-like queries
Weaknesses
❌ Consistency: Multi-document transactions exist (4.0+) but cost performance; single-document ops are the sweet spot
❌ Duplicate data: JOINs limited to $lookup; denormalization is the norm
❌ Storage: More verbose than columnar (worse compression)
❌ Analytics: Slow for large GROUP BY
Use Cases
- ✅ Content management (flexible schema)
- ✅ Real-time applications (mobile apps)
- ✅ E-commerce (product catalogs)
- ❌ Financial transactions (need ACID)
- ❌ Analytics (use BigQuery)
Decision Tree: Which Platform?
graph TD
A["What's your primary use case?"]
A -->|"Analytics & BI"| B["> 1 TB data?"]
A -->|"Real-time Ops"| C["Distributed needed?"]
A -->|"Flexible Schema"| D["Analytics required?"]
A -->|"Time-Series"| E["Write-heavy?"]
B -->|"Yes"| F["BigQuery<br/>or Snowflake"]
B -->|"No"| G["PostgreSQL+BI"]
C -->|"Yes"| H["MongoDB<br/>or Cassandra"]
C -->|"No"| I["PostgreSQL"]
D -->|"Yes"| J["MongoDB<br/>(small scale)"]
D -->|"No"| K["PostgreSQL"]
E -->|"Yes"| L["Cassandra"]
E -->|"No"| M["PostgreSQL<br/>or MongoDB"]
Deep-Dives
→ Deep Dive: BigQuery — BigQuery-specific: nested queries, ML, cost optimization.
→ Deep Dive: PostgreSQL — OLTP patterns, replication, analytical extensions (Citus).
→ Deep Dive: Cassandra — Distributed architecture, token ring, consistency models.
→ Deep Dive: MongoDB — Sharding, transactions, aggregation pipeline.
→ Deep Dive: Databricks — Delta Lake, ML workflows, cost optimization.
Key Takeaways
✅ BigQuery for analytics with minimal ops, Snowflake for enterprise control.
✅ PostgreSQL for OLTP, Cassandra for distributed writes.
✅ MongoDB for flexible schemas, Databricks for ML + Analytics.
✅ Separate OLTP from OLAP — use different systems.
✅ Choose based on access pattern, not just data volume.
Practice Questions
- When would you choose Snowflake over BigQuery?
- Why can't you use PostgreSQL for a 10 TB analytics warehouse?
- What's the difference between Cassandra and MongoDB's failure models?
- Design a data warehouse using both PostgreSQL (OLTP) and BigQuery (OLAP).
- When would Databricks be better than BigQuery?