8.05 · Deep Dive: Databricks for ML + Analytics
Level: Advanced | Time to read: 18 min | Pre-reading: 08 · Tools Ecosystem | After reading: You'll understand Databricks architecture, Delta Lake, ML workflows, and cost optimization strategies.
Databricks: Unified Analytics Platform
Databricks = Apache Spark + Delta Lake + ML workflows + BI tools (all managed)
Architecture
Databricks Platform
┌─────────────────────────────────────┐
│ Workspace (Web UI + Notebooks) │
├─────────────────────────────────────┤
│ Compute Cluster Management │
│ (Auto-scaling, Spot instances) │
├─────────────────────────────────────┤
│ Apache Spark Engine │
│ (Distributed SQL, Python, Scala) │
├─────────────────────────────────────┤
│ Delta Lake (ACID tables) │
│ (Parquet + transaction log) │
├─────────────────────────────────────┤
│ Cloud Storage (S3, ADLS, GCS) │
└─────────────────────────────────────┘
Strengths
✅ Unified platform - SQL, Python, Scala, R in one place
✅ Delta Lake - ACID guarantees on data lake (schema on read → schema enforced)
✅ Auto-scaling - Clusters scale up/down based on workload
✅ ML-native - MLflow for experiment tracking, model registry
✅ Cost-effective - Spot instances, auto-termination
✅ Data governance - Unity Catalog for lineage, access control
✅ Fast queries - Photon acceleration for vectorized execution
Weaknesses
❌ Complex pricing model - DBUs (Databricks Units) can be expensive
❌ Cluster startup time - 2-5 minutes to provision a cluster
❌ Learning curve - Spark concepts (RDDs, DAGs, partitions) are complex
❌ Overkill for small data - Better to use BigQuery/Snowflake for < 10TB
❌ Cluster management - Still need to manage clusters (unlike serverless BigQuery)
When to Use Databricks
Good Fit
- ✅ ML pipelines - MLflow tracking, model registry, feature store
- ✅ Large-scale ETL - Multi-cluster, terabyte-scale transformations
- ✅ Real-time + batch - Structured Streaming + batch jobs
- ✅ Data lake analytics - Query multiple data formats (Parquet, CSV, JSON)
- ✅ Team collaboration - Shared notebooks, version control integration
- ✅ Cost-sensitive at scale - Spot instances save 70% vs. on-demand
Poor Fit
- ❌ Startup with < 1TB data - Overhead of platform
- ❌ Simple BI dashboards - Use BigQuery/Snowflake
- ❌ OLTP workloads - Use PostgreSQL/MySQL
- ❌ Low-latency dashboards - Spark micro-batch jobs typically add seconds to minutes of latency
Delta Lake: The Game-Changer
Problem: Data lakes (Parquet on S3) have no schema enforcement, no ACID guarantees
Traditional Data Lake (S3 + Parquet):
s3://my-bucket/
├── sales/
│ ├── 2024-01-01.parquet
│ ├── 2024-01-02.parquet
│ └── ... (100+ files)
│
Problem:
❌ No schema evolution (what if column added?)
❌ No ACID (partial writes fail, data corrupt)
❌ No rollback (deleted data is gone)
❌ No time travel (can't query historical state)
Solution: Delta Lake
Delta Lake (S3 + Parquet + Transaction Log):
s3://my-bucket/
├── sales/
│ ├── _delta_log/
│ │ ├── 00000000000000000000.json (transaction log)
│ │ └── ... (version history)
│ ├── part-00000.parquet
│ └── part-00001.parquet
│
Benefits:
✅ Schema enforcement (DDL prevents invalid data)
✅ ACID transactions (all-or-nothing writes)
✅ Rollback (revert to previous version)
✅ Time travel (query data as of specific time)
✅ Data lineage (track changes)
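What these guarantees look like in practice - a minimal sketch, assuming a Delta table at the hypothetical path s3://bucket/sales and an arbitrary earlier version number:
# Schema enforcement: an append whose schema doesn't match the table is rejected
bad_df = spark.createDataFrame([(1, "oops")], ["customer_id", "unexpected_col"])
try:
    bad_df.write.format("delta").mode("append").save("s3://bucket/sales")
except Exception as e:
    print("Write rejected by schema enforcement:", e)
# Rollback: restore the table to an earlier version (122 is a hypothetical version)
spark.sql("RESTORE TABLE delta.`s3://bucket/sales` TO VERSION AS OF 122")
# Audit trail: every commit recorded in the transaction log
spark.sql("DESCRIBE HISTORY delta.`s3://bucket/sales`").show(truncate=False)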
Example: ACID Transactions
# Databricks: ACID UPDATE (impossible in regular data lakes!)
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "s3://bucket/sales")
# Update with ACID guarantees
deltaTable.update(
    condition="customer_id = 12345",
    set={"status": "'REFUNDED'", "amount": "amount * -1"}
)
# If update fails halfway, entire transaction rolls back
# Data lake remains consistent!
# Time travel: query the table at an earlier version
# (use TIMESTAMP AS OF instead to query as of a point in time, e.g. 1 hour ago)
spark.sql("""
    SELECT * FROM delta.`s3://bucket/sales`
    VERSION AS OF 123
""")
# Merge (upsert): insert or update based on a join condition
deltaTable.alias("target").merge(
    source_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={"email": "source.email"}
).whenNotMatchedInsertAll(
).execute()
Databricks for ML + Analytics
Architecture: Complete ML Workflow
1. Data Ingestion
↓ (PySpark, SQL)
2. Feature Engineering
↓ (Databricks Feature Store)
3. Model Training
↓ (MLflow tracking experiments)
4. Model Registry
↓ (Stage: Dev → Staging → Production)
5. Batch/Streaming Inference
↓ (Apply model to new data)
6. Monitoring
↓ (MLflow tracking metrics)
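Step 2 can be backed by the Databricks Feature Store so engineered features are shared and reused across models. A minimal sketch, assuming a hypothetical Spark DataFrame customer_features keyed by customer_id and a hypothetical table name feature_store.customer_features:
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
# Register engineered features as a governed, discoverable feature table
fs.create_table(
    name="feature_store.customer_features",   # hypothetical database.table name
    primary_keys=["customer_id"],
    df=customer_features,                     # hypothetical DataFrame from feature engineering
    description="Customer activity features for churn models"
)
# Subsequent runs upsert refreshed feature values into the same table
fs.write_table(name="feature_store.customer_features", df=customer_features, mode="merge")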
Example: End-to-End ML Pipeline
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# Enable MLflow tracking
mlflow.set_experiment("/Users/data-eng/customer-churn")
with mlflow.start_run(run_name="rf-v1"):
    # Load data
    train_df = spark.read.format("delta").load("dbfs:/data/customer_features_train")
    X_train = train_df.toPandas()
    y_train = X_train.pop("is_churned")
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Log metrics (training accuracy)
    accuracy = model.score(X_train, y_train)
    mlflow.log_metric("accuracy", accuracy)
    # Log hyperparameters
    mlflow.log_params({
        "n_estimators": 100,
        "max_depth": 10
    })
    # Log and register the model in the Model Registry
    mlflow.sklearn.log_model(model, "churn-model", registered_model_name="customer-churn")
# Later: Load registered model
model = mlflow.sklearn.load_model("models:/customer-churn/Production")
# mapInPandas expects a function that takes an iterator of pandas DataFrames
# and yields pandas DataFrames matching the declared output schema
def predict_churn(batches):
    for batch in batches:
        yield pd.DataFrame({"churn_probability": model.predict_proba(batch)[:, 1]})
# Apply to streaming data (near-real-time predictions)
streaming_df = spark.readStream.table("customer_activity")
predictions = streaming_df.mapInPandas(predict_churn, "churn_probability DOUBLE")
(predictions.writeStream
    .option("checkpointLocation", "dbfs:/checkpoints/churn_predictions")  # placeholder path
    .toTable("churn_predictions"))
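Step 4's Dev → Staging → Production promotion is handled through the MLflow Model Registry. A minimal sketch of promoting a registered version, assuming the registered name customer-churn from above and a hypothetical candidate version 3:
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Promote the candidate version; archive whatever is currently in Production
client.transition_model_version_stage(
    name="customer-churn",
    version=3,                        # hypothetical version number
    stage="Production",
    archive_existing_versions=True
)
# Downstream jobs keep loading "models:/customer-churn/Production"
# and automatically pick up the newly promoted version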
Cost Optimization in Databricks
1. Use Spot Instances
# Cluster configuration: Use spot instances to save ~70% on compute
cluster_config = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 10,
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # Use spot, fall back to on-demand
        "zone_id": "us-west-2a"
    },
    "autotermination_minutes": 30  # Terminate the cluster after 30 idle minutes
}
# EC2 compute cost (driver + 10 workers = 11 nodes; DBU charges apply on top):
# On-demand: $0.312/hour × 11 nodes = $3.43/hour
# Spot:      $0.094/hour × 11 nodes = $1.03/hour (~70% savings)
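One way to apply this configuration is the Clusters REST API (POST /api/2.0/clusters/create). A minimal sketch with requests, assuming DATABRICKS_HOST and DATABRICKS_TOKEN environment variables and the cluster_config dict above; the cluster name is a placeholder:
import os
import requests
resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"cluster_name": "spot-etl-cluster", **cluster_config}  # placeholder name
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])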
2. Photon Acceleration
-- Photon speeds up SQL queries 2-5x (native execution engine)
SELECT
customer_segment,
COUNT(*) as num_orders,
SUM(amount) as total_revenue
FROM sales
WHERE order_date >= '2024-01-01'
GROUP BY customer_segment;
-- On a Photon-enabled cluster or SQL warehouse, this query runs on the
-- vectorized native engine automatically - no code changes needed
3. Query Optimization
-- ❌ BAD: On a plain Parquet data lake, COUNT(*) scans every file
SELECT COUNT(*) FROM events; -- Reads all 100TB
-- ✅ GOOD: On a Delta table, the same query is answered from Delta statistics
SELECT COUNT(*) FROM events; -- Transaction log stores row counts and min/max values (near-instant)
-- ✅ GOOD: Partition pruning
SELECT * FROM events
WHERE date = '2024-01-15'; -- Only reads 1 day's partition
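Partition pruning only helps if the table is physically partitioned on the filter column. A minimal sketch of writing the table partitioned by date and compacting it, assuming a hypothetical events_df DataFrame:
# Partition by date so WHERE date = '...' reads a single directory
(events_df.write
    .format("delta")
    .partitionBy("date")
    .mode("overwrite")
    .saveAsTable("events"))
# Compact small files and co-locate rows for selective non-partition filters
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")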
Databricks vs. Competitors
| Feature | Databricks | BigQuery | Snowflake |
|---|---|---|---|
| ML-native | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Data lake | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Query speed | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| ACID guarantees | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Ease of use | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Cost at scale | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Cost at small scale | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
Real Example: Complete Medallion Architecture in Databricks
# Databricks: Medallion architecture using Delta Lake
# BRONZE: Raw data (append-only)
spark.sql("""
CREATE TABLE IF NOT EXISTS bronze_customers USING DELTA
LOCATION 's3://bucket/bronze/customers'
AS SELECT
*,
current_timestamp() as _elt_timestamp
FROM source_data
""")
# SILVER: Validated data (simplified SCD Type 2)
# Assumes silver_customers already exists with the SCD2 columns
# (customer_id, email, phone, updated_at, effective_date, end_date, is_current);
# a full SCD2 flow would also re-insert the new version of changed rows in a second step
spark.sql("""
MERGE INTO silver_customers t
USING (
  SELECT
    customer_id, email, phone, updated_at
  FROM bronze_customers
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET is_current = FALSE, end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, phone, effective_date, is_current)
  VALUES (s.customer_id, s.email, s.phone, current_date(), TRUE)
""")
# GOLD: Aggregated data for BI
spark.sql("""
CREATE TABLE IF NOT EXISTS gold_customer_summary USING DELTA
LOCATION 's3://bucket/gold/customer_summary'
AS SELECT
c.customer_id,
c.email,
COUNT(DISTINCT o.order_id) as num_orders,
SUM(o.amount) as lifetime_value,
MAX(o.order_date) as last_order_date
FROM silver_customers c
LEFT JOIN silver_orders o ON c.customer_id = o.customer_id
WHERE c.is_current = TRUE
GROUP BY c.customer_id, c.email
""")
Key Takeaways
- Databricks is best for ML + large-scale ETL, not simple analytics
- Delta Lake provides ACID guarantees on data lakes (game-changer)
- Use spot instances to reduce costs by 70%
- MLflow tracks experiments and manages models end-to-end
- For small data (< 10TB), use BigQuery/Snowflake instead
- Combine with BI tools (Tableau, Power BI) for dashboards