8.05 · Deep Dive: Databricks for ML + Analytics

Level: Advanced
Time to read: 18 min
Pre-reading: 08 · Tools Ecosystem
After reading: You'll understand Databricks architecture, Delta Lake, ML workflows, and cost optimization strategies.


Databricks: Unified Analytics Platform

Databricks = Apache Spark + Delta Lake + ML workflows + BI tools (all managed)

Architecture

Databricks Platform
┌─────────────────────────────────────┐
│  Workspace (Web UI + Notebooks)     │
├─────────────────────────────────────┤
│  Compute Cluster Management         │
│  (Auto-scaling, Spot instances)     │
├─────────────────────────────────────┤
│  Apache Spark Engine                │
│  (Distributed SQL, Python, Scala)   │
├─────────────────────────────────────┤
│  Delta Lake (ACID tables)           │
│  (Parquet + transaction log)        │
├─────────────────────────────────────┤
│  Cloud Storage (S3, ADLS, GCS)      │
└─────────────────────────────────────┘

Strengths

✅ Unified platform - SQL, Python, Scala, and R in one place
✅ Delta Lake - ACID guarantees on the data lake (schema on read → schema enforced)
✅ Auto-scaling - clusters scale up/down based on workload
✅ ML-native - MLflow for experiment tracking and model registry
✅ Cost-effective - spot instances, auto-termination
✅ Data governance - Unity Catalog for lineage and access control
✅ Fast queries - Photon acceleration for vectorized execution


Weaknesses

❌ Complex pricing model - DBUs (Databricks Units) can be expensive
❌ Cluster startup time - 2-5 minutes to provision a cluster
❌ Learning curve - Spark concepts (RDDs, DAGs, partitions) are complex
❌ Overkill for small data - better to use BigQuery/Snowflake for < 10 TB
❌ Cluster management - you still manage clusters (unlike serverless BigQuery)


When to Use Databricks

Good Fit

  • ML pipelines - MLflow tracking, model registry, feature store
  • Large-scale ETL - Multi-cluster, terabyte-scale transformations
  • Real-time + batch - Structured Streaming + batch jobs
  • Data lake analytics - Query multiple data formats (Parquet, CSV, JSON)
  • Team collaboration - Shared notebooks, version control integration
  • Cost-sensitive at scale - Spot instances save 70% vs. on-demand

Poor Fit

  • Startup with < 1TB data - the platform overhead isn't worth it
  • Simple BI dashboards - use BigQuery/Snowflake instead
  • OLTP workloads - use PostgreSQL/MySQL
  • Real-time dashboards - Spark micro-batch jobs typically add minutes of latency

Delta Lake: The Game-Changer

Problem: Data lakes (Parquet on S3) have no schema enforcement, no ACID guarantees

Traditional Data Lake (S3 + Parquet):
s3://my-bucket/
├── sales/
│   ├── 2024-01-01.parquet
│   ├── 2024-01-02.parquet
│   └── ... (100+ files)
Problem:
❌ No schema evolution (what if column added?)
❌ No ACID (partial writes fail, data corrupt)
❌ No rollback (deleted data is gone)
❌ No time travel (can't query historical state)

Solution: Delta Lake

Delta Lake (S3 + Parquet + Transaction Log):
s3://my-bucket/
├── sales/
│   ├── _delta_log/
│   │   ├── 00000000000000000000.json  (transaction log)
│   │   └── ... (version history)
│   ├── part-00000.parquet
│   └── part-00001.parquet
Benefits:
✅ Schema enforcement (DDL prevents invalid data)
✅ ACID transactions (all-or-nothing writes)
✅ Rollback (revert to previous version)
✅ Time travel (query data as of specific time)
✅ Data lineage (track changes)

Example: ACID Transactions

# Databricks: ACID UPDATE (impossible in regular data lakes!)
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://bucket/sales")

# Update with ACID guarantees
deltaTable.update(
    condition="customer_id = 12345",
    set={"status": "'REFUNDED'", "amount": "amount * -1"}
)

# If update fails halfway, entire transaction rolls back
# Data lake remains consistent!

# Time travel: query the table as of a specific version
spark.sql("""
SELECT * FROM delta.`s3://bucket/sales`
VERSION AS OF 123
""")

# Merge (upsert): insert or update based on a match condition
deltaTable.alias("target").merge(
    source_df.alias("source"),  # source_df: incoming DataFrame of changes
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={"email": "source.email"}
).whenNotMatchedInsertAll().execute()
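
Rollback and schema enforcement from the benefits list above aren't covered by the example. A minimal sketch, assuming the same table path (the restored version number and the bad column name are illustrative):

# Rollback: restore the table to an earlier version
spark.sql("RESTORE TABLE delta.`s3://bucket/sales` TO VERSION AS OF 122")

# Schema enforcement: appending a DataFrame whose schema doesn't match the
# table is rejected instead of silently corrupting the data
bad_df = spark.createDataFrame([(1, "oops")], ["customer_id", "unexpected_col"])
try:
    bad_df.write.format("delta").mode("append").save("s3://bucket/sales")
except Exception as e:
    print("Write rejected by schema enforcement:", e)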

Databricks for ML + Analytics

Architecture: Complete ML Workflow

1. Data Ingestion
   ↓ (PySpark, SQL)

2. Feature Engineering
   ↓ (Databricks Feature Store)

3. Model Training
   ↓ (MLflow tracking experiments)

4. Model Registry
   ↓ (Stage: Dev → Staging → Production)

5. Batch/Streaming Inference
   ↓ (Apply model to new data)

6. Monitoring
   ↓ (MLflow tracking metrics)
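
Step 2 typically goes through the Databricks Feature Store. A minimal sketch with the legacy FeatureStoreClient (table name and feature columns are illustrative; newer workspaces expose the same workflow through FeatureEngineeringClient):

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Compute features with Spark SQL, then publish them as a feature table
features_df = spark.sql("""
  SELECT customer_id,
         COUNT(*)    AS orders_90d,
         SUM(amount) AS spend_90d
  FROM sales
  WHERE order_date >= date_sub(current_date(), 90)
  GROUP BY customer_id
""")

fs.create_table(
    name="ml.customer_features",      # illustrative database.table name
    primary_keys=["customer_id"],
    df=features_df,
    description="90-day order counts and spend per customer",
)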

Example: End-to-End ML Pipeline

import mlflow
from sklearn.ensemble import RandomForestClassifier

# Enable MLflow tracking
mlflow.set_experiment("/Users/data-eng/customer-churn")

with mlflow.start_run(run_name="rf-v1"):
    # Load data
    train_df = spark.read.format("delta").load("dbfs:/data/customer_features_train")
    X_train = train_df.toPandas()
    y_train = X_train.pop("is_churned")

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics (training accuracy shown for brevity; use a held-out set in practice)
    accuracy = model.score(X_train, y_train)
    mlflow.log_metric("accuracy", accuracy)

    # Log hyperparameters
    mlflow.log_params({
        "n_estimators": 100,
        "max_depth": 10
    })

    # Log the model and register it in the MLflow Model Registry
    mlflow.sklearn.log_model(model, "churn-model", registered_model_name="customer-churn")

# Later: Load registered model
model = mlflow.sklearn.load_model("models:/customer-churn/Production")

# Apply the model to production data. mapInPandas expects a function that
# takes an iterator of pandas DataFrames and yields pandas DataFrames.
import pandas as pd

def predict_churn(batches):
    for pdf in batches:
        yield pd.DataFrame({"churn_probability": model.predict_proba(pdf)[:, 1]})

# Apply to streaming data (near-real-time predictions)
streaming_df = spark.readStream.table("customer_activity")
predictions = streaming_df.mapInPandas(predict_churn, "churn_probability DOUBLE")
(predictions.writeStream
    .option("checkpointLocation", "dbfs:/checkpoints/churn_predictions")  # illustrative path
    .toTable("churn_predictions"))
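
Step 4 of the workflow (Dev → Staging → Production) is handled by the Model Registry. A minimal sketch with the MLflow client; the version number is illustrative, and newer MLflow releases favor model aliases over stages:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote a registered model version to the Production stage
client.transition_model_version_stage(
    name="customer-churn",
    version=3,                       # illustrative version number
    stage="Production",
    archive_existing_versions=True,  # archive whatever was in Production before
)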

Cost Optimization in Databricks

1. Use Spot Instances

# Cluster configuration: Use spot instances to save 70%
cluster_config = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 10,
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # Use spot, fallback to on-demand
        "zone_id": "us-west-2a"
    },
    "auto_terminate_minutes": 30  # Stop if idle
}

# Cost (EC2 compute only; DBU charges apply on top):
# On-demand: $0.312/hour × 11 nodes (driver + 10 workers) = $3.432/hour
# Spot: $0.094/hour × 11 nodes = $1.034/hour (~70% savings!)
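
To actually create this cluster, the config above can be posted to the Databricks Clusters REST API. A minimal sketch, assuming a workspace URL and a personal access token (both placeholders), with an illustrative cluster name:

import requests

# Create the cluster via the Clusters API (POST /api/2.0/clusters/create)
resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"cluster_name": "etl-spot-cluster", **cluster_config},
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])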

2. Photon Acceleration

-- Photon speeds up SQL queries 2-5x (native execution engine)
SELECT
  customer_segment,
  COUNT(*) as num_orders,
  SUM(amount) as total_revenue
FROM sales
WHERE order_date >= '2024-01-01'
GROUP BY customer_segment;

-- Photon automatically optimizes this query
-- No code changes needed!
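
Photon is turned on per cluster rather than per query. A minimal sketch against the cluster_config dict from the previous section (the alternative runtime version string is illustrative):

# Enable Photon on the cluster (vectorized native execution engine)
cluster_config["runtime_engine"] = "PHOTON"
# Alternatively, pick a Photon runtime version, e.g. "13.3.x-photon-scala2.12"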

3. Query Optimization

-- ❌ BAD: Full scan of a massive table
SELECT * FROM events;  -- Reads all 100TB

-- ✅ GOOD: Simple aggregates use Delta statistics (no scan needed!)
SELECT COUNT(*) FROM events;  -- Answered from metadata (near-instant)
-- Delta Lake stores per-file row counts and min/max values

-- ✅ GOOD: Partition pruning
SELECT * FROM events
WHERE date = '2024-01-15';  -- Only reads 1 day's partition
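
File layout matters too: many small files and poorly clustered data force queries to read more than they need. A minimal sketch of Delta's compaction and clustering commands (the Z-order column is illustrative):

-- Compact small files and co-locate rows by a frequently filtered column
OPTIMIZE events
ZORDER BY (customer_id);

-- Remove old, unreferenced files once they fall outside the retention window
VACUUM events RETAIN 168 HOURS;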

Databricks vs. Competitors

Feature             | Databricks | BigQuery   | Snowflake
ML-native           | ⭐⭐⭐⭐⭐ | ⭐⭐⭐     | ⭐⭐
Data lake           | ⭐⭐⭐⭐⭐ | ⭐⭐⭐     | ⭐⭐⭐
Query speed         | ⭐⭐⭐⭐   | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐
ACID guarantees     | ⭐⭐⭐⭐⭐ | ⭐⭐       | ⭐⭐⭐⭐
Ease of use         | ⭐⭐⭐     | ⭐⭐⭐⭐   | ⭐⭐⭐⭐
Cost at scale       | ⭐⭐⭐⭐   | ⭐⭐⭐     | ⭐⭐⭐
Cost at small scale | ⭐⭐       | ⭐⭐⭐⭐⭐ | ⭐⭐⭐

Real Example: Complete Medallion Architecture in Databricks

# Databricks: Medallion architecture using Delta Lake

# BRONZE: Raw data (append-only)
spark.sql("""
CREATE TABLE IF NOT EXISTS bronze_customers USING DELTA
LOCATION 's3://bucket/bronze/customers'
AS SELECT 
  *,
  current_timestamp() as _elt_timestamp
FROM source_data
""")

# SILVER: Validated data (SCD Type 2)
# spark.sql() runs one statement per call, so CREATE and MERGE are separate calls
spark.sql("""
CREATE TABLE IF NOT EXISTS silver_customers (
  customer_id BIGINT, email STRING, phone STRING, updated_at TIMESTAMP,
  effective_date DATE, end_date DATE, is_current BOOLEAN
) USING DELTA
LOCATION 's3://bucket/silver/customers'
""")

spark.sql("""
MERGE INTO silver_customers t
USING (
  SELECT
    customer_id, email, phone, updated_at
  FROM bronze_customers
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET is_current = FALSE, end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, phone, updated_at, effective_date, is_current)
  VALUES (s.customer_id, s.email, s.phone, s.updated_at, current_date(), TRUE)
""")
# A complete SCD Type 2 load also inserts a new current row for customers whose
# attributes changed (typically via a second merge or a staged union of changes)

# GOLD: Aggregated data for BI
spark.sql("""
CREATE TABLE IF NOT EXISTS gold_customer_summary USING DELTA
LOCATION 's3://bucket/gold/customer_summary'
AS SELECT
  c.customer_id,
  c.email,
  COUNT(DISTINCT o.order_id) as num_orders,
  SUM(o.amount) as lifetime_value,
  MAX(o.order_date) as last_order_date
FROM silver_customers c
LEFT JOIN silver_orders o ON c.customer_id = o.customer_id
WHERE c.is_current = TRUE
GROUP BY c.customer_id, c.email
""")

Key Takeaways

  1. Databricks is best for ML + large-scale ETL, not simple analytics
  2. Delta Lake provides ACID guarantees on data lakes (game-changer)
  3. Use spot instances to reduce costs by 70%
  4. MLflow tracks experiments and manages models end-to-end
  5. For small data (< 10TB), use BigQuery/Snowflake instead
  6. Combine with BI tools (Tableau, Power BI) for dashboards