8.02 · Deep Dive: PostgreSQL for Data Warehousing

Level: Intermediate to Advanced
Time to read: 15 min
Pre-reading: 08 · Tools Ecosystem
After reading: You'll understand PostgreSQL's strengths and weaknesses for analytics, the key extensions for data warehousing, and when to use it vs. cloud alternatives.


PostgreSQL: OLTP Database with Analytics Capabilities

Architecture Overview

PostgreSQL Architecture
┌─────────────────────────────────┐
│  Query Parser & Optimizer       │
│  (EXPLAIN ANALYZE)              │
├─────────────────────────────────┤
│  Execution Engine               │
│  (Sequential Scan, Index Scan)  │
├─────────────────────────────────┤
│  Storage Engine                 │
│  (B-tree indexes, TOAST)        │
├─────────────────────────────────┤
│  File System (Local Disk)       │
└─────────────────────────────────┘

Strengths

  • ✅ Open source (free to use, modify, deploy)
  • ✅ Mature and stable (production-ready since the late 1990s)
  • ✅ Strong ACID guarantees (perfect for transactional + small-scale analytics)
  • ✅ Rich SQL (CTEs, window functions, JSON, arrays; see the sketch below)
  • ✅ Extensions (TimescaleDB for time-series, Citus for distribution)
  • ✅ Easy to operate (can run on a $10/month VPS)
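To make the "rich SQL" point concrete, here is a minimal sketch combining a CTE with a window function; the orders table and its columns are hypothetical, not part of the schemas used later in this section.

-- Rank each customer's orders by amount using a CTE + window function
WITH recent_orders AS (
  SELECT customer_id, order_id, amount
  FROM orders
  WHERE created_at >= DATE '2024-01-01'
)
SELECT
  customer_id,
  order_id,
  amount,
  RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
FROM recent_orders;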

Weaknesses for Data Warehousing

  • ❌ Row-oriented storage (not columnar, so analytical scans over large tables are slow)
  • ❌ No built-in distributed query engine (single machine; scaling out requires the Citus extension)
  • ❌ Manual scaling (no auto-scaling; requires DBA intervention)
  • ❌ Practical vertical-scaling ceiling (a single machine's RAM and I/O only go so far)
  • ❌ Full scans by default (without explicit partitioning and indexing, analytical queries read every row; cloud warehouses handle much of this automatically)


When to Use PostgreSQL for Analytics

Good Fit

Scenario | Why
Startup with < 1TB data | Cost-effective, easy to operate
Mixed OLTP + OLAP | Single database for transactions + small analytics
Real-time analytics | Good for sub-second query execution
Data lake + queries | With extensions (Citus, Foreign Data Wrappers)
Learning/prototyping | Free, easy setup, familiar SQL

Poor Fit

Scenario | Why
> 10TB data warehouse | Row-oriented storage makes scans slow; scaling is expensive
High concurrency | Many concurrent analytical queries slow each other down
Cloud-first org | BigQuery/Snowflake integrate better with the rest of the stack
Heavy ML/analytics | No built-in ML or analytics-optimized features

PostgreSQL for Analytics: Optimization Techniques

Technique 1: Distribution and Columnar Storage with Citus

-- Install Citus extension (distributed PostgreSQL)
CREATE EXTENSION citus;

-- Create a distributed table (sharded across worker nodes by customer_id)
SELECT create_distributed_table('fact_sales', 'customer_id');

-- Result: Fact table spread across multiple PostgreSQL nodes
-- Benefits:
-- ✅ Scales to 100TB+
-- ✅ Parallelizes queries across nodes
-- ❌ Requires infrastructure (Citus operator/managed service)
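Citus also ships a columnar access method (Citus 10+), which is what the "columnar storage" in this technique's title refers to. A minimal sketch, with an illustrative events table that is not part of the examples above:

-- Store an append-mostly analytics table in columnar format
CREATE TABLE events_columnar (
  event_id BIGINT,
  event_type TEXT,
  payload JSONB,
  created_at TIMESTAMPTZ
) USING columnar;

-- Columnar storage compresses well and speeds up scans that read only a few
-- columns, but UPDATE/DELETE support is limited compared to regular heap tables.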

Technique 2: TimescaleDB for Time-Series

-- Install TimescaleDB extension
CREATE EXTENSION timescaledb;

-- Create hypertable (auto-partitioned by time)
SELECT create_hypertable('metrics', 'timestamp');

-- TimescaleDB automatically partitions by time
-- Result:
-- ✅ Fast queries on recent data (only scan relevant partition)
-- ✅ Compression for old data (95% smaller)
-- ✅ Time-series functions (time_bucket, time_bucket_gapfill)

-- Example: compress chunks older than 30 days (compression must be enabled on the hypertable first)
ALTER TABLE metrics SET (timescaledb.compress);
SELECT add_compression_policy('metrics', INTERVAL '30 days');

-- Query: only the chunks covering the last 7 days are read (fast!)
SELECT
  time_bucket('1 hour', timestamp) AS bucket,
  AVG(value) AS avg_value
FROM metrics
WHERE timestamp >= now() - INTERVAL '7 days'
GROUP BY bucket;
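The gap-filling function mentioned above is time_bucket_gapfill (with helpers such as locf); a hedged sketch against the same metrics hypertable, assuming a version of TimescaleDB that includes the gapfill functions:

-- Hourly averages over the last day, with empty buckets filled in
SELECT
  time_bucket_gapfill('1 hour', timestamp) AS bucket,
  AVG(value) AS avg_value,
  locf(AVG(value)) AS avg_value_carried_forward  -- carry the last known value into empty buckets
FROM metrics
WHERE timestamp >= now() - INTERVAL '1 day'
  AND timestamp < now()
GROUP BY bucket
ORDER BY bucket;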

Technique 3: Partitioning

-- Declarative range partitioning (without TimescaleDB)

-- Create the parent table, partitioned by sale_date
CREATE TABLE fact_sales (
  sale_id INT,
  sale_date DATE,
  amount DECIMAL,
  customer_id INT
) PARTITION BY RANGE (sale_date);

-- Create one partition per year
CREATE TABLE fact_sales_2024 PARTITION OF fact_sales
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE fact_sales_2025 PARTITION OF fact_sales
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Query automatically uses partition pruning
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31';
-- ✅ Only scans fact_sales_2024
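To confirm that pruning actually happens, EXPLAIN the query; the exact plan text varies by version and settings, but only the 2024 partition should appear.

-- Check which partitions the planner will scan
EXPLAIN (COSTS OFF)
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31';
-- Expect a scan on fact_sales_2024 only; fact_sales_2025 should not appear in the plan.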

Technique 4: Materialized Views

-- Pre-compute aggregate (like a table snapshot)
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as revenue
FROM fact_sales
GROUP BY DATE(sale_date), product_category;

-- Create index on materialized view
-- (a UNIQUE index is required for REFRESH ... CONCURRENTLY below)
CREATE UNIQUE INDEX idx_mv_daily_date
  ON mv_daily_revenue(sale_date, product_category);

-- Query MV instead of raw table (fast!)
SELECT * FROM mv_daily_revenue
WHERE sale_date >= '2024-01-01'
ORDER BY revenue DESC;

-- Refresh when source data changes
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;
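Materialized views do not refresh themselves. One common approach is to schedule the refresh with the pg_cron extension; a minimal sketch, assuming pg_cron is installed and preloaded (the hourly schedule is just an example):

-- Schedule a refresh at minute 5 of every hour
CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule(
  '5 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue'
);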

PostgreSQL vs. Cloud Data Warehouses

Aspect | PostgreSQL | BigQuery | Snowflake
Cost model | Server cost | Pay-per-query | Pay-per-compute
Scaling | Manual (Citus required) | Automatic | Automatic
Storage | Local disk | Google Cloud Storage | Cloud object storage (S3/GCS)
Query speed | Seconds (row-oriented) | Seconds (columnar) | Seconds (columnar)
Setup time | Hours (DBA setup) | Minutes | Minutes
For 100TB | $$$$ (very large servers or a Citus cluster) | Competitive | Competitive
For 100GB | ~$10/month | Often overkill | Often overkill

PostgreSQL for Modern Data Stacks

Architecture: PostgreSQL + Fivetran + dbt

Operational PostgreSQL → Fivetran (change data capture) → Cloud Data Warehouse (BigQuery/Snowflake) → dbt (transformation) → Analytics

Benefits:

  • ✅ Keep OLTP database (PostgreSQL) for transactional data
  • ✅ Copy to cloud DWH for analytics
  • ✅ Use dbt for transformation (SQL-native)
  • ✅ Scale analytics independently of OLTP
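The Fivetran step in this architecture typically reads changes via PostgreSQL logical replication. A hedged sketch of the source-side prerequisites follows; the publication name and table list are placeholders, and the exact setup depends on the CDC tool and hosting environment.

-- Enable logical decoding (requires a restart; on managed services this is a config parameter)
ALTER SYSTEM SET wal_level = 'logical';

-- Publish the tables the CDC tool should replicate
CREATE PUBLICATION analytics_pub FOR TABLE orders, customers;

-- The CDC tool then creates and consumes a replication slot to stream changes.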

Real Example: PostgreSQL Warehouse Query

-- Medallion architecture in PostgreSQL

-- Bronze layer (raw copy from source)
CREATE TABLE bronze_orders (
  order_id INT,
  customer_id INT,
  amount DECIMAL,
  created_at TIMESTAMP,
  _elt_timestamp TIMESTAMP,
  _elt_source VARCHAR(50)
);

-- Silver layer (validated)
CREATE TABLE silver_orders AS
SELECT
  order_id,
  customer_id,
  amount,
  created_at,
  _elt_timestamp
FROM bronze_orders
WHERE amount > 0
  AND customer_id IS NOT NULL
  AND created_at < CURRENT_TIMESTAMP;

-- Create index for common queries
CREATE INDEX idx_silver_orders_customer_date
ON silver_orders(customer_id, created_at DESC);

-- Gold layer (aggregated for dashboards)
CREATE MATERIALIZED VIEW gold_daily_revenue AS
SELECT
  DATE(created_at) as sale_date,
  customer_id,
  COUNT(*) as num_orders,
  SUM(amount) as daily_revenue,
  AVG(amount) as avg_order_value
FROM silver_orders
GROUP BY DATE(created_at), customer_id;

CREATE INDEX idx_gold_date ON gold_daily_revenue(sale_date);

-- Query dashboard (uses MV, very fast)
SELECT *
FROM gold_daily_revenue
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY daily_revenue DESC
LIMIT 10;

PostgreSQL Extensions for Data Engineering

Extension | Purpose | Use Case
Citus | Distributed queries | Scale to 100TB+
TimescaleDB | Time-series compression | Metrics, monitoring data
PostGIS | Geospatial queries | Location-based analytics
Foreign Data Wrappers | Query external databases | Join PostgreSQL with S3 or BigQuery data
PL/Python | Python functions in SQL | Complex transformations
pg_partman | Partition management | Auto-partition tables
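As an illustration of the Foreign Data Wrapper row, here is a minimal postgres_fdw sketch for querying another PostgreSQL database as if it were local; the server address, credentials, and remote table name are placeholders.

-- Query a remote PostgreSQL database through a foreign table
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER reporting_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'reporting.internal', dbname 'reports', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER reporting_db
  OPTIONS (user 'readonly', password 'secret');

-- Import a remote table into the local schema, then query it directly
IMPORT FOREIGN SCHEMA public LIMIT TO (remote_sales)
  FROM SERVER reporting_db INTO public;

SELECT * FROM remote_sales LIMIT 10;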

Key Takeaways

  1. PostgreSQL is OLTP-first (good for transactional, small analytics)
  2. For large data warehouses, use cloud solutions (BigQuery, Snowflake)
  3. Citus extension enables distributed PostgreSQL (alternative to cloud)
  4. TimescaleDB optimizes for time-series data
  5. Materialized views pre-compute aggregations
  6. Partitioning improves query performance on large tables