8.02 · Deep Dive: PostgreSQL for Data Warehousing
Level: Intermediate to Advanced
Time to read: 15 min
Pre-reading: 08 · Tools Ecosystem
After reading: You'll understand PostgreSQL's strengths and weaknesses for analytics, key extensions for warehousing, and when to use it vs. cloud alternatives.
PostgreSQL: OLTP Database with Analytics Capabilities
Architecture Overview
PostgreSQL Architecture
┌─────────────────────────────────┐
│ Query Parser & Optimizer │
│ (EXPLAIN ANALYZE) │
├─────────────────────────────────┤
│ Execution Engine │
│ (Sequential Scan, Index Scan) │
├─────────────────────────────────┤
│ Storage Engine │
│ (B-tree indexes, TOAST) │
├─────────────────────────────────┤
│ File System (Local Disk) │
└─────────────────────────────────┘
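The parser/optimizer layer is directly observable with EXPLAIN ANALYZE, which prints the chosen plan along with actual row counts and timings per node. A minimal sketch, assuming a `fact_sales` table like the ones used later in this chapter:

```sql
-- Show how the planner executes an analytical aggregate
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY customer_id;
-- In the output, look for "Seq Scan" (full table read) vs. "Index Scan",
-- and compare estimated vs. actual row counts per plan node.
```

A large gap between estimated and actual rows usually means stale statistics (`ANALYZE fact_sales;` fixes it) and is a common cause of slow analytical plans.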
Strengths
✅ Open source (free to use, modify, deploy)
✅ Mature and stable (production-ready since 1996)
✅ Strong ACID guarantees (perfect for transactional + small-scale analytics)
✅ Rich SQL (CTEs, window functions, JSON, arrays)
✅ Extensions (TimescaleDB for time-series, Citus for distribution)
✅ Easy to operate (can run on a $10/month VPS)
Weaknesses for Data Warehousing
❌ Row-oriented storage (not columnar → analytical scans over wide tables are slow)
❌ No built-in distributed query execution (single machine; scaling out requires the Citus extension)
❌ Manual scaling (no auto-scaling; requires DBA intervention)
❌ Practical vertical-scaling ceiling (a single machine's CPU, RAM, and I/O eventually cap throughput)
❌ Full-table scans by default (without declarative partitioning or indexes, analytical queries read every row)
When to Use PostgreSQL for Analytics
Good Fit
| Scenario | Why |
|---|---|
| Startup with < 1TB data | Cost-effective, easy to operate |
| Mixed OLTP + OLAP | Single database for transactions + small analytics |
| Real-time analytics | Good for sub-second query execution |
| Data lake + queries | With extensions (Citus, Foreign Data Wrappers) |
| Learning/prototyping | Free, easy setup, familiar SQL |
Poor Fit
| Scenario | Why |
|---|---|
| > 10TB data warehouse | Row-oriented = slow, expensive scaling |
| High concurrency | Many concurrent analytical queries slow each other down |
| Cloud-first org | BigQuery/Snowflake better integration |
| Heavy ML/analytics | No built-in ML or analytics-optimized features (cf. BigQuery ML) |
PostgreSQL for Analytics: Optimization Techniques
Technique 1: Distributed Tables with Citus
-- Install Citus extension (distributed PostgreSQL)
CREATE EXTENSION citus;
-- Create distributed table (partitioned across nodes)
SELECT create_distributed_table('fact_sales', 'customer_id');
-- Result: Fact table spread across multiple PostgreSQL nodes
-- Benefits:
-- ✅ Scales to 100TB+
-- ✅ Parallelizes queries across nodes
-- ❌ Requires infrastructure (Citus operator/managed service)
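Citus 10+ also ships a columnar table access method, which pairs compression with the distribution shown above. A minimal sketch (assumption: your Citus version includes the columnar access method; the archive table name is illustrative):

```sql
-- Columnar storage: rows are stored column-by-column and compressed,
-- so aggregate scans read far less data than heap (row) storage
CREATE TABLE fact_sales_archive (
    sale_id     BIGINT,
    sale_date   DATE,
    amount      DECIMAL,
    customer_id BIGINT
) USING columnar;
```

Columnar tables trade update/delete flexibility for much cheaper analytical scans, so they fit append-only archive data best.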
Technique 2: TimescaleDB for Time-Series
-- Install TimescaleDB extension
CREATE EXTENSION timescaledb;
-- Create hypertable (auto-partitioned by time)
SELECT create_hypertable('metrics', 'timestamp');
-- TimescaleDB automatically partitions by time
-- Result:
-- ✅ Fast queries on recent data (only the relevant chunks are scanned)
-- ✅ Native compression for old data (often 90%+ smaller)
-- ✅ Time-series functions (time_bucket, time_bucket_gapfill)
-- Example: compress chunks older than 30 days
-- (compression must be enabled on the hypertable first)
ALTER TABLE metrics SET (timescaledb.compress);
SELECT add_compression_policy('metrics', INTERVAL '30 days');
-- Query: only reads the last 7 days of chunks (fast!)
SELECT time_bucket('1 hour', timestamp) AS bucket, AVG(value)
FROM metrics
WHERE timestamp >= now() - INTERVAL '7 days'
GROUP BY bucket;
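Time-series data often has holes (no metrics during an outage), and a plain GROUP BY silently skips empty buckets. TimescaleDB's time_bucket_gapfill emits those buckets explicitly; a sketch, assuming the same `metrics` table:

```sql
-- Emit one row per hour even when no data arrived in that hour;
-- empty buckets get avg_value = NULL (interpolation is also available)
SELECT
    time_bucket_gapfill('1 hour', timestamp,
                        now() - INTERVAL '7 days', now()) AS bucket,
    AVG(value) AS avg_value
FROM metrics
WHERE timestamp >= now() - INTERVAL '7 days'
  AND timestamp < now()
GROUP BY bucket
ORDER BY bucket;
```

Explicit NULL buckets let dashboards draw gaps instead of misleadingly connecting points across an outage.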
Technique 3: Partitioning
-- Declarative partitioning (PostgreSQL 10+, without TimescaleDB)
-- Create the parent table, declaring the partition key
CREATE TABLE fact_sales (
    sale_id INT,
    sale_date DATE NOT NULL,
    amount DECIMAL,
    customer_id INT
) PARTITION BY RANGE (sale_date);
-- Create one partition per year
CREATE TABLE fact_sales_2024 PARTITION OF fact_sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE fact_sales_2025 PARTITION OF fact_sales
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Query automatically uses partition pruning
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31';
-- ✅ Only scans fact_sales_2024
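One operational detail: with range partitioning, an INSERT whose sale_date falls outside every declared range fails outright. A DEFAULT partition (PostgreSQL 11+) catches such rows; a sketch, assuming `fact_sales` is declared with `PARTITION BY RANGE (sale_date)`:

```sql
-- Catch-all partition for rows outside every defined range,
-- so loads with unexpected dates don't error out
CREATE TABLE fact_sales_default PARTITION OF fact_sales DEFAULT;
```

Rows landing in the default partition are worth monitoring: they usually signal that next year's partition hasn't been created yet.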
Technique 4: Materialized Views
-- Pre-compute aggregate (like a table snapshot)
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
    sale_date,
    product_category,
    SUM(amount) AS revenue
FROM fact_sales
GROUP BY sale_date, product_category;
-- Create an index on the materialized view
-- (REFRESH ... CONCURRENTLY below requires a UNIQUE index)
CREATE UNIQUE INDEX idx_mv_daily_unique
    ON mv_daily_revenue(sale_date, product_category);
-- Query MV instead of raw table (fast!)
SELECT * FROM mv_daily_revenue
WHERE sale_date >= '2024-01-01'
ORDER BY revenue DESC;
-- Refresh when source data changes
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;
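Refreshes are usually scheduled rather than run by hand. A sketch using the pg_cron extension (assumption: pg_cron is installed and listed in shared_preload_libraries; the job name is illustrative):

```sql
-- Schedule an hourly refresh inside the database itself
CREATE EXTENSION IF NOT EXISTS pg_cron;
SELECT cron.schedule(
    'refresh-daily-revenue',   -- job name
    '5 * * * *',               -- five minutes past every hour
    $$REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue$$
);
```

An external scheduler (cron, Airflow) issuing the same REFRESH statement works just as well if you'd rather not preload an extension.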
PostgreSQL vs. Cloud Data Warehouses
| Aspect | PostgreSQL | BigQuery | Snowflake |
|---|---|---|---|
| Cost model | Server cost | Pay-per-query | Pay-per-compute |
| Scaling | Manual (Citus required) | Auto | Auto |
| Storage | Local disk | Google Cloud Storage | S3/GCS |
| Query speed | Seconds to minutes (row-oriented) | Seconds (columnar) | Seconds (columnar) |
| Setup time | Hours (DBA setup) | Minutes | Minutes |
| For 100TB | $$$$ server | Competitive | Competitive |
| For 100GB | ~$10/month VPS | Cheap (pay-per-scan) | Cheap, but per-second warehouse billing adds up if left running |
PostgreSQL for Modern Data Stacks
Architecture: PostgreSQL + Fivetran + dbt
Operational PostgreSQL
↓
Fivetran (change data capture)
↓
Cloud Data Warehouse (BigQuery/Snowflake)
↓
dbt (transformation)
↓
Analytics
Benefits:
- ✅ Keep OLTP database (PostgreSQL) for transactional data
- ✅ Copy to cloud DWH for analytics
- ✅ Use dbt for transformation (SQL-native)
- ✅ Scale analytics independently of OLTP
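On the PostgreSQL side, CDC tools like Fivetran typically read the write-ahead log via logical decoding. A minimal sketch of what the source database needs (assumption: `orders` and `customers` are the tables to replicate; the publication and slot names are made up):

```sql
-- Prerequisite in postgresql.conf: wal_level = logical
-- Declare which tables the CDC tool may follow
CREATE PUBLICATION analytics_pub FOR TABLE orders, customers;
-- The CDC tool creates a replication slot and streams changes from it
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
```

Replication slots retain WAL until it is consumed, so an abandoned slot can fill the disk; drop slots for decommissioned pipelines.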
Real Example: PostgreSQL Warehouse Query
-- Medallion architecture in PostgreSQL
-- Bronze layer (raw copy from source)
CREATE TABLE bronze_orders (
order_id INT,
customer_id INT,
amount DECIMAL,
created_at TIMESTAMP,
_elt_timestamp TIMESTAMP,
_elt_source VARCHAR(50)
);
-- Silver layer (validated)
CREATE TABLE silver_orders AS
SELECT
order_id,
customer_id,
amount,
created_at,
_elt_timestamp
FROM bronze_orders
WHERE amount > 0
AND customer_id IS NOT NULL
AND created_at < CURRENT_TIMESTAMP;
-- Create index for common queries
CREATE INDEX idx_silver_orders_customer_date
ON silver_orders(customer_id, created_at DESC);
-- Gold layer (aggregated for dashboards)
CREATE MATERIALIZED VIEW gold_daily_revenue AS
SELECT
    created_at::date AS sale_date,
    customer_id,
    COUNT(*) AS num_orders,
    SUM(amount) AS daily_revenue,
    AVG(amount) AS avg_order_value
FROM silver_orders
GROUP BY created_at::date, customer_id;
CREATE INDEX idx_gold_date ON gold_daily_revenue(sale_date);
-- Query dashboard (uses MV, very fast)
SELECT *
FROM gold_daily_revenue
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY daily_revenue DESC
LIMIT 10;
PostgreSQL Extensions for Data Engineering
| Extension | Purpose | Use Case |
|---|---|---|
| Citus | Distributed queries | Scale to 100TB+ |
| TimescaleDB | Time-series compression | Metrics, monitoring data |
| PostGIS | Geospatial queries | Location-based analytics |
| Foreign Data Wrapper | Query external databases | Join PostgreSQL + S3 + BigQuery |
| PL/Python | Python functions in SQL | Complex transformations |
| pg_partman | Partition management | Auto-partition tables |
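As an example of the Foreign Data Wrapper row above, here is a sketch using postgres_fdw (shipped with PostgreSQL as a contrib module) to query a second PostgreSQL database as if its tables were local. Server, host, schema, and credential names are all illustrative:

```sql
CREATE EXTENSION postgres_fdw;
-- Point at the remote database
CREATE SERVER remote_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'dwh.internal', dbname 'analytics');
CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_dwh
    OPTIONS (user 'readonly', password 'secret');
-- Mirror the remote tables into a local schema
CREATE SCHEMA remote;
IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_dwh INTO remote;
-- Now remote tables can be joined with local ones:
-- SELECT * FROM remote.fact_sales JOIN local_dim_customer USING (customer_id);
```

Other FDWs follow the same SERVER / USER MAPPING / FOREIGN TABLE pattern for sources such as files or object storage.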
Key Takeaways
- PostgreSQL is OLTP-first (good for transactional, small analytics)
- For large data warehouses, use cloud solutions (BigQuery, Snowflake)
- Citus extension enables distributed PostgreSQL (alternative to cloud)
- TimescaleDB optimizes for time-series data
- Materialized views pre-compute aggregations
- Partitioning improves query performance on large tables