15 · Write-Ahead Log (WAL) — Durability via Sequential Log Writes

Storage Internals · Topic 15 of 16


What is a WAL?

The Write-Ahead Log is the core durability mechanism in most databases. Every change is appended to the log and flushed to durable storage before it is applied to the actual data pages. If a crash occurs, the database replays the log to recover any changes that had not yet reached the data pages.

"The log is the database." — Martin Kleppmann


How WAL Works

sequenceDiagram
    participant App
    participant DB
    participant WAL
    participant DataPage

    App->>DB: BEGIN; UPDATE accounts SET balance=900 WHERE id=1; COMMIT;
    DB->>WAL: Append log record (LSN, before-image, after-image)
    WAL->>Disk: fsync (flush to durable storage)
    DB->>App: COMMIT OK
    DB->>DataPage: Write dirty page (async, background)

The data page write can be deferred — the WAL ensures the change is durable.
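The ordering in the diagram (append, fsync, acknowledge, then update the page) can be illustrated with a toy key-value store. This is a minimal sketch, not how any particular engine implements its log: the `TinyWAL` class, the JSON-lines record format, and the in-memory `pages` dictionary are all assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pages = {}    # stands in for data pages (hypothetical)
        self.next_lsn = 1
        # Append mode keeps log writes sequential.
        self.log = open(log_path, "a", encoding="utf-8")

    def commit(self, key, new_value):
        record = {
            "lsn": self.next_lsn,
            "key": key,
            "before": self.pages.get(key),
            "after": new_value,
        }
        # 1. Append the log record.
        self.log.write(json.dumps(record) + "\n")
        # 2. Force it to durable storage before acknowledging the commit.
        self.log.flush()
        os.fsync(self.log.fileno())
        # 3. Only now update the "data page"; a real engine would defer
        #    this write to a background process.
        self.pages[key] = new_value
        self.next_lsn += 1
        return record["lsn"]

    def close(self):
        self.log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "tiny.wal")
    db = TinyWAL(path)
    db.commit("account:1", 1000)
    lsn = db.commit("account:1", 900)
    db.close()
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    return lsn, records
```

Calling `demo()` writes two records and reads them back from the log file. The point of the sketch is the order of steps inside `commit`: the caller only gets an LSN back once `os.fsync` has returned.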


WAL Log Record Contents

| Field | Description |
| --- | --- |
| LSN | Log Sequence Number, monotonically increasing |
| Transaction ID | Which transaction made the change |
| Before-image | Old value (used to undo the change on rollback) |
| After-image | New value (used to redo the change) |
| Page ID | Which data page was modified |
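The fields listed above map naturally onto a small record type. The sketch below is one possible shape, assuming string images and a JSON encoding for readability; real engines use compact binary layouts, and the `LogRecord` name and its methods are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    lsn: int                      # Log Sequence Number
    txn_id: int                   # transaction that made the change
    page_id: int                  # data page that was modified
    before_image: Optional[str]   # old value, used for undo
    after_image: Optional[str]    # new value, used for redo

    def encode(self) -> str:
        """Serialize to one line of JSON (illustrative format only)."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def decode(line: str) -> "LogRecord":
        """Rebuild a record from a line produced by encode()."""
        return LogRecord(**json.loads(line))


def is_monotonic(records):
    """Check that LSNs strictly increase from one record to the next."""
    return all(a.lsn < b.lsn for a, b in zip(records, records[1:]))
```

A record survives a round trip through `encode` and `decode` unchanged, and `is_monotonic` gives a cheap sanity check on a sequence of records read back from a log.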

Crash Recovery

On startup after a crash, the DB runs:

  1. Analysis phase: determine which transactions were in-flight
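  2. Redo phase: replay logged changes that had not yet reached the data pages (ARIES-style recovery redoes all of them, including those of in-flight transactions)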
  3. Undo phase: roll back uncommitted transactions
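The three phases can be sketched over an in-memory log. This is a simplified illustration under several assumptions: records are plain dictionaries with hypothetical `type`, `txn`, `page`, `before`, and `after` keys, each page carries the LSN of the last change applied to it, and redo replays every logged update before undo reverts the in-flight ones (an engine could instead redo only committed changes). Real recovery also writes compensation records during undo, which is omitted here.

```python
def recover(log, pages):
    """Run analysis, redo, and undo over `log`, mutating `pages` in place.

    Returns the set of transaction IDs that were rolled back.
    """
    # Analysis: a transaction with no commit record was in flight.
    seen, committed = set(), set()
    for rec in log:
        seen.add(rec["txn"])
        if rec["type"] == "commit":
            committed.add(rec["txn"])
    in_flight = seen - committed

    # Redo: reapply updates the page has not seen yet (page LSN < record LSN).
    for rec in log:
        if rec["type"] != "update":
            continue
        page = pages.setdefault(rec["page"], {"value": None, "lsn": 0})
        if page["lsn"] < rec["lsn"]:
            page["value"] = rec["after"]
            page["lsn"] = rec["lsn"]

    # Undo: walk backwards and restore before-images of in-flight changes.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txn"] in in_flight:
            pages[rec["page"]]["value"] = rec["before"]

    return in_flight


# Transaction 1 committed but its page never reached disk;
# transaction 2 did not commit but its dirty page was flushed.
log = [
    {"lsn": 1, "txn": 1, "type": "update", "page": "A", "before": 1000, "after": 900},
    {"lsn": 2, "txn": 2, "type": "update", "page": "B", "before": 50, "after": 75},
    {"lsn": 3, "txn": 1, "type": "commit"},
]
pages = {"A": {"value": 1000, "lsn": 0}, "B": {"value": 75, "lsn": 2}}
rolled_back = recover(log, pages)
```

After recovery, page A holds the committed value 900 and page B is back to 50, because transaction 2 never committed.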

WAL Beyond Durability

| Use Case | How WAL Helps |
| --- | --- |
| Replication | Replicas stream and replay the WAL |
| CDC (Change Data Capture) | Tools like Debezium read the WAL to stream changes |
| Point-in-time recovery | Replay WAL from a snapshot to any point |
| Logical replication | Decode WAL into SQL-level changes |
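Replication and point-in-time recovery both reduce to the same operation: start from a known state and apply log records in LSN order up to some position. The sketch below assumes a snapshot taken at a known LSN and records with hypothetical `lsn`, `key`, and `after` fields; the `replay` function is illustrative, not any database's API.

```python
def replay(snapshot, snapshot_lsn, wal, target_lsn):
    """Rebuild state as of `target_lsn` from a snapshot plus WAL records.

    Returns the rebuilt state and the LSN of the last record applied.
    """
    state = dict(snapshot)   # never mutate the snapshot itself
    applied = snapshot_lsn
    for rec in sorted(wal, key=lambda r: r["lsn"]):
        if rec["lsn"] <= snapshot_lsn:
            continue         # already reflected in the snapshot
        if rec["lsn"] > target_lsn:
            break            # stop at the requested point in time
        state[rec["key"]] = rec["after"]
        applied = rec["lsn"]
    return state, applied


wal = [
    {"lsn": 1, "key": "balance", "after": 1000},
    {"lsn": 2, "key": "balance", "after": 900},
    {"lsn": 3, "key": "balance", "after": 0},   # e.g. an accidental wipe
]
snapshot = {"balance": 1000}                     # taken at LSN 1
state, applied = replay(snapshot, 1, wal, target_lsn=2)
```

Here recovery stops at LSN 2, just before the unwanted change at LSN 3. A replica following the primary would run the same loop with no upper bound, picking up from the last LSN it applied.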

Cloud Implementations

PostgreSQL
  • WAL is written to the pg_wal/ directory
  • wal_level = logical enables CDC via logical replication slots
  • archive_command ships WAL segments to S3/GCS for PITR
  • Tools: Debezium, pglogical, wal2json

Paxos-replicated stores (for example, Google Cloud Spanner)
  • Write requests go through a Paxos log (the distributed equivalent of a WAL)
  • Each Paxos replica maintains the log; a majority must acknowledge before commit

DynamoDB
  • The WAL is internal to the service and is not exposed externally
  • DynamoDB Streams is the CDC mechanism (it does not give access to the WAL)

Cassandra
  • The Commit Log is the Cassandra equivalent of a WAL
  • It is written before the MemTable update and replayed on node restart
  • commitlog_sync: periodic (default) vs batch (fsync on every write)

MongoDB
  • WiredTiger uses a WAL (the journal)
  • The oplog is a separate, capped collection used for replication (different from the WAL)
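The majority-acknowledgement rule noted above for Paxos-style logs can be sketched in a few lines. This is not Paxos itself (there is no leader election, proposal numbering, or recovery of partially replicated entries); it only shows the commit condition, and the `Replica` class and `replicate` function are hypothetical names.

```python
class Replica:
    """A node that appends entries to its own copy of the log."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []

    def append(self, entry):
        # An unreachable replica cannot acknowledge the write.
        if not self.up:
            return False
        self.log.append(entry)
        return True


def replicate(entry, replicas):
    """Return True if a majority of replicas acknowledged `entry`."""
    acks = sum(1 for r in replicas if r.append(entry))
    majority = len(replicas) // 2 + 1
    return acks >= majority


healthy = [Replica("a"), Replica("b"), Replica("c", up=False)]
degraded = [Replica("a"), Replica("b", up=False), Replica("c", up=False)]
committed_ok = replicate({"lsn": 1, "op": "set balance=900"}, healthy)
committed_bad = replicate({"lsn": 1, "op": "set balance=900"}, degraded)
```

With three replicas, the write commits when two acknowledge it and is rejected when only one does, so the group tolerates the loss of a single node.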