15 · Write-Ahead Log (WAL) — Durability via Sequential Log Writes
Storage Internals · Topic 15 of 16
What is a WAL?
The write-ahead log (WAL) is the core durability mechanism in most databases. Every change is appended to the log, and that log record is flushed to durable storage, before the modified data page is written to disk. After a crash, the database replays the log to recover changes that had not yet reached the data pages.
"The log is the database." — Martin Kleppmann
How WAL Works
sequenceDiagram
participant App
participant DB
participant WAL
participant DataPage
App->>DB: BEGIN; UPDATE accounts SET balance=900 WHERE id=1; COMMIT;
DB->>WAL: Append log record (LSN, before-image, after-image)
WAL->>Disk: fsync (flush to durable storage)
DB->>App: COMMIT OK
DB->>DataPage: Write dirty page (async, background)
The data page write can be deferred — the WAL ensures the change is durable.
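The ordering in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine implements its log: `TinyWAL`, `read_log`, the JSON-lines record format, and the `pages`/`dirty` dictionaries are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class TinyWAL:
    """Append-only log: a record is durable once append() returns."""

    def __init__(self, path):
        self.path = path
        # Append mode: every write lands at the end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.next_lsn = 1

    def append(self, txid, key, before, after):
        lsn = self.next_lsn
        record = {"lsn": lsn, "txid": txid, "key": key,
                  "before": before, "after": after}
        os.write(self.fd, (json.dumps(record) + "\n").encode("utf-8"))
        os.fsync(self.fd)  # flush to durable storage before acknowledging
        self.next_lsn += 1
        return lsn

    def close(self):
        os.close(self.fd)


def read_log(path):
    """Return all records in the order they were appended."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage: the commit is acknowledged after the log write, before the page write.
wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(wal_path)
pages = {"accounts:1": 1000}      # in-memory stand-in for data pages
dirty = {}                        # changes not yet written to the "pages"

lsn = wal.append(txid=7, key="accounts:1", before=1000, after=900)
dirty["accounts:1"] = 900         # buffered change, flushed later
commit_acknowledged = True        # safe: the log record is already on disk

pages.update(dirty)               # deferred background write
dirty.clear()
wal.close()
```

The point of the sketch is the order of operations: the `os.fsync` call completes before the commit is acknowledged, while the page update can happen at any later time.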
WAL Log Record Contents
| Field | Description |
|---|---|
| LSN | Log Sequence Number — monotonically increasing |
| Transaction ID | Which transaction made the change |
| Before-image | Old value (for rollback) |
| After-image | New value (for redo) |
| Page ID | Which data page was modified |
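To make the fields in the table concrete, here is one possible binary layout for a log record. The field widths, the little-endian encoding, and the trailing CRC32 checksum are assumptions made for this sketch; real engines each define their own record format.

```python
import struct
import zlib
from dataclasses import dataclass

# Fixed header: LSN (u64), transaction ID (u64), page ID (u32),
# before-image length (u32), after-image length (u32).
HEADER = struct.Struct("<QQIII")


@dataclass(frozen=True)
class LogRecord:
    lsn: int
    txid: int
    page_id: int
    before: bytes
    after: bytes

    def encode(self) -> bytes:
        body = HEADER.pack(self.lsn, self.txid, self.page_id,
                           len(self.before), len(self.after))
        body += self.before + self.after
        # Trailing CRC32 (an addition for this sketch) lets a reader
        # detect a torn or corrupted record.
        return body + struct.pack("<I", zlib.crc32(body))

    @classmethod
    def decode(cls, data: bytes) -> "LogRecord":
        body, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
        if zlib.crc32(body) != crc:
            raise ValueError("checksum mismatch: record is corrupt")
        lsn, txid, page_id, blen, alen = HEADER.unpack_from(body)
        start = HEADER.size
        before = body[start:start + blen]
        after = body[start + blen:start + blen + alen]
        return cls(lsn, txid, page_id, before, after)


rec = LogRecord(lsn=42, txid=7, page_id=3, before=b"1000", after=b"900")
raw = rec.encode()
```

Storing both images in the same record is what allows a single log to serve rollback (before-image) and redo (after-image).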
Crash Recovery
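On startup after a crash, a database using ARIES-style recovery runs three phases in order:
- Analysis phase: scan the log to determine which transactions were in flight and which pages may be dirty
- Redo phase: replay logged changes that had not yet reached the data pages (this repeats history, including changes made by transactions that never committed)
- Undo phase: roll back the changes of uncommitted transactions using their before-images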
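The three phases can be sketched over an in-memory log. This is a simplified illustration under stated assumptions: the record dictionaries, the `recover` function, and the per-page `page_lsn` map are hypothetical, and details of real recovery such as checkpoints and compensation log records are left out.

```python
def recover(log, pages, page_lsn):
    """Recover `pages` in place from `log` (a list of records in LSN order).

    page_lsn maps page id -> LSN of the last change already on disk.
    """
    # Analysis: transactions that appear in the log without a COMMIT are losers.
    seen, committed = set(), set()
    for rec in log:
        seen.add(rec["txid"])
        if rec["type"] == "commit":
            committed.add(rec["txid"])
    losers = seen - committed

    # Redo: repeat history for every update the page has not seen yet.
    for rec in log:
        if rec["type"] == "update" and rec["lsn"] > page_lsn.get(rec["page"], 0):
            pages[rec["page"]] = rec["after"]
            page_lsn[rec["page"]] = rec["lsn"]

    # Undo: restore before-images of loser updates, newest first.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txid"] in losers:
            pages[rec["page"]] = rec["before"]

    return losers


log = [
    {"type": "update", "lsn": 1, "txid": 1, "page": "A", "before": 1000, "after": 900},
    {"type": "update", "lsn": 2, "txid": 2, "page": "B", "before": 50, "after": 75},
    {"type": "commit", "lsn": 3, "txid": 1},
    {"type": "update", "lsn": 4, "txid": 2, "page": "B", "before": 75, "after": 80},
    # crash here: transaction 2 never committed
]
# State found on disk: page B was flushed after LSN 2, page A was never flushed.
pages = {"A": 1000, "B": 75}
page_lsn = {"A": 0, "B": 2}
losers = recover(log, pages, page_lsn)
```

After recovery, page A holds the committed value 900 and page B is back to 50, its value before transaction 2 touched it.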
WAL Beyond Durability
| Use Case | How WAL Helps |
|---|---|
| Replication | Replicas stream and replay the WAL |
| CDC (Change Data Capture) | Tools like Debezium read the WAL to stream changes |
| Point-in-time recovery | Restore a snapshot (base backup), then replay the WAL up to the chosen point in time |
| Logical replication | Decode WAL into row-level changes (inserts, updates, deletes) |
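Replication and point-in-time recovery both come down to applying log records on top of a known starting state. The sketch below assumes a simplified WAL of `(lsn, key, after_value)` tuples containing only committed changes; `replay` and its parameters are hypothetical names for this example.

```python
def replay(snapshot, snapshot_lsn, wal, target_lsn=None):
    """Return a new state built from `snapshot` plus WAL records.

    Records at or below snapshot_lsn are already reflected in the snapshot;
    records above target_lsn are ignored (point-in-time recovery).
    """
    state = dict(snapshot)
    applied_lsn = snapshot_lsn
    for lsn, key, after in wal:
        if lsn <= snapshot_lsn:
            continue
        if target_lsn is not None and lsn > target_lsn:
            break
        state[key] = after
        applied_lsn = lsn
    return state, applied_lsn


wal = [
    (1, "accounts:1", 1000),
    (2, "accounts:2", 500),
    (3, "accounts:1", 900),
    (4, "accounts:2", 0),      # e.g. an accidental write we want to recover from
]
snapshot, snapshot_lsn = {"accounts:1": 1000}, 1

# A replica replays everything; PITR stops at the chosen LSN.
replica_state, replica_lsn = replay(snapshot, snapshot_lsn, wal)
pitr_state, pitr_lsn = replay(snapshot, snapshot_lsn, wal, target_lsn=3)
```

Here the target is expressed as an LSN for simplicity; a real system would typically map a timestamp to a position in the log.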
Cloud Implementations
PostgreSQL
- WAL written to the `pg_wal/` directory
- `wal_level = logical` enables CDC via logical replication slots
- `archive_command` ships WAL to S3/GCS for PITR
- Tools: Debezium, pglogical, wal2json
Google Cloud Spanner
- Write requests go through a Paxos log (the distributed equivalent of a WAL)
- Each Paxos replica maintains the log; a majority must acknowledge a write before it commits
Amazon DynamoDB
- The WAL is internal to the service and not exposed externally
- DynamoDB Streams is the CDC mechanism (it does not provide WAL access)
Apache Cassandra
- The commit log is the Cassandra equivalent of a WAL
- Written before the MemTable update; replayed on node restart
- `commitlog_sync`: `periodic` (default) vs `batch` (fsync on every write)
MongoDB
- WiredTiger uses a WAL (the journal)
- The oplog is a separate capped collection used for replication (distinct from the journal)
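The majority-acknowledgement rule mentioned above for Paxos-replicated logs can be illustrated with a small sketch. This shows only the quorum check, not Paxos itself (there is no leader election, ballot numbering, or retry logic), and `Replica` and `replicate` are hypothetical names.

```python
class Replica:
    """Hypothetical replica that appends to its own in-memory log."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.log = []

    def append(self, entry):
        if not self.available:
            return False          # simulate an unreachable replica
        self.log.append(entry)
        return True


def replicate(entry, replicas):
    """Return True if a strict majority of replicas acknowledged the entry."""
    majority = len(replicas) // 2 + 1
    acks = sum(1 for r in replicas if r.append(entry))
    return acks >= majority


entry = {"lsn": 1, "key": "accounts:1", "after": 900}

# Two of three replicas reachable: the write can commit.
healthy = [Replica("a"), Replica("b"), Replica("c", available=False)]
committed_with_one_down = replicate(entry, healthy)

# One of three replicas reachable: no majority, so the write must not commit.
degraded = [Replica("a"), Replica("b", available=False), Replica("c", available=False)]
committed_with_two_down = replicate(entry, degraded)
```

With three replicas, the log tolerates one unavailable replica; losing two blocks commits rather than risking divergent logs.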