Write-Ahead Logging: What System Design Interviews Actually Test

Your database is midway through writing a payment record. Power cut. Server reboots. What happens to that write?

If you designed your storage layer correctly, nothing bad. The transaction either committed fully or it didn't. Write-ahead logging (WAL) is why that guarantee is achievable, and understanding it will sharpen your answers on every system design question that touches durability, replication, or crash recovery.

The Problem: Durability Without Flushing Everything

A naive approach to durability would flush every modified data page to disk synchronously before acknowledging a write. That works, but your throughput is going to look like a DMV queue on a Monday morning.

Data pages live at random locations on disk. Random writes on a spinning disk can cost 10 to 100 times more than sequential writes, because each one pays for a seek. On an SSD you avoid seek latency, but scattered small writes still cause write amplification inside the device.

WAL flips the ordering. Instead of flushing the data page immediately, you write a small log record to a sequential append-only log file first. Once that log entry hits durable storage, you acknowledge the write to the client. The actual data page gets updated in the background. If a crash happens before the data page is updated, recovery replays the log to bring the page current.

A log record is typically tens to hundreds of bytes. Appending to the tail of a log file is one fast sequential write. Flushing the full dirty data page is a random write that might touch a 4 KB or 8 KB block somewhere else on disk. WAL trades the expensive random write for a cheap sequential write, then defers the random write to whenever it's convenient. This is the kind of beautiful laziness that actually scales.

How WAL Works, Step by Step

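Every log record gets a Log Sequence Number (LSN): a monotonically increasing value that identifies the record's position in the WAL stream. In PostgreSQL, for example, it is a 64-bit byte offset. Each data page stores the LSN of the last log record that modified it, so recovery can tell whether a given record has already been applied to that page.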
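To make the LSN idea concrete, here is a minimal Python sketch. It assumes an LSN is the byte offset just past the record, so that 0 can mean "never modified". `LogStream` and `needs_redo` are hypothetical names for illustration, not any database's API.

```python
import struct


class LogStream:
    """Toy in-memory WAL stream. An LSN is the byte offset just past a record."""

    def __init__(self):
        self.stream = bytearray()

    def append(self, payload: bytes) -> int:
        # Length-prefix the payload so records can be walked back out later.
        self.stream += struct.pack("<I", len(payload)) + payload
        return len(self.stream)  # LSN: grows with every append, never reused


def needs_redo(page_lsn: int, record_lsn: int) -> bool:
    # A page already reflects every record whose LSN is <= its stored page LSN.
    return record_lsn > page_lsn


log = LogStream()
first = log.append(b"set balance=70")
second = log.append(b"set balance=50")
```

Because LSNs only grow, comparing a record's LSN against the page's stored LSN is enough to make replay idempotent: applying the log twice changes nothing the second time.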

A write follows this sequence:

1. The transaction generates a log record (what page, what bytes changed, old value, new value).
2. That record goes into the WAL buffer in memory.
3. Before the transaction commits, the WAL buffer is flushed to disk with fsync().
4. The commit is acknowledged to the client.
5. The data page is written to disk later, lazily.

Step 3 is the guarantee. Once the log entry survives to durable storage, the change is durable. You can always reconstruct the data page from the log. fsync() is the unglamorous hero nobody names their startup after, but it's what separates "probably fine" from "actually fine."
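The sequence above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real engine lays out its log: `TinyWAL`, `replay`, and the JSON-lines record format are all invented for illustration, and the lazy page write-back (step 5) is left out.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy write path: log record -> WAL buffer -> fsync -> acknowledge."""

    def __init__(self, path):
        self.log = open(path, "ab")  # append-only log file
        self.buffer = []             # WAL buffer in memory
        self.pages = {}              # cached data pages, written back lazily

    def write(self, page_id, old, new):
        record = {"page": page_id, "old": old, "new": new}  # step 1
        self.buffer.append(record)                          # step 2
        self.pages[page_id] = new    # only the cached page changes for now

    def commit(self):
        payload = "".join(json.dumps(r) + "\n" for r in self.buffer).encode()
        self.log.write(payload)
        self.log.flush()             # push Python's buffer to the OS
        os.fsync(self.log.fileno())  # step 3: ask the OS to persist it
        self.buffer.clear()
        return "ack"                 # step 4


def replay(path):
    """Rebuild page contents from whatever reached the log file."""
    pages = {}
    with open(path, "rb") as f:
        for line in f:
            record = json.loads(line)
            pages[record["page"]] = record["new"]
    return pages


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(wal_path)
wal.write("acct:1", "balance=100", "balance=70")
before_commit = replay(wal_path)  # nothing is durable yet
wal.commit()
after_commit = replay(wal_path)   # the change can now be reconstructed
```

Note that the data page never touches disk here, yet the committed change survives: everything needed to rebuild it is in the log.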

The database keeps a Dirty Page Table tracking which pages have been modified but not yet flushed. Periodically, a checkpoint process writes all dirty pages to disk and records a checkpoint LSN in the log. Recovery only replays log records from the last checkpoint forward, which is why checkpoint frequency matters for recovery time.
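Here is a rough sketch of how a checkpoint bounds replay work. It assumes a simple "flush everything" checkpoint and integer LSNs; `BufferPool` and its methods are hypothetical, and real engines use fuzzier checkpoints that avoid stalling writers.

```python
class BufferPool:
    """Toy buffer pool with a Dirty Page Table and a flush-all checkpoint."""

    def __init__(self):
        self.log = []            # (lsn, page_id, value), append-only
        self.memory = {}         # page cache
        self.disk = {}           # pages as they exist on disk
        self.dirty = {}          # Dirty Page Table: page_id -> first unflushed LSN
        self.checkpoint_lsn = 0  # replay starts after this LSN
        self.next_lsn = 1

    def update(self, page_id, value):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log.append((lsn, page_id, value))
        self.memory[page_id] = value
        self.dirty.setdefault(page_id, lsn)  # remember the oldest unflushed change
        return lsn

    def checkpoint(self):
        for page_id in list(self.dirty):
            self.disk[page_id] = self.memory[page_id]  # flush the dirty page
        self.dirty.clear()
        self.checkpoint_lsn = self.next_lsn - 1        # record the safe replay point

    def records_to_replay(self):
        # After a crash, only records newer than the checkpoint need replaying.
        return [r for r in self.log if r[0] > self.checkpoint_lsn]
```

The longer you go between `checkpoint()` calls, the longer the list from `records_to_replay()` gets, which is the recovery-time tradeoff in miniature.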

After a Crash: ARIES in Three Phases

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM and published in 1992, is the crash recovery algorithm that many production databases implement or adapt. Three phases. IBM gave it an acronym that sounds like a zodiac sign, which is either deeply on-brand for enterprise software or completely accidental.

Analysis. Read the log from the last checkpoint to the end, reconstructing which transactions were in flight and which pages were dirty.

Redo. Replay all log records starting from the oldest change that might not have reached disk (the smallest recovery LSN in the dirty page table), regardless of whether those transactions committed. This brings the database to the exact physical state it was in just before the crash, including changes made by transactions that never committed.

Undo. For every transaction that was active at crash time but never committed, reverse its changes using the before-images in the log. Partial transactions disappear.

The key idea behind ARIES is its "steal, no-force" buffer policy. No-force means you don't have to flush dirty data pages before commit, which is why redo is needed. Steal means you can evict dirty pages to disk before the transaction commits, which is why undo is needed. Together these give maximum flexibility to the buffer pool manager without sacrificing correctness. Most of the complexity you encounter in real database source code traces back to making these two properties hold under every failure mode you can imagine.
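A toy version of the three phases fits in a short Python function. This sketch is heavily simplified and the record shapes are invented: it skips checkpoints, page-LSN comparisons, and compensation log records, all of which real ARIES relies on.

```python
def recover(log, disk):
    """Toy ARIES-style recovery.

    log:  list of ("update", txn, page, before, after) or ("commit", txn)
    disk: page -> value as found on disk after the crash
    """
    # Analysis: work out which transactions never committed ("losers").
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    started = {rec[1] for rec in log if rec[0] == "update"}
    losers = started - committed

    # Redo: repeat history for every update, committed or not.
    pages = dict(disk)
    for rec in log:
        if rec[0] == "update":
            _, _txn, page, _before, after = rec
            pages[page] = after

    # Undo: walk backwards, restoring before-images of loser transactions.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] in losers:
            _, _txn, page, before, _after = rec
            pages[page] = before

    return pages, losers


crash_log = [
    ("update", "T1", "A", 10, 20),
    ("update", "T2", "B", 5, 7),
    ("commit", "T1"),
    ("update", "T2", "A", 20, 99),  # T2 never commits
]
# Page B was "stolen" to disk with T2's uncommitted value; A was never flushed.
pages, losers = recover(crash_log, {"A": 10, "B": 7})
```

T1's committed change to A survives because of redo (no-force), and T2's stolen write to B is rolled back because of undo (steal).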

WAL Looks Different Depending on the System

PostgreSQL

PostgreSQL stores WAL in pg_wal/ as segment files (16 MB each by default). It also uses WAL for streaming replication: replicas connect to the primary and consume the WAL stream in real time. Replication slots track how far each replica has consumed, preventing the primary from deleting segments the replica still needs.

One important knob is synchronous_commit. Set to on (the default), the primary waits for its WAL to be flushed to disk before acknowledging the commit. Set to off, the write returns faster but you accept a window of up to 3 × wal_writer_delay where a crash could lose committed transactions. This is not a consistency risk. It's a durability tradeoff: the database comes back consistent, it just may have forgotten commits it already acknowledged. Your users' payment confirmations evaporating on restart is a different kind of problem than inconsistency, and the distinction is worth knowing for interviews.
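As a quick back-of-envelope, assuming PostgreSQL's documented default wal_writer_delay of 200 ms (the function name below is made up for illustration):

```python
def max_async_commit_loss_window_ms(wal_writer_delay_ms: float = 200.0) -> float:
    """Upper bound on the loss window with synchronous_commit = off."""
    return 3 * wal_writer_delay_ms


default_window_ms = max_async_commit_loss_window_ms()      # 600.0 with the default
tuned_window_ms = max_async_commit_loss_window_ms(10.0)    # 30.0 with a 10 ms delay
```

So with default settings, a crash can cost you roughly the last 0.6 seconds of acknowledged commits. Whether that is acceptable depends entirely on what those commits were.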

SQLite WAL Mode

SQLite's default mode uses a rollback journal. WAL mode, enabled with PRAGMA journal_mode = WAL, creates a separate -wal file where modified pages are appended rather than written in place. Readers see a consistent snapshot built from the original database file plus any WAL frames committed before their read began, while the writer appends new frames. Readers and the writer do not block each other, though there is still only one writer at a time.

A checkpoint periodically copies WAL frames back into the main database file. By default this triggers when the WAL reaches 1,000 pages. SQLite WAL mode does not work over network filesystems because readers and writers coordinate through a shared-memory index (the -shm file), which requires every process to be on the same host. This is the kind of thing you discover at 2am in production, not during testing.
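You can try this with Python's built-in sqlite3 module. The sketch below uses a throwaway database in a temp directory; the `payments` table is just an example schema.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)

# Switch to WAL mode; the pragma returns the journal mode now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Auto-checkpoint threshold in pages (1000 unless the build overrides it).
autocheckpoint_pages = conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0]

conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO payments (amount) VALUES (42)")
conn.commit()

# The committed row lives in the -wal file until a checkpoint copies it back.
wal_exists = os.path.exists(db_path + "-wal")
rows = conn.execute("SELECT amount FROM payments").fetchall()

# Ask for a checkpoint explicitly instead of waiting for the threshold.
checkpoint_result = conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
conn.close()
```

The mode is stored in the database file, so other connections that open the same file later also get WAL behavior without setting the pragma again.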

RocksDB and LSM Trees

RocksDB combines WAL with a log-structured merge-tree (LSM). Writes go to WAL first (durability), then into an in-memory MemTable (typically a skip-list). When the MemTable reaches about 64 MB, it flushes to disk as an immutable SST file. Background compaction merges SST files across levels.

If the process crashes before the MemTable flushes, the WAL reconstructs it on restart. Without WAL, everything in memory since the last flush would be gone. Everything. The MemTable is the part of your data that lives in a state of uncomfortable impermanence.
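The WAL plus MemTable interplay can be sketched as a toy key-value store. This is not RocksDB's API or file format: `TinyLSM`, the JSON encoding, and the three-entry flush threshold are assumptions for illustration, and compaction and torn-write handling are omitted.

```python
import json
import os
import tempfile


class TinyLSM:
    """Toy LSM write path: WAL first, then MemTable, flush to an SST-like file."""

    def __init__(self, directory, memtable_limit=3):
        self.dir = directory
        self.wal_path = os.path.join(directory, "wal.log")
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sst_count = len([n for n in os.listdir(directory) if n.endswith(".sst")])
        if os.path.exists(self.wal_path):      # crash recovery: rebuild the MemTable
            with open(self.wal_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.memtable[key] = value
        self.wal = open(self.wal_path, "a")

    def put(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")  # durability first
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.memtable[key] = value                       # then the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.dir, f"{self.sst_count:06d}.sst")
        with open(path, "w") as f:                       # immutable sorted file
            json.dump(sorted(self.memtable.items()), f)
            f.flush()
            os.fsync(f.fileno())
        self.sst_count += 1
        self.memtable.clear()
        self.wal.close()
        self.wal = open(self.wal_path, "w")              # old WAL no longer needed

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for i in reversed(range(self.sst_count)):        # newest file wins
            with open(os.path.join(self.dir, f"{i:06d}.sst")) as f:
                table = dict(json.load(f))
            if key in table:
                return table[key]
        return None
```

Opening a new `TinyLSM` on the same directory after a simulated crash replays wal.log into a fresh MemTable, which is the recovery step described above.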

Kafka's storage model follows similar intuitions. Each partition is an append-only log divided into segments. New messages append to the active segment. It's not "WAL" in the database sense, but the design principle is identical: sequential appends are fast, and the log is the source of truth. For more on how message queues are designed, see the distributed message queue system design guide.

Replication Comes Almost for Free

The log already contains a complete description of every change in the right order. A replica only needs to consume and replay it. WAL enables replication without a separate synchronization mechanism. The engineers who figured this out saved everyone who came after them from building a separate change-tracking system and then somehow keeping it in sync with the actual data.

PostgreSQL streaming replication works exactly this way. The replica receives log records from the primary in real time, applies them, and stays synchronized. Replication slots ensure the primary retains WAL segments until the replica catches up.
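A stripped-down model of log shipping with slots looks like this. `Primary`, `Replica`, and the integer LSNs are hypothetical stand-ins, not PostgreSQL's replication protocol.

```python
class Primary:
    """Toy primary: keeps a WAL and one confirmed LSN per replication slot."""

    def __init__(self):
        self.wal = []       # (lsn, key, value)
        self.next_lsn = 1
        self.slots = {}     # slot name -> last LSN the replica confirmed

    def write(self, key, value):
        self.wal.append((self.next_lsn, key, value))
        self.next_lsn += 1

    def stream(self, slot):
        confirmed = self.slots.setdefault(slot, 0)
        return [r for r in self.wal if r[0] > confirmed]

    def confirm(self, slot, lsn):
        self.slots[slot] = lsn

    def recycle(self):
        # Only WAL that every slot has confirmed is safe to delete.
        keep_after = min(self.slots.values(), default=self.next_lsn - 1)
        self.wal = [r for r in self.wal if r[0] > keep_after]


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied_lsn = 0

    def catch_up(self, primary):
        for lsn, key, value in primary.stream(self.name):
            self.data[key] = value   # replay is the whole replication mechanism
            self.applied_lsn = lsn
        primary.confirm(self.name, self.applied_lsn)
```

The same `min` over slots that keeps replicas safe is also what lets one abandoned slot pin WAL on the primary indefinitely.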

It also enables point-in-time recovery (PITR). Archive your WAL segments to object storage. Restore a base backup, then replay WAL segments up to any target timestamp. This is standard practice for production databases handling financial data. For a concrete example of how this matters in practice, see the payment system design walkthrough.
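Point-in-time recovery is the same replay loop with a stopping condition. In this sketch the record shape `(timestamp, key, value)` and the function name are invented, and the archive is assumed to be ordered by timestamp.

```python
def restore_to(base_backup, archived_wal, target_time):
    """Replay archived WAL on top of a base backup, stopping at target_time."""
    state = dict(base_backup)          # start from the restored backup
    for timestamp, key, value in archived_wal:
        if timestamp > target_time:    # stop before the first record past the target
            break
        state[key] = value
    return state


backup = {"balance": 100}
archive = [
    (10, "balance", 90),
    (20, "balance", 0),    # the bad deploy
    (30, "balance", 50),
]
just_before_incident = restore_to(backup, archive, target_time=15)
```

Here `just_before_incident` holds the state as of time 15, which includes the first record but not the one written at time 20.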

The Tradeoffs (This Is What the Interview Is Testing)

Write amplification. Every change is written twice: once to the log, once to the data page. On a single disk, sequential log writes and random page writes compete for the same I/O bandwidth. The standard mitigation is a dedicated disk for WAL, separating the two I/O patterns. Yes, this means the correct answer to "how do you make your database faster" sometimes literally is "buy another disk."

Recovery time scales with log size. The system replays everything from the last checkpoint. A long checkpoint interval means slower recovery. Frequent checkpoints keep recovery fast but add write overhead because you are flushing dirty pages more often. The interval is a tuning parameter, not a fixed property.
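A rough way to reason about the interval, assuming recovery is dominated by replay and that replay throughput is roughly constant. The function name and the numbers are hypothetical, not measurements:

```python
def estimated_recovery_seconds(wal_bytes_per_sec, checkpoint_interval_sec,
                               replay_bytes_per_sec):
    """Worst case: the crash happens just before the next checkpoint."""
    worst_case_wal_bytes = wal_bytes_per_sec * checkpoint_interval_sec
    return worst_case_wal_bytes / replay_bytes_per_sec


# Hypothetical workload: 20 MB/s of WAL, replayed at 100 MB/s.
five_minute_interval = estimated_recovery_seconds(20_000_000, 300, 100_000_000)
thirty_minute_interval = estimated_recovery_seconds(20_000_000, 1800, 100_000_000)
```

Under these made-up numbers, a 5-minute interval means about a minute of replay and a 30-minute interval means about six. The interview-ready point is the linear relationship, not the specific figures.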

WAL bloat. If archiving or replication falls behind, WAL segments accumulate. PostgreSQL will PANIC (that is the actual log severity level, in all caps, because it warrants it) and shut down if pg_wal/ fills the filesystem. Replication slots that go unconsumed silently prevent old segments from being cleaned up. The fix is max_slot_wal_keep_size, which caps how much WAL a slot can hold back, but you need to know the problem exists before you can configure the fix.

The deal WAL makes: pay a sequential write on every transaction, gain crash safety and replication capability. For most systems handling any important state, that's an obviously good deal. For write-heavy systems where durability requirements are weaker (a cache layer, a leaderboard), you might skip WAL entirely and accept that a crash loses recent writes. Compare with distributed cache design, where in-memory stores like Redis offer optional persistence precisely because not all caches need full WAL-backed durability.

Write-Ahead Logging in a System Design Interview: How to Frame It

Interviewers awarding strong scores are not looking for a recitation of ARIES phases. They want to see you reason about the durability boundary in your design. There's a big difference between "I would use PostgreSQL" and "I would use PostgreSQL because WAL-backed crash recovery gives me atomic transactions even if the instance dies mid-write."

When you are designing a payment system: "For the ledger writes, we need a WAL-backed database. A crash mid-transaction should never result in partial state. The WAL guarantees atomicity on recovery."

When you are discussing replication: "Streaming replication works by shipping WAL records to replicas. The primary does not need to do anything extra. Replicas replay the log."

When you are asked about recovery time: "Recovery time depends on how much WAL needs to be replayed from the last checkpoint. Frequent checkpoints reduce recovery time at the cost of more write I/O during normal operation."

The signal you are sending: you understand why the system behaves the way it does under failure, not just what the happy path looks like. Interviewers have heard "I would use Postgres" approximately ten thousand times. They have heard "the WAL gives us free replication because replicas just consume and replay the log" maybe twice.

Practice making these arguments out loud. The gap between knowing WAL and explaining it clearly under interview conditions is wider than it looks. SpaceComplexity simulates these system design rounds with voice-based interviews and rubric-backed feedback so you can build that fluency before the real thing.

What You Should Be Able to Say Cold

WAL writes a log record to a sequential file before updating the data page. Durability comes from flushing the log, not the page.
LSN is a monotonically increasing byte offset in the WAL stream. Pages store the LSN of their last modification. Recovery uses this to determine what needs replaying.
Checkpoints flush dirty pages and record a safe replay point. Recovery only replays from the last checkpoint forward.
ARIES: Analysis (reconstruct state), Redo (replay forward), Undo (reverse uncommitted transactions).
WAL enables streaming replication because replicas just consume and replay the log.
Tradeoffs: write amplification, recovery time proportional to unreplayed log length, WAL bloat when archiving or replication falls behind.