Skip to main content
python database internals building a persistent engine from scratch

Durability and Transactions

5 min read Chapter 16 of 21

Durability and the Write-Ahead Log

Our database engine can store millions of rows in a B-Tree, find any key in three page reads, and persist data to disk through the Pager. But it has a fatal flaw: it is not crash-safe.

Consider what happens during a leaf node split. The operation modifies three pages: the original leaf (truncated), the new sibling (created), and the parent internal node (updated with the promoted key). If the process crashes after writing two of the three pages, the tree is corrupted. Keys may be lost or duplicated, and the parent may point to an uninitialized page.

Even a single-page write is unsafe. os.write() transfers data to the kernel’s page cache, not directly to disk. The kernel may reorder writes, and a power failure can leave a page half-written — 4096 bytes on disk where the first 2048 are new data and the last 2048 are old data. The result is a torn page.

The ACID Properties

Database systems guarantee four properties, collectively known as ACID:

PropertyMeaningOur status
AtomicityAll operations in a transaction succeed or none doNot implemented
ConsistencyThe database transitions between valid states onlyEnforced by B-Tree invariants
IsolationConcurrent transactions don’t interfereNot applicable (single-writer)
DurabilityCommitted data survives crashesNot implemented

We focus on durability in this chapter. The mechanism that provides it is the Write-Ahead Log (WAL).

The WAL Principle

The WAL protocol is built on one rule:

Before modifying any page in the database file, write a log record describing the change to a separate log file, and ensure that log record is on stable storage.

The acronym says it all: write the log ahead of the data. If the system crashes:

  • If the log record was written, we can replay it to reconstruct the intended state.
  • If the log record was not written, the change was never applied, and the database file is in its previous consistent state.

Either way, the database is recoverable.

WAL Architecture

Our WAL is a separate file (e.g., mydb.wal) sitting alongside the database file (mydb.db). It stores a sequence of log records, each describing a page write:

Database file: mydb.db         WAL file: mydb.wal
┌─────────────────────┐       ┌────────────────────────────┐
│ Page 0 (4096 bytes) │       │ Record 0: [page 2, 4096B data] │
│ Page 1 (4096 bytes) │       │ Record 1: [page 0, 4096B data] │
│ Page 2 (4096 bytes) │       │ Record 2: [page 1, 4096B data] │
│ ...                 │       │ COMMIT marker                  │
└─────────────────────┘       └────────────────────────────┘

The lifecycle of a write under WAL:

  1. Modify page in memory (the Pager’s in-memory cache).
  2. Append a log record to the WAL file containing the page number and entire page content.
  3. fsync the WAL file — the log record is now durable.
  4. Repeat steps 1–3 for all pages in the transaction.
  5. Write a COMMIT marker to the WAL and fsync again.
  6. Checkpoint: copy the logged pages to the database file and truncate the WAL.

Steps 1–5 happen during the transaction. Step 6 can happen later, when convenient. If we crash between steps 3 and 5, the log contains uncommitted records that are discarded during recovery. If we crash after step 5 but before step 6, recovery replays the committed records.

Why Full-Page Logging?

Our log records store the entire page (4096 bytes), not just the changed bytes. This is called full-page logging or physiological logging. It is wasteful in terms of log space — changing one byte in a page still logs 4096 bytes — but it has two advantages:

  1. Torn page immunity. If the database file has a torn page (partially written), the WAL contains the complete correct page. We overwrite the torn page with the logged copy.

  2. Simplicity. We don’t need to track which bytes within a page changed. The log record is self-contained: “page N should look like this.”

Production databases like PostgreSQL use a hybrid approach: full-page images for the first write after a checkpoint, and byte-level diffs thereafter. For our engine, full-page logging is sufficient and correct.

The fsync Boundary

os.fsync(fd) asks the operating system to flush all buffered writes for file descriptor fd to stable storage. Until fsync returns, data written via os.write may reside only in the kernel’s page cache — lost on power failure.

Our WAL inserts fsync at two critical points:

  1. After writing log records — ensures the log is durable before we return to the caller.
  2. After writing the COMMIT marker — ensures the commit is durable.

We do not fsync the database file after every write. That would negate the performance benefit of the WAL. Instead, we fsync the database file only during checkpoint, after all logged pages have been copied.

What This Chapter Covers

SectionTopic
CH6-S1The log record format, WAL class, hooking into the Pager
CH6-S2Crash recovery: detecting dirty shutdowns, replaying the WAL

The next section defines the log record format and integrates WAL logging into the Pager’s write path.