Durability and Transactions
Durability and the Write-Ahead Log
Our database engine can store millions of rows in a B-Tree, find any key in three page reads, and persist data to disk through the Pager. But it has a fatal flaw: it is not crash-safe.
Consider what happens during a leaf node split. The operation modifies three pages: the original leaf (truncated), the new sibling (created), and the parent internal node (updated with the promoted key). If the process crashes after writing two of the three pages, the tree is corrupted. Keys may be lost or duplicated, and the parent may point to an uninitialized page.
Even a single-page write is unsafe. os.write() transfers data to the kernel’s page cache, not directly to disk. The kernel may reorder writes, and a power failure can leave a page half-written — 4096 bytes on disk where the first 2048 are new data and the last 2048 are old data. The result is a torn page.
The ACID Properties
Database systems guarantee four properties, collectively known as ACID:
| Property | Meaning | Our status |
|---|---|---|
| Atomicity | All operations in a transaction succeed or none do | Not implemented |
| Consistency | The database transitions between valid states only | Enforced by B-Tree invariants |
| Isolation | Concurrent transactions don’t interfere | Not applicable (single-writer) |
| Durability | Committed data survives crashes | Not implemented |
We focus on durability in this chapter. The mechanism that provides it is the Write-Ahead Log (WAL).
The WAL Principle
The WAL protocol is built on one rule:
Before modifying any page in the database file, write a log record describing the change to a separate log file, and ensure that log record is on stable storage.
The acronym says it all: write the log ahead of the data. If the system crashes:
- If the log record was written, we can replay it to reconstruct the intended state.
- If the log record was not written, the change was never applied, and the database file is in its previous consistent state.
Either way, the database is recoverable.
WAL Architecture
Our WAL is a separate file (e.g., mydb.wal) sitting alongside the database file (mydb.db). It stores a sequence of log records, each describing a page write:
Database file: mydb.db WAL file: mydb.wal
┌─────────────────────┐ ┌────────────────────────────┐
│ Page 0 (4096 bytes) │ │ Record 0: [page 2, 4096B data] │
│ Page 1 (4096 bytes) │ │ Record 1: [page 0, 4096B data] │
│ Page 2 (4096 bytes) │ │ Record 2: [page 1, 4096B data] │
│ ... │ │ COMMIT marker │
└─────────────────────┘ └────────────────────────────┘
The lifecycle of a write under WAL:
- Modify page in memory (the Pager’s in-memory cache).
- Append a log record to the WAL file containing the page number and entire page content.
fsyncthe WAL file — the log record is now durable.- Repeat steps 1–3 for all pages in the transaction.
- Write a COMMIT marker to the WAL and
fsyncagain. - Checkpoint: copy the logged pages to the database file and truncate the WAL.
Steps 1–5 happen during the transaction. Step 6 can happen later, when convenient. If we crash between steps 3 and 5, the log contains uncommitted records that are discarded during recovery. If we crash after step 5 but before step 6, recovery replays the committed records.
Why Full-Page Logging?
Our log records store the entire page (4096 bytes), not just the changed bytes. This is called full-page logging or physiological logging. It is wasteful in terms of log space — changing one byte in a page still logs 4096 bytes — but it has two advantages:
-
Torn page immunity. If the database file has a torn page (partially written), the WAL contains the complete correct page. We overwrite the torn page with the logged copy.
-
Simplicity. We don’t need to track which bytes within a page changed. The log record is self-contained: “page N should look like this.”
Production databases like PostgreSQL use a hybrid approach: full-page images for the first write after a checkpoint, and byte-level diffs thereafter. For our engine, full-page logging is sufficient and correct.
The fsync Boundary
os.fsync(fd) asks the operating system to flush all buffered writes for file descriptor fd to stable storage. Until fsync returns, data written via os.write may reside only in the kernel’s page cache — lost on power failure.
Our WAL inserts fsync at two critical points:
- After writing log records — ensures the log is durable before we return to the caller.
- After writing the COMMIT marker — ensures the commit is durable.
We do not fsync the database file after every write. That would negate the performance benefit of the WAL. Instead, we fsync the database file only during checkpoint, after all logged pages have been copied.
What This Chapter Covers
| Section | Topic |
|---|---|
| CH6-S1 | The log record format, WAL class, hooking into the Pager |
| CH6-S2 | Crash recovery: detecting dirty shutdowns, replaying the WAL |
The next section defines the log record format and integrates WAL logging into the Pager’s write path.