Writing the Log
Writing the Log
The WAL file is an append-only sequence of log records. Each record describes a page write. A special COMMIT record marks the end of a transaction. The WAL class manages this file and integrates with the Pager so that every page modification is logged before it reaches the database file.
Log Record Format
Each log record has a fixed-size header followed by the page data:
┌────────────────────────────────────────────────────┐
│ Record Type (1B) │ Page Num (4B) │ Checksum (4B) │
├────────────────────────────────────────────────────┤
│ Page Data (4096 bytes) │
└────────────────────────────────────────────────────┘
| Field | Format | Size | Description |
|---|---|---|---|
| Record Type | B | 1 byte | 1 = PAGE_WRITE, 2 = COMMIT |
| Page Number | <I | 4 bytes | Which page this record applies to |
| Checksum | <I | 4 bytes | CRC32 of the page data |
| Page Data | raw bytes | 4096 bytes | Full page content (omitted for COMMIT records) |
The header is 9 bytes. A PAGE_WRITE record is 9 + 4096 = 4105 bytes. A COMMIT record is just the 9-byte header (with page_num = 0 and checksum = 0).
The checksum detects corruption in the WAL file itself. If the system crashes while writing a log record, the partially written record will have an incorrect checksum and will be discarded during recovery.
Constants and Record Types
# wal.py
import os
import struct
import zlib
from typing import Final
from constants import PAGE_SIZE
WAL_RECORD_HEADER_FORMAT: Final[str] = "<B I I" # type, page_num, checksum
WAL_RECORD_HEADER_SIZE: Final[int] = struct.calcsize(WAL_RECORD_HEADER_FORMAT) # 9
RECORD_PAGE_WRITE: Final[int] = 1
RECORD_COMMIT: Final[int] = 2
The WAL Class
class WAL:
"""Write-Ahead Log for crash-safe page writes."""
def __init__(self, wal_path: str) -> None:
self.wal_path = wal_path
self.fd = os.open(
wal_path,
os.O_RDWR | os.O_CREAT | os.O_APPEND,
0o644,
)
def log_page_write(self, page_num: int, page_data: bytes) -> None:
"""Append a PAGE_WRITE record to the WAL."""
assert len(page_data) == PAGE_SIZE
checksum = zlib.crc32(page_data) & 0xFFFFFFFF
header = struct.pack(
WAL_RECORD_HEADER_FORMAT,
RECORD_PAGE_WRITE,
page_num,
checksum,
)
os.write(self.fd, header + page_data)
def log_commit(self) -> None:
"""Append a COMMIT record and fsync."""
header = struct.pack(
WAL_RECORD_HEADER_FORMAT,
RECORD_COMMIT,
0, # page_num unused
0, # checksum unused
)
os.write(self.fd, header)
os.fsync(self.fd) # <-- THE critical fsync
def sync(self) -> None:
"""Force all WAL writes to stable storage."""
os.fsync(self.fd)
def close(self) -> None:
"""Close the WAL file descriptor."""
os.close(self.fd)
def truncate(self) -> None:
"""Clear the WAL after a successful checkpoint."""
os.ftruncate(self.fd, 0)
os.lseek(self.fd, 0, os.SEEK_SET)
os.fsync(self.fd)
Design notes:
os.O_APPENDensures everyos.writeatomically appends. Even if multiple threads write (which we don’t do), they won’t interleave.log_commitis the only place we fsync page write records. We batch-sync: all PAGE_WRITE records from a transaction are written first, then the COMMIT record + fsync flushes everything. This minimizes the number of fsync calls (which are expensive — typically 2–10ms each on SSD).truncateis called after checkpoint. It resets the WAL to zero length and fsyncs to ensure the truncation itself is durable.
Hooking into the Pager
The Pager must log every page write before flushing to the database file. We modify the Pager to accept a WAL instance:
class Pager:
"""Page cache and disk I/O manager, now with WAL support."""
def __init__(self, db_path: str) -> None:
self.db_path = db_path
self.fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
file_length = os.lseek(self.fd, 0, os.SEEK_END)
self.num_pages = file_length // PAGE_SIZE
self.pages: dict[int, memoryview] = {}
# Initialize WAL
self.wal = WAL(db_path + ".wal")
def get_page(self, page_num: int) -> memoryview:
"""Return the in-memory page, loading from disk if needed."""
if page_num not in self.pages:
buf = bytearray(PAGE_SIZE)
if page_num < self.num_pages:
os.lseek(self.fd, page_num * PAGE_SIZE, os.SEEK_SET)
data = os.read(self.fd, PAGE_SIZE)
buf[: len(data)] = data
self.pages[page_num] = memoryview(buf)
return self.pages[page_num]
def flush(self, page_num: int) -> None:
"""Write a page to the WAL, then to the database file."""
if page_num not in self.pages:
return
page_data = bytes(self.pages[page_num])
# Step 1: Log the page write
self.wal.log_page_write(page_num, page_data)
# Step 2: Write to the database file
os.lseek(self.fd, page_num * PAGE_SIZE, os.SEEK_SET)
os.write(self.fd, page_data)
# Update page count if needed
if page_num >= self.num_pages:
self.num_pages = page_num + 1
def commit(self) -> None:
"""Flush all dirty pages and write a COMMIT record."""
for page_num in list(self.pages.keys()):
self.flush(page_num)
self.wal.log_commit()
def checkpoint(self) -> None:
"""Fsync the database file and truncate the WAL."""
os.fsync(self.fd)
self.wal.truncate()
def get_unused_page_num(self) -> int:
"""Return the next available page number."""
page_num = self.num_pages
self.num_pages += 1
return page_num
def close(self) -> None:
"""Commit, checkpoint, and close all file descriptors."""
self.commit()
self.checkpoint()
self.wal.close()
os.close(self.fd)
The commit protocol has three phases:
flusheach dirty page: log to WAL, then write to database file.log_commit+fsync: the WAL now contains a complete committed transaction.checkpoint: fsync the database file, truncate the WAL.
If we crash during phase 1, the WAL has no COMMIT record — recovery discards the partial records. If we crash during phase 3, recovery replays the committed WAL records, bringing the database file up to date.
Correctness Argument
The WAL provides durability because:
- The WAL is fsynced before we report success. When
commit()returns, the COMMIT record is on stable storage. - The WAL is self-validating. Each record has a CRC32 checksum. Truncated or corrupted records are detectable.
- Recovery is idempotent. Replaying a WAL record that was already applied to the database file produces the same state (we write the full page, not a diff).
- The database file is never in an inconsistent state. If the WAL exists at startup, we replay it. If it doesn’t, the database is already consistent.
The next section implements the recovery procedure that runs when the database opens and finds an existing WAL file.