Skip to main content
python database internals building a persistent engine from scratch

Writing the Log

5 min read Chapter 17 of 21

Writing the Log

The WAL file is an append-only sequence of log records. Each record describes a page write. A special COMMIT record marks the end of a transaction. The WAL class manages this file and integrates with the Pager so that every page modification is logged before it reaches the database file.

Log Record Format

Each log record has a fixed-size header followed by the page data:

┌────────────────────────────────────────────────────┐
│ Record Type (1B) │ Page Num (4B) │ Checksum (4B)   │
├────────────────────────────────────────────────────┤
│ Page Data (4096 bytes)                             │
└────────────────────────────────────────────────────┘
FieldFormatSizeDescription
Record TypeB1 byte1 = PAGE_WRITE, 2 = COMMIT
Page Number<I4 bytesWhich page this record applies to
Checksum<I4 bytesCRC32 of the page data
Page Dataraw bytes4096 bytesFull page content (omitted for COMMIT records)

The header is 9 bytes. A PAGE_WRITE record is 9 + 4096 = 4105 bytes. A COMMIT record is just the 9-byte header (with page_num = 0 and checksum = 0).

The checksum detects corruption in the WAL file itself. If the system crashes while writing a log record, the partially written record will have an incorrect checksum and will be discarded during recovery.

Constants and Record Types

# wal.py
import os
import struct
import zlib
from typing import Final
from constants import PAGE_SIZE

WAL_RECORD_HEADER_FORMAT: Final[str] = "<B I I"  # type, page_num, checksum
WAL_RECORD_HEADER_SIZE: Final[int] = struct.calcsize(WAL_RECORD_HEADER_FORMAT)  # 9

RECORD_PAGE_WRITE: Final[int] = 1
RECORD_COMMIT: Final[int] = 2

The WAL Class

class WAL:
    """Write-Ahead Log for crash-safe page writes."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.fd = os.open(
            wal_path,
            os.O_RDWR | os.O_CREAT | os.O_APPEND,
            0o644,
        )

    def log_page_write(self, page_num: int, page_data: bytes) -> None:
        """Append a PAGE_WRITE record to the WAL."""
        assert len(page_data) == PAGE_SIZE
        checksum = zlib.crc32(page_data) & 0xFFFFFFFF
        header = struct.pack(
            WAL_RECORD_HEADER_FORMAT,
            RECORD_PAGE_WRITE,
            page_num,
            checksum,
        )
        os.write(self.fd, header + page_data)

    def log_commit(self) -> None:
        """Append a COMMIT record and fsync."""
        header = struct.pack(
            WAL_RECORD_HEADER_FORMAT,
            RECORD_COMMIT,
            0,  # page_num unused
            0,  # checksum unused
        )
        os.write(self.fd, header)
        os.fsync(self.fd)  # <-- THE critical fsync

    def sync(self) -> None:
        """Force all WAL writes to stable storage."""
        os.fsync(self.fd)

    def close(self) -> None:
        """Close the WAL file descriptor."""
        os.close(self.fd)

    def truncate(self) -> None:
        """Clear the WAL after a successful checkpoint."""
        os.ftruncate(self.fd, 0)
        os.lseek(self.fd, 0, os.SEEK_SET)
        os.fsync(self.fd)

Design notes:

  • os.O_APPEND ensures every os.write atomically appends. Even if multiple threads write (which we don’t do), they won’t interleave.
  • log_commit is the only place we fsync page write records. We batch-sync: all PAGE_WRITE records from a transaction are written first, then the COMMIT record + fsync flushes everything. This minimizes the number of fsync calls (which are expensive — typically 2–10ms each on SSD).
  • truncate is called after checkpoint. It resets the WAL to zero length and fsyncs to ensure the truncation itself is durable.

Hooking into the Pager

The Pager must log every page write before flushing to the database file. We modify the Pager to accept a WAL instance:

class Pager:
    """Page cache and disk I/O manager, now with WAL support."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
        self.fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
        file_length = os.lseek(self.fd, 0, os.SEEK_END)
        self.num_pages = file_length // PAGE_SIZE
        self.pages: dict[int, memoryview] = {}

        # Initialize WAL
        self.wal = WAL(db_path + ".wal")

    def get_page(self, page_num: int) -> memoryview:
        """Return the in-memory page, loading from disk if needed."""
        if page_num not in self.pages:
            buf = bytearray(PAGE_SIZE)
            if page_num < self.num_pages:
                os.lseek(self.fd, page_num * PAGE_SIZE, os.SEEK_SET)
                data = os.read(self.fd, PAGE_SIZE)
                buf[: len(data)] = data
            self.pages[page_num] = memoryview(buf)
        return self.pages[page_num]

    def flush(self, page_num: int) -> None:
        """Write a page to the WAL, then to the database file."""
        if page_num not in self.pages:
            return

        page_data = bytes(self.pages[page_num])

        # Step 1: Log the page write
        self.wal.log_page_write(page_num, page_data)

        # Step 2: Write to the database file
        os.lseek(self.fd, page_num * PAGE_SIZE, os.SEEK_SET)
        os.write(self.fd, page_data)

        # Update page count if needed
        if page_num >= self.num_pages:
            self.num_pages = page_num + 1

    def commit(self) -> None:
        """Flush all dirty pages and write a COMMIT record."""
        for page_num in list(self.pages.keys()):
            self.flush(page_num)
        self.wal.log_commit()

    def checkpoint(self) -> None:
        """Fsync the database file and truncate the WAL."""
        os.fsync(self.fd)
        self.wal.truncate()

    def get_unused_page_num(self) -> int:
        """Return the next available page number."""
        page_num = self.num_pages
        self.num_pages += 1
        return page_num

    def close(self) -> None:
        """Commit, checkpoint, and close all file descriptors."""
        self.commit()
        self.checkpoint()
        self.wal.close()
        os.close(self.fd)

The commit protocol has three phases:

  1. flush each dirty page: log to WAL, then write to database file.
  2. log_commit + fsync: the WAL now contains a complete committed transaction.
  3. checkpoint: fsync the database file, truncate the WAL.

If we crash during phase 1, the WAL has no COMMIT record — recovery discards the partial records. If we crash during phase 3, recovery replays the committed WAL records, bringing the database file up to date.

Correctness Argument

The WAL provides durability because:

  1. The WAL is fsynced before we report success. When commit() returns, the COMMIT record is on stable storage.
  2. The WAL is self-validating. Each record has a CRC32 checksum. Truncated or corrupted records are detectable.
  3. Recovery is idempotent. Replaying a WAL record that was already applied to the database file produces the same state (we write the full page, not a diff).
  4. The database file is never in an inconsistent state. If the WAL exists at startup, we replay it. If it doesn’t, the database is already consistent.

The next section implements the recovery procedure that runs when the database opens and finds an existing WAL file.