python database internals building a persistent engine from scratch

The Storage Engine: Pages and the Pager

Chapter 4 of 21
Summary

The Pager manages fixed-size pages for disk storage. Each page is 4 KB (4096 bytes), and the Pager treats its contents as opaque. The Pager class tracks the filename, a file descriptor, the file length, the page count, and a list of in-memory page buffers. Its methods are open (creating the file if it does not exist and rejecting files whose length is not a multiple of PAGE_SIZE), get_page (returning a memoryview, reading from disk on first access), flush (writing exactly PAGE_SIZE bytes), and close (flushing every cached page before closing the file). The implementation uses low-level os calls for direct control over file descriptors and seek positions.

The Storage Engine: Pages and the Pager

We have a row format (36 bytes, fixed-width) and a page size (4096 bytes). Now we need the component that sits between the B-Tree and the operating system: the Pager. Its contract is straightforward — given a page number, return that page’s data in memory. Given dirty page data, write it back to disk. Everything else in the database treats the Pager as an infinite array of 4 KB slots.

This chapter builds the Pager class from the ground up: opening files, reading pages, writing pages, and managing the page count. We deliberately leave out cache eviction and dirty-page tracking for now: once a page is loaded, it stays in memory until close. The goal is a correct, minimal implementation that we can layer optimizations onto later.

What the Pager Must Do

The Pager owns the database file. It is the only module that calls os.open(), os.read(), os.write(), and os.lseek(). Every other component (Table, Cursor, B-Tree, VM) interacts with data exclusively through the Pager.

Its responsibilities:

  1. Open the database file in binary read-write mode, creating it if it does not exist.
  2. Track the page count based on the file’s byte length.
  3. Read a page from disk given its page number, returning a memoryview.
  4. Write a page back to disk, ensuring exactly PAGE_SIZE bytes are flushed.
  5. Close the file cleanly on shutdown.
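
The five responsibilities above can be sketched with raw os calls before wrapping them in a class. This is a minimal illustration, not the chapter's Pager: the function name demo_page_round_trip and the file name demo.db are hypothetical, and PAGE_SIZE is assumed to be 4096 as stated earlier.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumption: matches constants.PAGE_SIZE


def demo_page_round_trip(path: str) -> tuple[int, bytes]:
    """Write one page to `path`, read it back, return (page_count, page_bytes)."""
    # 1. Open read-write, creating the file if needed.
    #    O_BINARY exists only on Windows; getattr falls back to 0 elsewhere.
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    try:
        # 4. Write page 0 and insist on a full-page write.
        page = bytearray(PAGE_SIZE)
        page[:5] = b"hello"
        os.lseek(fd, 0 * PAGE_SIZE, os.SEEK_SET)
        written = os.write(fd, page)
        if written != PAGE_SIZE:
            raise OSError(f"Partial write: {written}/{PAGE_SIZE} bytes")

        # 2. Derive the page count from the file's byte length.
        file_length = os.lseek(fd, 0, os.SEEK_END)
        page_count = file_length // PAGE_SIZE

        # 3. Read page 0 back from its offset.
        os.lseek(fd, 0 * PAGE_SIZE, os.SEEK_SET)
        data = os.read(fd, PAGE_SIZE)
        return page_count, data
    finally:
        # 5. Close the descriptor even if something above failed.
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        count, data = demo_page_round_trip(os.path.join(tmp, "demo.db"))
        print(count, data[:5])
```

On a freshly created file this reports one page whose first five bytes are the ones written; the rest of the page is zero-filled.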

The Pager does not decide what goes inside a page. In the early chapters, pages hold sequential rows; once the B-Tree is introduced, they hold B-Tree nodes. The Pager does not care: it moves opaque 4 KB blocks.

The Pager Class

# pager.py
import os
from typing import Optional
from constants import PAGE_SIZE, MAX_PAGES

class Pager:
    """Manages fixed-size page I/O against a single database file."""

    def __init__(self, filename: str) -> None:
        self.filename: str = filename
        self.file_descriptor: Optional[int] = None
        self.file_length: int = 0
        self.num_pages: int = 0
        self.pages: list[Optional[bytearray]] = [None] * MAX_PAGES

    def open(self) -> None:
        """Open the database file in binary read/write mode.
        Creates the file if it does not exist.
        """
        # O_BINARY exists only on Windows; getattr falls back to 0 elsewhere.
        flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
        self.file_descriptor = os.open(self.filename, flags, 0o644)
        self.file_length = os.lseek(self.file_descriptor, 0, os.SEEK_END)

        if self.file_length % PAGE_SIZE != 0:
            raise IOError(
                f"Database file is not a whole number of pages. "
                f"File length: {self.file_length}, page size: {PAGE_SIZE}"
            )
        self.num_pages = self.file_length // PAGE_SIZE

    def get_page(self, page_num: int) -> memoryview:
        """Return a memoryview of the requested page.
        Reads from disk on first access; subsequent calls return
        the cached in-memory copy.
        """
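        # Reject negative numbers too: a negative index would silently
        # wrap around to the end of the pages list.
        if not 0 <= page_num < MAX_PAGES:
            raise ValueError(
                f"Page number {page_num} is outside the range 0..{MAX_PAGES - 1}"
            )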

        if self.pages[page_num] is None:
            # Allocate a fresh page buffer
            page = bytearray(PAGE_SIZE)

            # If the page already exists on disk, load it; otherwise the
            # buffer stays zero-filled.
            if page_num < self.num_pages:
                os.lseek(self.file_descriptor, page_num * PAGE_SIZE, os.SEEK_SET)
                bytes_read = os.read(self.file_descriptor, PAGE_SIZE)
                page[: len(bytes_read)] = bytes_read

            self.pages[page_num] = page

            # If we are creating a page beyond the current end, track it
            if page_num >= self.num_pages:
                self.num_pages = page_num + 1

        return memoryview(self.pages[page_num])

    def flush(self, page_num: int) -> None:
        """Write a single page to disk at the correct offset."""
        if self.pages[page_num] is None:
            raise ValueError(f"Tried to flush null page {page_num}")

        offset = page_num * PAGE_SIZE
        os.lseek(self.file_descriptor, offset, os.SEEK_SET)

        written = os.write(self.file_descriptor, self.pages[page_num])
        if written != PAGE_SIZE:
            raise IOError(
                f"Partial write on page {page_num}: "
                f"{written}/{PAGE_SIZE} bytes"
            )

    def close(self) -> None:
        """Flush all cached pages and close the file."""
        for i in range(self.num_pages):
            if self.pages[i] is not None:
                self.flush(i)
                self.pages[i] = None

        if self.file_descriptor is not None:
            os.close(self.file_descriptor)
            self.file_descriptor = None

Several things to note:

  • We use os.open / os.read / os.write instead of Python’s open(). The low-level POSIX calls give us direct control over file descriptors, flags, and seek positions. No internal Python buffering sits between us and the kernel.

  • get_page allocates lazily. The pages list starts as [None] * MAX_PAGES. We only allocate a 4 KB bytearray when someone actually asks for that page. This keeps memory usage proportional to the number of pages actually touched, not the maximum file size.

  • flush writes exactly PAGE_SIZE bytes. If the OS returns a short write (fewer bytes than requested), we raise immediately. A partial page on disk is corrupt data — there is no graceful recovery, so we fail loudly.

  • close flushes everything. On clean shutdown, every cached page gets written back. This is the coarse-grained durability path. The fine-grained path (WAL + fsync) comes in Chapter 6.
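
One consequence of returning a memoryview from get_page is worth seeing concretely: a memoryview over a bytearray is a window onto the same buffer, not a copy, so whatever a caller writes through the view is what flush later sends to disk. The sketch below shows this with a standalone buffer. The two-field layout packed with struct is a hypothetical example, not the book's row format.

```python
import struct

PAGE_SIZE = 4096  # assumption: matches constants.PAGE_SIZE

page = bytearray(PAGE_SIZE)   # what the Pager keeps in self.pages[n]
view = memoryview(page)       # what get_page hands to the caller

# Write a little-endian (uint32, uint32) pair at offset 0 through the view.
struct.pack_into("<II", view, 0, 7, 42)

# Slicing a memoryview yields another view, still without copying.
tail = view[8:16]
tail[:] = b"ABCDEFGH"

# Read the values back from the underlying bytearray, not the view.
row_id, value = struct.unpack_from("<II", page, 0)
print(row_id, value, bytes(page[8:16]))
```

Both writes land in the original bytearray, so the Pager never needs to copy data back from the caller before flushing.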

File Length Invariant

Notice the check in open():

if self.file_length % PAGE_SIZE != 0:
    raise IOError(...)

Our database file must always be an exact multiple of PAGE_SIZE. If it is not, something went wrong — a crash during a partial write, a file truncation, or external tampering. We refuse to open a corrupted file rather than silently misinterpreting page boundaries.

This invariant simplifies every offset calculation in the system: for any page_num < num_pages, page_num * PAGE_SIZE is the exact byte offset at which that page starts.
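
The invariant and the offset rule fit in two small functions. This is a sketch with hypothetical helper names (page_count_for, page_offset) that are not part of the Pager class; PAGE_SIZE is assumed to be 4096.

```python
PAGE_SIZE = 4096  # assumption: matches constants.PAGE_SIZE


def page_count_for(file_length: int) -> int:
    """Return the number of whole pages, rejecting lengths that break the invariant."""
    if file_length % PAGE_SIZE != 0:
        raise OSError(
            f"File length {file_length} is not a multiple of {PAGE_SIZE}"
        )
    return file_length // PAGE_SIZE


def page_offset(page_num: int, num_pages: int) -> int:
    """Return the byte offset at which page `page_num` starts."""
    if not 0 <= page_num < num_pages:
        raise ValueError(f"Page {page_num} is outside 0..{num_pages - 1}")
    return page_num * PAGE_SIZE
```

For example, page_count_for(8192) returns 2 and page_offset(1, 2) returns 4096, while page_count_for(5000) raises because 5000 is not a multiple of 4096.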

With the Pager in place, we can now build the Table abstraction — the layer that maps logical row numbers to physical page locations.