Skip to main content
python database internals building a persistent engine from scratch

Implementing the Pager

5 min read Chapter 5 of 21
Summary

This section details the implementation of the Pager...

This section details the implementation of the Pager class, which manages fixed-size pages for disk storage to optimize I/O operations. The core methods are get_page, retrieving a page by number and returning a memoryview with caching to reduce disk access, and flush_page, writing a cached page back to disk with error handling for partial writes to maintain consistency. Each page is 4KB (4096 bytes) with a layout consisting of a header using PAGE_HEADER_FORMAT ('<I I') for metadata and a data section for storage. The choice of page size involves trade-offs: 2KB pages have high I/O overhead but low memory usage, 4KB offers balanced performance and moderate memory, and 8KB reduces I/O frequency but increases memory footprint. The implementation uses Python 3.12+ with type hints, struct module for binary packing, and memoryview for efficient data manipulation. This enables efficient storage management crucial for database systems, with the pager caching pages in memory to minimize disk I/O and ensure data integrity.

Implementing the Pager

The Pager class from the previous section provides the full interface, but two methods deserve closer examination: get_page and flush. These are the hot path of the entire database — every read and every write passes through them. Getting them right means understanding cache behavior, partial-write hazards, and the contract between our code and the kernel.

get_page in Detail

Here is the method again, annotated step by step:

def get_page(self, page_num: int) -> memoryview:
    if page_num >= MAX_PAGES:
        raise ValueError(
            f"Page number {page_num} exceeds maximum {MAX_PAGES}"
        )

    if self.pages[page_num] is None:
        # 1. Allocate a zeroed buffer
        page = bytearray(PAGE_SIZE)

        # 2. If the page exists on disk, read it into the buffer
        if page_num < self.num_pages:
            os.lseek(self.file_descriptor, page_num * PAGE_SIZE, os.SEEK_SET)
            bytes_read = os.read(self.file_descriptor, PAGE_SIZE)
            page[: len(bytes_read)] = bytes_read

        # 3. Store in cache and track new pages
        self.pages[page_num] = page
        if page_num >= self.num_pages:
            self.num_pages = page_num + 1

    return memoryview(self.pages[page_num])

Step 1: Allocate a zeroed buffer. We always allocate PAGE_SIZE bytes, even if the page does not exist on disk yet. A brand-new page is all zeros — which, for a B-Tree leaf node, means “0 cells” in the header. This convention lets us append new pages without special initialization code.

Step 2: Read from disk. os.lseek moves the file cursor to the exact byte offset. os.read returns up to PAGE_SIZE bytes. If the file was truncated or corrupted, bytes_read might be shorter than expected — we copy whatever we got and leave the rest zeroed. A more robust implementation could check len(bytes_read) and raise on a short read from an existing page, but for now zeroing is safe.

Step 3: Cache and extend. The pages list acts as our page cache. On a cache hit (the if block is skipped), we return immediately — zero syscalls, zero copies. On a cache miss, we pay for one lseek + read but then never touch the disk again for this page until flush is called.

The return value is a memoryview wrapping the bytearray. This is important: the caller receives a view into our cache, not a copy. Any modifications the B-Tree makes to the returned memoryview are immediately visible in self.pages[page_num]. When we later call flush, we write those modifications to disk. No intermediate copy step is needed.

flush in Detail

def flush(self, page_num: int) -> None:
    if self.pages[page_num] is None:
        raise ValueError(f"Tried to flush null page {page_num}")

    offset = page_num * PAGE_SIZE
    os.lseek(self.file_descriptor, offset, os.SEEK_SET)

    written = os.write(self.file_descriptor, self.pages[page_num])
    if written != PAGE_SIZE:
        raise IOError(
            f"Partial write on page {page_num}: "
            f"{written}/{PAGE_SIZE} bytes"
        )

Why check for partial writes? The os.write syscall is not guaranteed to write all the bytes you pass it. On a full disk, a signal interruption, or certain NFS mounts, it can return fewer bytes than requested. A page that is only partially written to disk is indistinguishable from corruption — the next get_page will read a mix of old and new data. So we treat any short write as a hard error.

Why no fsync here? Notice that flush does not call os.fsync(). The data goes into the kernel’s write buffer but may not have reached the physical disk. This is intentional — calling fsync on every page write would destroy performance (a typical NVMe fsync costs 20–200 μs). In Chapter 6, we add a WAL that calls fsync at commit boundaries, giving us durability without per-page sync overhead.

The Close Protocol

def close(self) -> None:
    for i in range(self.num_pages):
        if self.pages[i] is not None:
            self.flush(i)
            self.pages[i] = None

    if self.file_descriptor is not None:
        os.close(self.file_descriptor)
        self.file_descriptor = None

On a clean shutdown, close iterates through every cached page and flushes it. Then it closes the file descriptor. After this call, all data that was modified in memory resides on disk (or at least in the kernel’s write buffer — a subsequent os.fsync before os.close would make it fully durable).

Error Handling Summary

ScenarioBehaviorRationale
Page number ≥ MAX_PAGESValueErrorPrevents unbounded memory allocation
Flush on a null pageValueErrorCaller logic error — nothing to write
Partial writeIOErrorCorrupt page on disk; fail fast
File not a whole number of pagesIOError in open()Refuse to interpret misaligned data
Page beyond current num_pagesAllocate zeroed buffer, extend countSupport appending new pages

This error strategy follows a simple principle: never silently produce corrupt data. A crash is recoverable (the WAL handles that). Silent corruption is not.

With get_page and flush proven out, we have a working Pager. The next step is building the Table class that maps logical row indices onto physical page slots.