Implementing the Pager

The Pager class from the previous section provides the full interface, but two methods deserve closer examination: get_page and flush. These are the hot path of the entire database — every read and every write passes through them. Getting them right means understanding cache behavior, partial-write hazards, and the contract between our code and the kernel.

`get_page` in Detail

Here is the method again, annotated step by step:

def get_page(self, page_num: int) -> memoryview:
    if page_num >= MAX_PAGES:
        raise ValueError(
            f"Page number {page_num} exceeds maximum {MAX_PAGES}"
        )

    if self.pages[page_num] is None:
        # 1. Allocate a zeroed buffer
        page = bytearray(PAGE_SIZE)

        # 2. If the page exists on disk, read it into the buffer
        if page_num < self.num_pages:
            os.lseek(self.file_descriptor, page_num * PAGE_SIZE, os.SEEK_SET)
            bytes_read = os.read(self.file_descriptor, PAGE_SIZE)
            page[: len(bytes_read)] = bytes_read

        # 3. Store in cache and track new pages
        self.pages[page_num] = page
        if page_num >= self.num_pages:
            self.num_pages = page_num + 1

    return memoryview(self.pages[page_num])

Step 1: Allocate a zeroed buffer. We always allocate PAGE_SIZE bytes, even if the page does not exist on disk yet. A brand-new page is all zeros — which, for a B-Tree leaf node, means “0 cells” in the header. This convention lets us append new pages without special initialization code.

Step 2: Read from disk. os.lseek moves the file cursor to the exact byte offset. os.read returns up to PAGE_SIZE bytes. If the file was truncated or corrupted, bytes_read might be shorter than expected — we copy whatever we got and leave the rest zeroed. A more robust implementation could check len(bytes_read) and raise on a short read from an existing page, but for now zeroing is safe.

Step 3: Cache and extend. The pages list acts as our page cache. On a cache hit (the if block is skipped), we return immediately — zero syscalls, zero copies. On a cache miss, we pay for one lseek + read but then never touch the disk again for this page until flush is called.

The return value is a memoryview wrapping the bytearray. This is important: the caller receives a view into our cache, not a copy. Any modifications the B-Tree makes to the returned memoryview are immediately visible in self.pages[page_num]. When we later call flush, we write those modifications to disk. No intermediate copy step is needed.

`flush` in Detail

def flush(self, page_num: int) -> None:
    if self.pages[page_num] is None:
        raise ValueError(f"Tried to flush null page {page_num}")

    offset = page_num * PAGE_SIZE
    os.lseek(self.file_descriptor, offset, os.SEEK_SET)

    written = os.write(self.file_descriptor, self.pages[page_num])
    if written != PAGE_SIZE:
        raise IOError(
            f"Partial write on page {page_num}: "
            f"{written}/{PAGE_SIZE} bytes"
        )

Why check for partial writes? The os.write syscall is not guaranteed to write all the bytes you pass it. On a full disk, a signal interruption, or certain NFS mounts, it can return fewer bytes than requested. A page that is only partially written to disk is indistinguishable from corruption — the next get_page will read a mix of old and new data. So we treat any short write as a hard error.

Why no fsync here? Notice that flush does not call os.fsync(). The data goes into the kernel’s write buffer but may not have reached the physical disk. This is intentional — calling fsync on every page write would destroy performance (a typical NVMe fsync costs 20–200 μs). In Chapter 6, we add a WAL that calls fsync at commit boundaries, giving us durability without per-page sync overhead.

The Close Protocol

def close(self) -> None:
    for i in range(self.num_pages):
        if self.pages[i] is not None:
            self.flush(i)
            self.pages[i] = None

    if self.file_descriptor is not None:
        os.close(self.file_descriptor)
        self.file_descriptor = None

On a clean shutdown, close iterates through every cached page and flushes it. Then it closes the file descriptor. After this call, all data that was modified in memory resides on disk (or at least in the kernel’s write buffer — a subsequent os.fsync before os.close would make it fully durable).

Error Handling Summary

Scenario	Behavior	Rationale
Page number ≥ `MAX_PAGES`	`ValueError`	Prevents unbounded memory allocation
Flush on a null page	`ValueError`	Caller logic error — nothing to write
Partial write	`IOError`	Corrupt page on disk; fail fast
File not a whole number of pages	`IOError` in `open()`	Refuse to interpret misaligned data
Page beyond current `num_pages`	Allocate zeroed buffer, extend count	Support appending new pages

This error strategy follows a simple principle: never silently produce corrupt data. A crash is recoverable (the WAL handles that). Silent corruption is not.

With get_page and flush proven out, we have a working Pager. The next step is building the Table class that maps logical row indices onto physical page slots.

get_page in Detail

flush in Detail

The Close Protocol

Error Handling Summary

`get_page` in Detail

`flush` in Detail