Skip to main content
python database internals building a persistent engine from scratch

The Table Abstraction

5 min read Chapter 6 of 21
Summary

The Table class interfaces the Pager with higher...

The Table class interfaces the Pager with higher database components to manage row storage, retrieval, and persistence. It calculates row addresses using get_row_address, which determines page numbers and byte offsets from row indices based on fixed sizes: page size 4096 bytes, header size 8 bytes, and row size 36 bytes. Data persistence is enabled by leveraging Pager methods for disk I/O. The page layout comprises a header section and data section with sequential row slots, optimizing access. Key terminology introduced includes Table (managing row operations) and Row Slot (fixed positions for rows in pages). The implementation emphasizes low-level logic with Python 3.12+ type hints and binary data manipulation via struct, though memoryview usage is implied but not fully demonstrated in the placeholder insert_row method.

The Table Abstraction

The Pager moves opaque 4 KB blocks. The Row class knows how to serialize 36 bytes. Neither one knows how to answer the question: “Where does row number 500 live?” That is the Table’s job.

The Table sits between the VM (which thinks in terms of row indices) and the Pager (which thinks in terms of page numbers). It translates a logical row index into a physical (page_num, byte_offset) pair, and it provides methods to insert and read rows through the Pager.

Row Slot Calculation

In this pre-B-Tree phase, rows are stored sequentially — row 0 first, then row 1, then row 2, and so on, packed tightly within each page. There is no per-page header yet (B-Tree node headers come in Chapter 4). Every byte of the page is available for row data.

Given PAGE_SIZE = 4096 and ROW_SIZE = 36:

rows_per_page = PAGE_SIZE // ROW_SIZE  # 4096 // 36 = 113 (with 28 bytes of waste)

For any row index i, the physical location is:

page_num      = i // rows_per_page
byte_offset   = (i % rows_per_page) * ROW_SIZE

Here is the mapping for the first few rows and the page boundary:

Row IndexPageByte OffsetNotes
000First row on first page
1036
2072
11204032Last row on first page (4032 + 36 = 4068 ≤ 4096)
11310First row on second page
114136

The 28 wasted bytes at the end of each page (4096 − 113 × 36 = 28) are an acceptable trade-off for the simplicity of fixed-size rows without spanning page boundaries. No row ever straddles two pages — if it did, a single page read could not return a complete row, and the Pager’s contract would break down.

The Table Class

# table.py
from row import Row
from pager import Pager
from constants import PAGE_SIZE

class Table:
    """Maps logical row indices to physical page locations.
    Stores rows sequentially (pre-B-Tree)."""

    ROWS_PER_PAGE: int = PAGE_SIZE // Row.ROW_SIZE  # 113

    def __init__(self, pager: Pager) -> None:
        self.pager: Pager = pager
        self.num_rows: int = 0

    def row_slot(self, row_num: int) -> tuple[memoryview, int]:
        """Return (page_view, byte_offset) for the given row number."""
        page_num = row_num // self.ROWS_PER_PAGE
        row_offset = (row_num % self.ROWS_PER_PAGE) * Row.ROW_SIZE
        page = self.pager.get_page(page_num)
        return page, row_offset

    def insert(self, row: Row) -> None:
        """Append a row at the next available slot."""
        page, offset = self.row_slot(self.num_rows)
        data = row.serialize()
        page[offset : offset + Row.ROW_SIZE] = data
        self.num_rows += 1

    def read(self, row_num: int) -> Row:
        """Read the row at the given index."""
        page, offset = self.row_slot(row_num)
        buf = page[offset : offset + Row.ROW_SIZE]
        return Row.deserialize(buf)

    def close(self) -> None:
        """Flush all pages through the Pager and close the file."""
        num_full_pages = self.num_rows // self.ROWS_PER_PAGE
        for i in range(num_full_pages + 1):
            if self.pager.pages[i] is not None:
                self.pager.flush(i)
        self.pager.close()

Key details:

  • row_slot returns a view and an offset, not a copy. The caller writes directly into the Pager’s cached page buffer. This is zero-copy insertion — the serialized bytes go from Row.serialize() straight into the page cache with a single slice assignment.

  • num_rows is tracked in memory. For now, we do not persist the row count to disk. When the program restarts, num_rows resets to zero. We will fix this when we move to B-Tree storage, where the tree structure implicitly encodes the row count.

  • close flushes only the pages that contain data. There is no point in writing an empty page to disk.

Persistence Across Restarts

Even without persisting num_rows, the data survives. Here is a demonstration:

# Session 1: insert rows
pager = Pager("test.db")
pager.open()
table = Table(pager)

table.insert(Row(1, "alice"))
table.insert(Row(2, "bob"))
table.insert(Row(3, "charlie"))

table.close()  # flushes pages to disk
# Session 2: read rows back (after restarting the program)
pager = Pager("test.db")
pager.open()

# Read raw page data — the rows are still there
page0 = pager.get_page(0)
row0 = Row.deserialize(page0[0:36])
row1 = Row.deserialize(page0[36:72])
row2 = Row.deserialize(page0[72:108])

print(row0)  # Row(id=1, username='alice')
print(row1)  # Row(id=2, username='bob')
print(row2)  # Row(id=3, username='charlie')

The Pager.close() call in Session 1 wrote the page to disk. Session 2’s get_page(0) read it back. The binary data is identical — struct.pack produced the same bytes, and struct.unpack recovered the same values. This is the fundamental promise of the storage engine: what you write is what you get back, even across process boundaries.

The limitation is obvious: without a stored row count, we cannot tell where valid data ends and zero-padding begins. And without sorted keys, every lookup requires scanning all rows. Both problems are solved in the next chapter when we introduce the Cursor and the Virtual Machine.