The Table Abstraction
SummaryThe Table class interfaces the Pager with higher...
The Table class interfaces the Pager with higher...
The Table class interfaces the Pager with higher database components to manage row storage, retrieval, and persistence. It calculates row addresses using get_row_address, which determines page numbers and byte offsets from row indices based on fixed sizes: page size 4096 bytes, header size 8 bytes, and row size 36 bytes. Data persistence is enabled by leveraging Pager methods for disk I/O. The page layout comprises a header section and data section with sequential row slots, optimizing access. Key terminology introduced includes Table (managing row operations) and Row Slot (fixed positions for rows in pages). The implementation emphasizes low-level logic with Python 3.12+ type hints and binary data manipulation via struct, though memoryview usage is implied but not fully demonstrated in the placeholder insert_row method.
The Table Abstraction
The Pager moves opaque 4 KB blocks. The Row class knows how to serialize 36 bytes. Neither one knows how to answer the question: “Where does row number 500 live?” That is the Table’s job.
The Table sits between the VM (which thinks in terms of row indices) and the Pager (which thinks in terms of page numbers). It translates a logical row index into a physical (page_num, byte_offset) pair, and it provides methods to insert and read rows through the Pager.
Row Slot Calculation
In this pre-B-Tree phase, rows are stored sequentially — row 0 first, then row 1, then row 2, and so on, packed tightly within each page. There is no per-page header yet (B-Tree node headers come in Chapter 4). Every byte of the page is available for row data.
Given PAGE_SIZE = 4096 and ROW_SIZE = 36:
rows_per_page = PAGE_SIZE // ROW_SIZE # 4096 // 36 = 113 (with 28 bytes of waste)
For any row index i, the physical location is:
page_num = i // rows_per_page
byte_offset = (i % rows_per_page) * ROW_SIZE
Here is the mapping for the first few rows and the page boundary:
| Row Index | Page | Byte Offset | Notes |
|---|---|---|---|
| 0 | 0 | 0 | First row on first page |
| 1 | 0 | 36 | |
| 2 | 0 | 72 | |
| 112 | 0 | 4032 | Last row on first page (4032 + 36 = 4068 ≤ 4096) |
| 113 | 1 | 0 | First row on second page |
| 114 | 1 | 36 |
The 28 wasted bytes at the end of each page (4096 − 113 × 36 = 28) are an acceptable trade-off for the simplicity of fixed-size rows without spanning page boundaries. No row ever straddles two pages — if it did, a single page read could not return a complete row, and the Pager’s contract would break down.
The Table Class
# table.py
from row import Row
from pager import Pager
from constants import PAGE_SIZE
class Table:
"""Maps logical row indices to physical page locations.
Stores rows sequentially (pre-B-Tree)."""
ROWS_PER_PAGE: int = PAGE_SIZE // Row.ROW_SIZE # 113
def __init__(self, pager: Pager) -> None:
self.pager: Pager = pager
self.num_rows: int = 0
def row_slot(self, row_num: int) -> tuple[memoryview, int]:
"""Return (page_view, byte_offset) for the given row number."""
page_num = row_num // self.ROWS_PER_PAGE
row_offset = (row_num % self.ROWS_PER_PAGE) * Row.ROW_SIZE
page = self.pager.get_page(page_num)
return page, row_offset
def insert(self, row: Row) -> None:
"""Append a row at the next available slot."""
page, offset = self.row_slot(self.num_rows)
data = row.serialize()
page[offset : offset + Row.ROW_SIZE] = data
self.num_rows += 1
def read(self, row_num: int) -> Row:
"""Read the row at the given index."""
page, offset = self.row_slot(row_num)
buf = page[offset : offset + Row.ROW_SIZE]
return Row.deserialize(buf)
def close(self) -> None:
"""Flush all pages through the Pager and close the file."""
num_full_pages = self.num_rows // self.ROWS_PER_PAGE
for i in range(num_full_pages + 1):
if self.pager.pages[i] is not None:
self.pager.flush(i)
self.pager.close()
Key details:
-
row_slotreturns a view and an offset, not a copy. The caller writes directly into the Pager’s cached page buffer. This is zero-copy insertion — the serialized bytes go fromRow.serialize()straight into the page cache with a single slice assignment. -
num_rowsis tracked in memory. For now, we do not persist the row count to disk. When the program restarts,num_rowsresets to zero. We will fix this when we move to B-Tree storage, where the tree structure implicitly encodes the row count. -
closeflushes only the pages that contain data. There is no point in writing an empty page to disk.
Persistence Across Restarts
Even without persisting num_rows, the data survives. Here is a demonstration:
# Session 1: insert rows
pager = Pager("test.db")
pager.open()
table = Table(pager)
table.insert(Row(1, "alice"))
table.insert(Row(2, "bob"))
table.insert(Row(3, "charlie"))
table.close() # flushes pages to disk
# Session 2: read rows back (after restarting the program)
pager = Pager("test.db")
pager.open()
# Read raw page data — the rows are still there
page0 = pager.get_page(0)
row0 = Row.deserialize(page0[0:36])
row1 = Row.deserialize(page0[36:72])
row2 = Row.deserialize(page0[72:108])
print(row0) # Row(id=1, username='alice')
print(row1) # Row(id=2, username='bob')
print(row2) # Row(id=3, username='charlie')
The Pager.close() call in Session 1 wrote the page to disk. Session 2’s get_page(0) read it back. The binary data is identical — struct.pack produced the same bytes, and struct.unpack recovered the same values. This is the fundamental promise of the storage engine: what you write is what you get back, even across process boundaries.
The limitation is obvious: without a stored row count, we cannot tell where valid data ends and zero-padding begins. And without sorted keys, every lookup requires scanning all rows. Both problems are solved in the next chapter when we introduce the Cursor and the Virtual Machine.