Skip to main content
python database internals building a persistent engine from scratch

Cursor Abstraction and Linear Scan

4 min read Chapter 7 of 21
Summary

The Cursor abstraction facilitates sequential row iteration in...

The Cursor abstraction facilitates sequential row iteration in database tables via linear scan, ensuring O(N) time complexity as per boundary constraints. Implemented in Python with the Cursor class, it collaborates with Table and Pager components: Table manages row storage with Pager, and Cursor uses start, end, and advance methods for traversal. Key methods include cursor_start (initializes to first row), cursor_end (marks termination), and cursor_advance (progresses with boolean feedback). Row retrieval employs fetch_row via Table, leveraging binary data manipulation with struct and memoryview for performance. Design rationale emphasizes modular simplicity, postponing index-based optimizations like B-trees. Example diagrams illustrate page-based traversal, and data tables demonstrate cursor states during scans.

Cursor Abstraction and Linear Scan

So far we can insert rows and read them back by index, but the Table class has a problem: callers need to know the exact row number they want. A SELECT statement does not know row numbers — it needs to iterate over all rows (or all rows matching some condition) and return them one at a time. That is what the Cursor provides.

A Cursor is a lightweight object that tracks a position within the table. It knows which page it is on and which row within that page it is pointing at. You can advance it forward, check if it has reached the end, and read or write the row at its current position.

Why Cursors?

You might wonder why we need a separate abstraction instead of a simple for i in range(table.num_rows) loop. Two reasons:

  1. Decoupling storage layout from iteration logic. Right now, rows are stored sequentially and the cursor walks them in order. When we switch to a B-Tree in Chapter 4, the cursor will walk leaf nodes instead. The VM that calls cursor.advance() does not change — only the cursor’s internal navigation does. This is the power of the abstraction.

  2. Pointing to insertion targets. An insert operation needs to know where to put the new row. The cursor can point to “one past the last row” for appends, or later, to the correct sorted position in a B-Tree leaf.

The Cursor Class

# cursor.py
from __future__ import annotations
from typing import TYPE_CHECKING
from row import Row
from constants import PAGE_SIZE

if TYPE_CHECKING:
    from table import Table

class Cursor:
    """Points to a specific row position within a Table."""

    def __init__(self, table: Table, row_num: int, end_of_table: bool) -> None:
        self.table: Table = table
        self.row_num: int = row_num
        self.end_of_table: bool = end_of_table

    @classmethod
    def table_start(cls, table: Table) -> Cursor:
        """Create a cursor pointing to the first row."""
        return cls(table, row_num=0, end_of_table=(table.num_rows == 0))

    @classmethod
    def table_end(cls, table: Table) -> Cursor:
        """Create a cursor pointing one past the last row (insert position)."""
        return cls(table, row_num=table.num_rows, end_of_table=True)

    def value(self) -> tuple[memoryview, int]:
        """Return (page_view, byte_offset) for the current row.
        The caller can read or write through the returned memoryview.
        """
        return self.table.row_slot(self.row_num)

    def advance(self) -> None:
        """Move the cursor to the next row."""
        self.row_num += 1
        if self.row_num >= self.table.num_rows:
            self.end_of_table = True

The two class methods — table_start and table_end — create cursors at well-defined positions. table_start points to row 0 (or signals end-of-table immediately if the table is empty). table_end points one past the last row, which is exactly where the next insert should go.

The value method delegates straight to Table.row_slot. It returns the same (memoryview, offset) pair, giving the caller zero-copy access to the page buffer. Reading a row means calling Row.deserialize(page[offset:offset + ROW_SIZE]). Writing means slicing the serialized bytes into the same region.

advance increments the row number and checks if we have walked past the last valid row. This is O(1) per call, and a full scan over N rows is O(N) — which is the best we can do without an index.

Using the Cursor

Here is a linear scan that prints every row:

def print_all_rows(table: Table) -> None:
    cursor = Cursor.table_start(table)
    while not cursor.end_of_table:
        page, offset = cursor.value()
        row = Row.deserialize(page[offset : offset + Row.ROW_SIZE])
        print(row)
        cursor.advance()

And an insertion using the end cursor:

def append_row(table: Table, row: Row) -> None:
    cursor = Cursor.table_end(table)
    page, offset = cursor.value()
    page[offset : offset + Row.ROW_SIZE] = row.serialize()
    table.num_rows += 1

These two operations — scan everything and append at the end — are the building blocks of our SELECT and INSERT commands. The next section connects them to a Virtual Machine that the REPL can drive.