Skip to main content
python database internals building a persistent engine from scratch

The Virtual Machine (VM)

5 min read Chapter 8 of 21
Summary

This section introduces the Virtual Machine (VM) as...

This section introduces the Virtual Machine (VM) as the execution engine for the database, driving Cursors based on hardcoded commands. Key implementations include the execute_insert method, which serializes Row objects using struct.pack and inserts them into the Table via a Cursor, and the execute_select method, which performs a linear scan using Cursor and deserializes Rows with memoryview. Integration with the REPL is demonstrated through a loop that reads commands, calls VM methods, and prints results. Command mappings show 'insert' and 'select' operations with corresponding data manipulations, while page layout details specify 8-byte headers and 36-byte row slots within 4096-byte pages. The architecture is modular, with flow from User Input through REPL, VM, Table, Pager, to Disk Storage, emphasizing separation of concerns and design simplicity with linear scans. New code interfaces include the VirtualMachine class with methods for insertion and selection, using Python 3.12+ with type hints, struct, and memoryview for low-level binary manipulation.

The Virtual Machine (VM)

The REPL reads a command string from the user. The VM executes it against the database. Between these two sits a thin dispatch layer: the VM receives a parsed command, decides whether it is an insert or a select, and drives the Cursor and Table to carry out the operation.

For now, the VM handles only two hardcoded commands — insert and select. There is no SQL parsing; the REPL splits the input string and passes the parts directly to the VM. We will add a proper tokenizer and parser in Chapter 7. The point of this section is to wire the execution path end to end: user input → VM → Cursor → Table → Pager → disk.

The VirtualMachine Class

# vm.py
from enum import Enum
from row import Row
from cursor import Cursor
from table import Table


class ExecuteResult(Enum):
    SUCCESS = "success"
    TABLE_FULL = "table_full"
    DUPLICATE_KEY = "duplicate_key"


class VirtualMachine:
    """Executes commands against the Table using Cursors."""

    def __init__(self, table: Table) -> None:
        self.table: Table = table

    def execute_insert(self, id_val: int, username: str) -> ExecuteResult:
        """Insert a new row at the end of the table."""
        row = Row(id_val, username)
        cursor = Cursor.table_end(self.table)

        page, offset = cursor.value()
        page[offset : offset + Row.ROW_SIZE] = row.serialize()
        self.table.num_rows += 1

        return ExecuteResult.SUCCESS

    def execute_select(self) -> list[Row]:
        """Return every row in the table via a full linear scan."""
        rows: list[Row] = []
        cursor = Cursor.table_start(self.table)

        while not cursor.end_of_table:
            page, offset = cursor.value()
            row = Row.deserialize(page[offset : offset + Row.ROW_SIZE])
            rows.append(row)
            cursor.advance()

        return rows

execute_insert creates a Row, obtains a cursor pointing one past the last row, serializes the row directly into the page buffer through the cursor’s value() view, and bumps the row count. The entire operation is zero-copy from serialization into the page cache — no intermediate buffer.

execute_select creates a cursor at the table start and walks forward one row at a time. Each row is deserialized from the page’s memoryview and collected into a list. This is an O(N) full scan — every row is touched exactly once.

Notice that neither method calls pager.flush(). Writes accumulate in the in-memory page cache until the table is closed. This is a deliberate choice: flushing on every insert would make performance dependent on disk latency. The Table.close() method handles the final flush.

The REPL

The Read-Eval-Print Loop ties everything together. It reads a line from stdin, dispatches to the VM, and prints the result:

# repl.py
from vm import VirtualMachine, ExecuteResult
from table import Table
from pager import Pager


def repl(db_filename: str) -> None:
    pager = Pager(db_filename)
    pager.open()
    table = Table(pager)
    vm = VirtualMachine(table)

    while True:
        try:
            line = input("db > ").strip()
        except EOFError:
            break

        if line == ".exit":
            table.close()
            print("Bye.")
            return

        if line.startswith("insert"):
            parts = line.split()
            if len(parts) != 3:
                print("Syntax: insert <id> <username>")
                continue
            try:
                id_val = int(parts[1])
                username = parts[2]
            except ValueError:
                print("Error: id must be an integer.")
                continue

            result = vm.execute_insert(id_val, username)
            if result == ExecuteResult.SUCCESS:
                print("Executed.")
            else:
                print(f"Error: {result.value}")

        elif line == "select":
            rows = vm.execute_select()
            for row in rows:
                print(f"({row.id_val}, {row.username})")

        else:
            print(f"Unrecognized command: {line!r}")

A sample session:

db > insert 1 alice
Executed.
db > insert 2 bob
Executed.
db > select
(1, alice)
(2, bob)
db > .exit
Bye.

The .exit meta-command calls table.close(), which flushes all dirty pages through the Pager before closing the file descriptor. Without this, data written during the session would live only in the page cache and be lost.

Data Flow

Here is the complete path for an insert 1 alice command:

User types "insert 1 alice"


REPL parses → id_val=1, username="alice"


VM.execute_insert(1, "alice")

  ├─ Row(1, "alice").serialize()  →  36 bytes

  ├─ Cursor.table_end(table)  →  row_num = table.num_rows

  ├─ cursor.value()  →  table.row_slot(row_num)
  │       │
  │       ├─ page_num = row_num // 113
  │       ├─ offset   = (row_num % 113) * 36
  │       └─ pager.get_page(page_num)  →  memoryview (cache hit or disk read)

  ├─ page[offset : offset+36] = serialized_bytes   ← zero-copy write

  └─ table.num_rows += 1

For select:

VM.execute_select()

  ├─ Cursor.table_start(table)  →  row_num = 0

  └─ while not cursor.end_of_table:
       ├─ cursor.value()  →  (page_view, offset)
       ├─ Row.deserialize(page_view[offset:offset+36])
       ├─ append to results
       └─ cursor.advance()  →  row_num += 1

Both paths are clean, composable, and built entirely from the primitives we defined in earlier chapters: Row.serialize/deserialize, Table.row_slot, Cursor.advance, and Pager.get_page.

What We Have and What We Lack

At this point, the database works. You can insert rows, select them back, persist to disk, and restart. But two critical weaknesses remain:

  1. Every select is a full scan. There is no way to find a single row without reading all of them. For 10 rows this is fine; for 10 million it is not.

  2. No duplicate detection. Inserting two rows with the same id succeeds silently. There is no uniqueness constraint.

Both of these require a sorted index structure — the B-Tree we begin in Chapter 4. Before building it, we need to understand exactly how bad the linear scan problem is. The next section measures it.