Benchmarking the Linear Scan

We have a working database: insert rows, select them back, persist to disk. But how does it perform? The answer matters because it tells us exactly why we need a B-Tree. Without measuring, the motivation for the next three chapters would be hand-waving. Let’s make it concrete.

The Cost of O(N)

Our execute_select walks every row in sequence. If there are N rows, the cursor calls advance() exactly N times, and each call reads 36 bytes from a page buffer. The complexity is linear: double the rows, double the time.

For a SELECT that returns a single row by id — say, “find user 5000” — linear scan is catastrophic. The database still has to touch all N rows because it does not know where id 5000 lives. There is no index, no sorted order, no shortcut.

Benchmarking Script

The following script inserts a configurable number of rows and then measures how long a full scan takes:

# benchmark.py
import time
from pager import Pager
from table import Table
from vm import VirtualMachine
from row import Row


def benchmark(num_rows: int, db_file: str = ":memory:") -> None:
    """Insert num_rows, then time a full select."""
    pager = Pager(db_file)
    pager.open()
    table = Table(pager)
    vm = VirtualMachine(table)

    # --- Insert phase ---
    t0 = time.perf_counter()
    for i in range(num_rows):
        vm.execute_insert(i, f"user{i:06d}")
    t_insert = time.perf_counter() - t0

    # --- Select phase ---
    t0 = time.perf_counter()
    rows = vm.execute_select()
    t_select = time.perf_counter() - t0

    assert len(rows) == num_rows

    pages_used = (num_rows // 113) + 1
    print(
        f"Rows: {num_rows:>8,} | "
        f"Pages: {pages_used:>5,} | "
        f"Insert: {t_insert:.4f}s | "
        f"Select: {t_select:.4f}s | "
        f"Per-row select: {t_select / num_rows * 1e6:.2f} μs"
    )

    table.close()


if __name__ == "__main__":
    for n in [1_000, 5_000, 10_000, 50_000]:
        benchmark(n, f"/tmp/bench_{n}.db")

Expected Results

On a typical machine, the numbers look roughly like this:

Rows	Pages	Select Time	Per-Row Cost	Speedup Needed for O(log N)
1,000	9	~0.002 s	~2 μs	1× (acceptable)
5,000	45	~0.010 s	~2 μs	5×
10,000	89	~0.020 s	~2 μs	10×
50,000	443	~0.100 s	~2 μs	50×
100,000	885	~0.200 s	~2 μs	100×
1,000,000	8,850	~2.000 s	~2 μs	1,000×

The per-row cost stays roughly constant (as expected for sequential memory access), so the total time scales linearly. At 1 million rows, a full scan that touches every row takes about 2 seconds. That might be tolerable for analytics, but for a point query — “give me the row where id = 500,000” — spending 1 second to discard 999,999 irrelevant rows is unacceptable.

Visualizing the Scaling

Select time (s)
    │
2.0 │                                        ╱
    │                                      ╱
1.5 │                                   ╱
    │                                ╱
1.0 │                            ╱
    │                        ╱
0.5 │                   ╱
    │              ╱
0.1 │        ╱
    │   ╱
    └────────────────────────────────────── Rows
     0   100K  200K  300K  400K  500K  ... 1M

This is a straight line — the textbook signature of O(N). Compare it to what a B-Tree gives us:

A B-Tree with branching factor 113 (our page capacity) and 1 million rows has a depth of about $\lceil \log_{113}(1{,}000{,}000) \rceil = 3$ levels.
A point query reads exactly 3 pages instead of 8,850. That is a 2,950× speedup.

Why Not Just Sort the Rows?

You might think: “What if we keep the rows sorted by id and do binary search?” That helps for reads — binary search over 8,850 pages is about 13 seeks — but kills writes. Inserting a row in sorted order requires shifting all subsequent rows forward by 36 bytes. For a million rows, that means rewriting thousands of pages on every insert.

A B-Tree solves both problems. Reads are O(log N) because the tree is sorted and balanced. Writes are also O(log N) because insertions only modify one leaf node (plus occasionally splitting and updating a parent). The tree maintains sorted order structurally, without ever shifting bulk data.

That is what we build next.