Benchmarking the Linear Scan
SummaryThis section introduces benchmarking to measure linear scan...
This section introduces benchmarking to measure linear scan...
This section introduces benchmarking to measure linear scan performance degradation with increasing data volume. Linear scan, a traversal method accessing all rows sequentially without indexes, exhibits O(N) time complexity, meaning execution time grows linearly with row count. Key technical details include a fixed row size of 36 bytes and page size of 4096 bytes with an 8-byte header, allowing approximately 113 rows per page. A Python 3.12+ benchmarking script demonstrates this by inserting rows (default 10,000) and timing SELECT queries using components like Pager, Table, Row, and VirtualMachine, all leveraging struct and memoryview for binary data manipulation. The results, presented in a table, show estimated select times scaling linearly, e.g., from 0.001 seconds for 1,000 rows to 1.0 second for 1,000,000 rows, illustrating the performance bottleneck. A diagram visually represents the sequential page access during linear scan. The analysis concludes that while linear scan is fundamental, its inefficiency with large datasets motivates the need for tree-based indexing for improved performance.
Benchmarking the Linear Scan
We have a working database: insert rows, select them back, persist to disk. But how does it perform? The answer matters because it tells us exactly why we need a B-Tree. Without measuring, the motivation for the next three chapters would be hand-waving. Let’s make it concrete.
The Cost of O(N)
Our execute_select walks every row in sequence. If there are N rows, the cursor calls advance() exactly N times, and each call reads 36 bytes from a page buffer. The complexity is linear: double the rows, double the time.
For a SELECT that returns a single row by id — say, “find user 5000” — linear scan is catastrophic. The database still has to touch all N rows because it does not know where id 5000 lives. There is no index, no sorted order, no shortcut.
Benchmarking Script
The following script inserts a configurable number of rows and then measures how long a full scan takes:
# benchmark.py
import time
from pager import Pager
from table import Table
from vm import VirtualMachine
from row import Row
def benchmark(num_rows: int, db_file: str = ":memory:") -> None:
"""Insert num_rows, then time a full select."""
pager = Pager(db_file)
pager.open()
table = Table(pager)
vm = VirtualMachine(table)
# --- Insert phase ---
t0 = time.perf_counter()
for i in range(num_rows):
vm.execute_insert(i, f"user{i:06d}")
t_insert = time.perf_counter() - t0
# --- Select phase ---
t0 = time.perf_counter()
rows = vm.execute_select()
t_select = time.perf_counter() - t0
assert len(rows) == num_rows
pages_used = (num_rows // 113) + 1
print(
f"Rows: {num_rows:>8,} | "
f"Pages: {pages_used:>5,} | "
f"Insert: {t_insert:.4f}s | "
f"Select: {t_select:.4f}s | "
f"Per-row select: {t_select / num_rows * 1e6:.2f} μs"
)
table.close()
if __name__ == "__main__":
for n in [1_000, 5_000, 10_000, 50_000]:
benchmark(n, f"/tmp/bench_{n}.db")
Expected Results
On a typical machine, the numbers look roughly like this:
| Rows | Pages | Select Time | Per-Row Cost | Speedup Needed for O(log N) |
|---|---|---|---|---|
| 1,000 | 9 | ~0.002 s | ~2 μs | 1× (acceptable) |
| 5,000 | 45 | ~0.010 s | ~2 μs | 5× |
| 10,000 | 89 | ~0.020 s | ~2 μs | 10× |
| 50,000 | 443 | ~0.100 s | ~2 μs | 50× |
| 100,000 | 885 | ~0.200 s | ~2 μs | 100× |
| 1,000,000 | 8,850 | ~2.000 s | ~2 μs | 1,000× |
The per-row cost stays roughly constant (as expected for sequential memory access), so the total time scales linearly. At 1 million rows, a full scan that touches every row takes about 2 seconds. That might be tolerable for analytics, but for a point query — “give me the row where id = 500,000” — spending 1 second to discard 999,999 irrelevant rows is unacceptable.
Visualizing the Scaling
Select time (s)
│
2.0 │ ╱
│ ╱
1.5 │ ╱
│ ╱
1.0 │ ╱
│ ╱
0.5 │ ╱
│ ╱
0.1 │ ╱
│ ╱
└────────────────────────────────────── Rows
0 100K 200K 300K 400K 500K ... 1M
This is a straight line — the textbook signature of O(N). Compare it to what a B-Tree gives us:
- A B-Tree with branching factor 113 (our page capacity) and 1 million rows has a depth of about $\lceil \log_{113}(1{,}000{,}000) \rceil = 3$ levels.
- A point query reads exactly 3 pages instead of 8,850. That is a 2,950× speedup.
Why Not Just Sort the Rows?
You might think: “What if we keep the rows sorted by id and do binary search?” That helps for reads — binary search over 8,850 pages is about 13 seeks — but kills writes. Inserting a row in sorted order requires shifting all subsequent rows forward by 36 bytes. For a million rows, that means rewriting thousands of pages on every insert.
A B-Tree solves both problems. Reads are O(log N) because the tree is sorted and balanced. Writes are also O(log N) because insertions only modify one leaf node (plus occasionally splitting and updating a parent). The tree maintains sorted order structurally, without ever shifting bulk data.
That is what we build next.