Traversing the Tree
Traversing the Tree
With splitting implemented, our B-Tree can grow beyond a single leaf. But the cursor from Chapter 4 only knows how to iterate within a single leaf node. To perform a full table scan — or to search for a key in a multi-level tree — the cursor must descend from the root through internal nodes to the correct leaf, and then advance across leaf boundaries.
This section updates the Cursor for multi-level traversal and benchmarks the result.
Descending to a Leaf
We already wrote tree_find in the previous section as a recursive function. Let’s refactor it into the Cursor class:
@dataclass
class Cursor:
"""Points to a specific cell within a leaf node of a B-Tree."""
pager: Pager
page_num: int # always a leaf node page
cell_num: int # cell index within that leaf
end_of_table: bool
@classmethod
def table_start(cls, pager: Pager, root_page_num: int) -> "Cursor":
"""Position the cursor at the first cell (smallest key) in the tree."""
page_num = root_page_num
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
# Descend leftmost path until we reach a leaf
while node_type == INTERNAL_NODE_TYPE:
# The leftmost child is cell 0's child pointer
page_num = internal_node_child(page, 0)
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
num_cells = leaf_node_num_cells(page)
return cls(
pager=pager,
page_num=page_num,
cell_num=0,
end_of_table=(num_cells == 0),
)
@classmethod
def find(cls, pager: Pager, root_page_num: int, key: int) -> "Cursor":
"""Position the cursor at the cell containing `key`, or where it would be."""
page_num = root_page_num
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
# Descend through internal nodes
while node_type == INTERNAL_NODE_TYPE:
page_num = internal_node_find_child(page, key)
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
# Now at a leaf — binary search for the key
num_cells = leaf_node_num_cells(page)
index = leaf_node_find(page, key)
return cls(
pager=pager,
page_num=page_num,
cell_num=index,
end_of_table=(index >= num_cells),
)
Both table_start and find use iterative descent instead of recursion. This avoids stack depth issues for deep trees (though depth rarely exceeds 3-4 in practice).
Advancing Across Leaf Boundaries
The advance method from Chapter 4 stops at the end of a single leaf. For a multi-leaf tree, advancing past the last cell of one leaf should move to the first cell of the next leaf in key order.
This requires sibling pointers: each leaf node stores the page number of the next leaf. We add a 4-byte next_leaf field after the header:
Leaf Node Layout (updated):
┌─────────────────────────────────────────────────────────────┐
│ Node Type (1B) │ Is Root (1B) │ Num Cells (4B) │ Next Leaf (4B) │
├─────────────────────────────────────────────────────────────┤
│ Cell 0: [Key (4B)] [Value (36B)] │
│ ... │
└─────────────────────────────────────────────────────────────┘
# Updated constants
LEAF_NODE_NEXT_LEAF_OFFSET: Final[int] = LEAF_NODE_HEADER_SIZE # 6
LEAF_NODE_NEXT_LEAF_SIZE: Final[int] = 4
LEAF_NODE_CELLS_OFFSET: Final[int] = (
LEAF_NODE_HEADER_SIZE + LEAF_NODE_NEXT_LEAF_SIZE
) # 10
# Recalculate max cells with the new offset
LEAF_NODE_MAX_CELLS: Final[int] = (
(PAGE_SIZE - LEAF_NODE_CELLS_OFFSET) // LEAF_NODE_CELL_SIZE
) # (4096 - 10) // 40 = 102 (still 102 — we lost 4 bytes to the pointer,
# but 4086 // 40 = 102 with 6 bytes wasted instead of 10)
NO_SIBLING: Final[int] = 0 # Value meaning "no next leaf"
def leaf_node_next_leaf(page: memoryview) -> int:
"""Read the next-leaf page number. Returns 0 if there is no sibling."""
return struct.unpack_from("<I", page, LEAF_NODE_NEXT_LEAF_OFFSET)[0]
def leaf_node_set_next_leaf(page: memoryview, next_page_num: int) -> None:
"""Write the next-leaf page number."""
struct.pack_into("<I", page, LEAF_NODE_NEXT_LEAF_OFFSET, next_page_num)
Important: The cell offset functions must now use LEAF_NODE_CELLS_OFFSET (10) instead of the old LEAF_NODE_HEADER_SIZE (6). Update leaf_node_cell_offset:
def leaf_node_cell_offset(cell_num: int) -> int:
return LEAF_NODE_CELLS_OFFSET + cell_num * LEAF_NODE_CELL_SIZE
And the initialize_leaf_node function sets next_leaf to 0:
def initialize_leaf_node(page: memoryview, is_root: bool = False) -> None:
struct.pack_into(LEAF_NODE_HEADER_FORMAT, page, 0,
LEAF_NODE_TYPE, 1 if is_root else 0, 0)
leaf_node_set_next_leaf(page, NO_SIBLING)
page[LEAF_NODE_CELLS_OFFSET:] = b"\x00" * (PAGE_SIZE - LEAF_NODE_CELLS_OFFSET)
During leaf_node_split_and_insert, we must update the sibling chain:
# Inside leaf_node_split_and_insert, after creating the new right sibling:
leaf_node_set_next_leaf(new_page, leaf_node_next_leaf(old_page))
leaf_node_set_next_leaf(old_page, new_page_num)
This threads all leaves into a singly linked list in key order.
Updated Cursor.advance
def advance(self) -> None:
"""Move the cursor to the next cell, crossing leaf boundaries if needed."""
self.cell_num += 1
page = self.pager.get_page(self.page_num)
if self.cell_num >= leaf_node_num_cells(page):
# Past the end of this leaf — check for a sibling
next_page_num = leaf_node_next_leaf(page)
if next_page_num == NO_SIBLING:
self.end_of_table = True
else:
self.page_num = next_page_num
self.cell_num = 0
# Check if the sibling is also empty (shouldn't be, but defensive)
next_page = self.pager.get_page(next_page_num)
if leaf_node_num_cells(next_page) == 0:
self.end_of_table = True
With this change, a full table scan starting from Cursor.table_start will follow the leaf chain from the leftmost leaf to the rightmost, producing all rows in ascending key order.
Depth Tracking
For debugging and performance analysis, it helps to know the tree’s depth. A simple traversal from root to the leftmost leaf:
def tree_depth(pager: Pager, root_page_num: int) -> int:
"""Return the depth of the B-Tree (1 = single leaf, 2 = root + leaves, etc.)."""
depth = 1
page_num = root_page_num
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
while node_type == INTERNAL_NODE_TYPE:
page_num = internal_node_child(page, 0) # follow leftmost child
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
depth += 1
return depth
Tree Visualization for Debugging
A print_tree function helps verify tree structure during development:
def print_tree(pager: Pager, page_num: int, indent: int = 0) -> None:
"""Recursively print the tree structure for debugging."""
page = pager.get_page(page_num)
node_type = struct.unpack_from("B", page, 0)[0]
prefix = " " * indent
if node_type == LEAF_NODE_TYPE:
num_cells = leaf_node_num_cells(page)
print(f"{prefix}leaf (page {page_num}): {num_cells} cells", end="")
if num_cells > 0:
first_key = leaf_node_key(page, 0)
last_key = leaf_node_key(page, num_cells - 1)
print(f" keys [{first_key}..{last_key}]")
else:
print()
else:
num_keys = internal_node_num_keys(page)
right_child = internal_node_right_child(page)
print(f"{prefix}internal (page {page_num}): {num_keys} keys")
for i in range(num_keys):
child = internal_node_child(page, i)
key = internal_node_key(page, i)
print_tree(pager, child, indent + 1)
print(f"{prefix} key: {key}")
print_tree(pager, right_child, indent + 1)
Example output for a tree with 300 rows:
internal (page 0): 2 keys
leaf (page 1): 102 cells keys [0..101]
key: 102
leaf (page 2): 102 cells keys [102..203]
key: 204
leaf (page 3): 96 cells keys [204..299]
Benchmarking: B-Tree vs. Linear Scan
With the tree operational, we can compare search performance against the linear scan from Chapter 3:
import time
def benchmark_btree_search(num_rows: int) -> dict:
"""Benchmark B-Tree point lookups and full scan."""
import os
db_path = "bench_btree.db"
if os.path.exists(db_path):
os.remove(db_path)
pager = Pager(db_path)
root = pager.get_page(0)
initialize_leaf_node(root, is_root=True)
pager.num_pages = 1
# Insert N rows
for k in range(num_rows):
row = Row(id=k, username=f"user_{k:06d}")
tree_insert(pager, 0, k, row.serialize())
depth = tree_depth(pager, 0)
# Benchmark: single point lookup (worst case: last key)
start = time.perf_counter_ns()
for _ in range(1000):
cursor = Cursor.find(pager, 0, num_rows - 1)
_ = cursor.value()
lookup_ns = (time.perf_counter_ns() - start) // 1000
# Benchmark: full scan
start = time.perf_counter_ns()
cursor = Cursor.table_start(pager, 0)
count = 0
while not cursor.end_of_table:
_ = cursor.value()
cursor.advance()
count += 1
scan_ns = time.perf_counter_ns() - start
pager.close()
os.remove(db_path)
return {
"rows": num_rows,
"depth": depth,
"lookup_ns": lookup_ns,
"scan_ns": scan_ns,
"scan_count": count,
}
def run_benchmarks():
sizes = [100, 1_000, 10_000, 100_000]
print(f"{'Rows':>10} {'Depth':>6} {'Lookup (ns)':>12} {'Scan (ns)':>12}")
print("-" * 44)
for n in sizes:
r = benchmark_btree_search(n)
print(f"{r['rows']:>10} {r['depth']:>6} {r['lookup_ns']:>12,} {r['scan_ns']:>12,}")
run_benchmarks()
Expected results:
| Rows | Tree Depth | Lookup (ns) | Full Scan (ns) |
|---|---|---|---|
| 100 | 1 | ~500 | ~10,000 |
| 1,000 | 2 | ~800 | ~100,000 |
| 10,000 | 2 | ~900 | ~1,000,000 |
| 100,000 | 3 | ~1,200 | ~10,000,000 |
Key observation: Lookup time barely increases as rows grow from 100 to 100,000. That is $O(\log N)$ in action — specifically $O(d \cdot \log_2 b)$ where $d$ is the tree depth and $b$ is the branching factor. A depth-3 tree requires at most 3 page reads plus a binary search within the final leaf (7 comparisons). This is 3 pages touched, regardless of whether the table has 10,000 or 26 million rows.
Full scan time grows linearly because we visit every cell. There is no shortcut for “read everything.”
Complexity Summary
| Operation | Linear Scan (CH3) | B-Tree (CH5) |
|---|---|---|
| Point lookup | $O(N)$ | $O(\log N)$ |
| Insert | $O(1)$ amortized | $O(\log N)$ |
| Full scan | $O(N)$ | $O(N)$ |
| Space overhead | None | Internal nodes (~1% of data) |
The insert cost increased from $O(1)$ to $O(\log N)$ because we must descend to the correct leaf, and occasionally split. This is an acceptable trade-off: the database spends most of its time reading, not writing.
What We Built
At the end of Chapter 5, we have a complete B-Tree implementation:
- Leaf nodes store sorted key-value pairs (102 per node).
- Internal nodes route searches via binary search over keys (510 keys, 511 children per node).
- Splitting handles overflow by dividing nodes and promoting keys.
- The cursor descends from root to leaf and follows sibling pointers for scans.
- Tree depth grows logarithmically: depth 3 handles 26+ million rows.
What we have not addressed is durability. If the program crashes during a split — after writing the new sibling but before updating the parent — the tree is corrupted. Pages written to disk may be partial. The Pager calls os.write but never os.fsync, so the OS may reorder writes. Chapter 6 introduces the Write-Ahead Log (WAL), the mechanism that makes our storage engine crash-safe.