python database internals building a persistent engine from scratch

The Disk vs. Memory Dichotomy

5 min read Chapter 2 of 21
Summary

This section explains the necessity of a custom memory manager, the Pager, in database systems due to the disk vs. memory dichotomy. It introduces volatile storage (e.g., RAM, ~100 ns access) for temporary data and non-volatile storage (e.g., disk, ~10 ms access) for persistence. The Pager manages fixed-size pages (4 KB) as atomic units for I/O optimization, bridging volatile and non-volatile layers. A layered architecture overview includes REPL, Parser, Virtual Machine, B-Tree, Pager, and OS Interface. The memory hierarchy, from CPU caches to disk, highlights latency gaps requiring custom caching. A comparison table details storage types by speed, volatility, and database applications. Python 3.12+ code examples demonstrate low-level page manipulation using struct and memoryview. Key terms defined include Page (fixed-size data block), Volatile Storage (loses data without power), and Non-volatile Storage (retains data). The Pager ensures predictable performance through page allocation, caching, and eviction, critical for database efficiency.

The Disk vs. Memory Dichotomy

Your program’s variables live in RAM. RAM is fast — a read takes roughly 100 nanoseconds — but it is volatile. Kill the process, pull the power cord, and every byte vanishes. A database that only lived in RAM would be useless for any workload that requires durability.

Disk (whether spinning rust or flash) is the opposite: non-volatile, meaning data survives power loss, but agonizingly slow by comparison. A single random read from an SSD takes around 100 microseconds — a thousand times slower than RAM. A spinning HDD is worse: 10 milliseconds per seek, or a hundred thousand times slower.

This gap, three to five orders of magnitude depending on the device, is the central tension in every storage engine. Our entire architecture exists to hide it.
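To make the size of the gap concrete, here is a small sketch that turns the rough latencies quoted above into ratios. The constants are the approximate figures from the text, not measurements, and the helper name slowdown is ours.

```python
import math

# Approximate latencies quoted in the text, in nanoseconds (not measurements).
RAM_NS = 100
SSD_NS = 100_000        # ~100 microseconds
HDD_NS = 10_000_000     # ~10 milliseconds


def slowdown(device_ns: int, baseline_ns: int = RAM_NS) -> tuple[int, float]:
    """Return (ratio, orders of magnitude) of a device relative to the baseline."""
    ratio = device_ns // baseline_ns
    return ratio, math.log10(ratio)


ssd_ratio, ssd_orders = slowdown(SSD_NS)
hdd_ratio, hdd_orders = slowdown(HDD_NS)
print(f"SSD: {ssd_ratio:,}x slower than RAM ({ssd_orders:.0f} orders of magnitude)")
print(f"HDD: {hdd_ratio:,}x slower than RAM ({hdd_orders:.0f} orders of magnitude)")
```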

The Memory Hierarchy

To appreciate why we need a custom Pager, look at the full latency landscape:

Storage Level     Typical Latency   Volatile?   Capacity
CPU L1 Cache      ~1 ns             Yes         64 KB
CPU L3 Cache      ~10 ns            Yes         8–32 MB
RAM (DDR5)        ~100 ns           Yes         8–256 GB
NVMe SSD          ~100 μs           No          256 GB – 4 TB
SATA SSD          ~500 μs           No          256 GB – 4 TB
HDD (7200 RPM)    ~10 ms            No          1–20 TB

The jump from RAM to an NVMe SSD is a 1000× cliff. Everything above the cliff is volatile; everything below is persistent. A database engine must keep hot data above the cliff (in RAM) while ensuring all committed data eventually lands below the cliff (on disk). That is the Pager’s job.

Why Not Let the OS Handle It?

Modern operating systems already cache file data in the page cache (also called the buffer cache). When you read() a file, the kernel may serve the data from RAM without touching the disk at all. So why build our own caching layer?

Three reasons:

  1. Eviction control. The OS page cache uses a general-purpose, LRU-like policy. It has no idea which of our database pages are “hot” (frequently queried) versus “cold.” It might evict a critical B-Tree root page to make room for a log file some other process is writing. Our Pager can pin important pages and evict strategically.

  2. Write ordering. Durability requires that we write the WAL (Write-Ahead Log) record before the modified data page. The OS buffer cache provides no such ordering guarantee — it can flush dirty pages in any sequence. We need explicit fsync() calls at precise moments, which means we must control when and what gets written.

  3. Page-aligned I/O. If we always read and write in exact multiples of PAGE_SIZE, every I/O operation maps cleanly to OS pages. No partial reads, no read-modify-write cycles in the kernel. The OS page cache works best when the application cooperates with it, not when it fights against it.
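The write-ordering point can be sketched with raw file descriptors. This is a minimal illustration, not the engine's actual WAL: the record format (a 4-byte page number followed by the page image), the file names, and the helper names are all assumptions made for the example.

```python
import os
import tempfile

PAGE_SIZE = 4096  # page size used throughout the text
O_BINARY = getattr(os, "O_BINARY", 0)  # only exists (and matters) on Windows


def _write_all(fd: int, data: bytes) -> None:
    """os.write may write fewer bytes than requested, so loop until done."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def durable_page_write(wal_fd: int, db_fd: int, page_no: int, page: bytes) -> None:
    """Force the log record to disk before the data page is overwritten."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    # 1. Append the log record (hypothetical format: page number + page image).
    _write_all(wal_fd, page_no.to_bytes(4, "big") + page)
    os.fsync(wal_fd)  # the record is on stable storage before step 2 starts
    # 2. Only now overwrite the page in place in the database file.
    os.lseek(db_fd, page_no * PAGE_SIZE, os.SEEK_SET)
    _write_all(db_fd, page)
    os.fsync(db_fd)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    wal_fd = os.open(os.path.join(tmp, "demo.wal"),
                     os.O_RDWR | os.O_CREAT | os.O_APPEND | O_BINARY)
    db_fd = os.open(os.path.join(tmp, "demo.db"),
                    os.O_RDWR | os.O_CREAT | O_BINARY)
    try:
        durable_page_write(wal_fd, db_fd, 2, b"\x07" * PAGE_SIZE)
        wal_size = os.fstat(wal_fd).st_size
        db_size = os.fstat(db_fd).st_size
    finally:
        os.close(wal_fd)
        os.close(db_fd)
```

The important detail is the first os.fsync(): without it, the kernel would be free to flush the data page before the log record, which is exactly the ordering problem described in point 2.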

The Page: Atomic Unit of Transfer

We define a page as a fixed-size block of PAGE_SIZE (4096) bytes. Every transfer between memory and disk moves exactly one page. No more, no less.

This constraint gives us three properties:

  • Alignment. Every page starts at a file offset that is a multiple of 4096. The OS can map it directly to a virtual memory page without splitting.
  • Atomicity (best-effort). On many systems, a 4 KB aligned write is atomic in practice: it either fully succeeds or fully fails. POSIX does not guarantee this. It tends to hold on ext4 and APFS only when the write fits in a single physical sector, so a drive with smaller sectors can still leave a page torn after a crash. We will reinforce this property with the WAL in Chapter 6.
  • Simplicity. The Pager never needs to deal with variable-length reads. Given a page number n, the byte offset is always n * PAGE_SIZE. Given a byte offset, the page number is always offset // PAGE_SIZE.
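The offset arithmetic from the last bullet fits in a few lines. The function names below are illustrative and may differ from the ones the Pager ends up using.

```python
PAGE_SIZE = 4096


def page_offset(page_no: int) -> int:
    """Byte offset in the file where page `page_no` starts."""
    if page_no < 0:
        raise ValueError("page number must be non-negative")
    return page_no * PAGE_SIZE


def page_number(offset: int) -> int:
    """Number of the page that contains byte `offset`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // PAGE_SIZE


def is_page_aligned(offset: int) -> bool:
    """True when `offset` falls exactly on a page boundary."""
    return offset % PAGE_SIZE == 0
```

For example, page_offset(3) is 12288, and any offset from 12288 through 16383 maps back to page 3.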

Think of the database file as an array of pages:

File on disk:
┌──────────┬──────────┬──────────┬──────────┬─────┐
│  Page 0  │  Page 1  │  Page 2  │  Page 3  │ ... │
│ 4096 B   │ 4096 B   │ 4096 B   │ 4096 B   │     │
└──────────┴──────────┴──────────┴──────────┴─────┘
offset: 0      4096      8192      12288
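In memory, one slot of that array can be held as a fixed-size buffer. The sketch below uses bytearray, memoryview, and struct for this; the 8-byte header layout (page type, cell count, next-page pointer) is a made-up example, not the format the B-Tree chapters define.

```python
import struct

PAGE_SIZE = 4096

# Hypothetical header: page type (uint16), cell count (uint16),
# next-page pointer (uint32), all little-endian. 8 bytes in total.
HEADER = struct.Struct("<HHI")

page = bytearray(PAGE_SIZE)   # zero-filled buffer, exactly one page long
view = memoryview(page)       # zero-copy window onto the same bytes

# Write the header in place at offset 0, then read it back.
HEADER.pack_into(view, 0, 1, 3, 42)
page_type, cell_count, next_page = HEADER.unpack_from(view, 0)

# Slicing a memoryview does not copy, so writes through `body`
# land directly in `page`.
body = view[HEADER.size:]
body[:5] = b"hello"
```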

The Pager’s responsibility is to make this array feel like it lives in memory. When the B-Tree asks for page 7, the Pager checks its cache. On a hit, it returns a memoryview instantly. On a miss, it seeks to offset 7 * 4096, reads 4096 bytes, stores them in the cache, and returns the view. The B-Tree never knows whether the data came from RAM or disk.
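As a rough preview, the hit and miss paths just described might look like the sketch below. The class name TinyPager, its counters, and the in-memory backing file are assumptions for illustration. It deliberately leaves out eviction, pinning, and writes.

```python
import io
from typing import BinaryIO

PAGE_SIZE = 4096


class TinyPager:
    """Illustrative read-through page cache: no eviction, pinning, or writes."""

    def __init__(self, file: BinaryIO) -> None:
        self._file = file
        self._cache: dict[int, bytearray] = {}
        self.hits = 0
        self.misses = 0

    def get_page(self, page_no: int) -> memoryview:
        buf = self._cache.get(page_no)
        if buf is None:
            # Miss: seek to the page's offset and read exactly one page.
            self.misses += 1
            self._file.seek(page_no * PAGE_SIZE)
            data = self._file.read(PAGE_SIZE)
            if len(data) != PAGE_SIZE:
                raise ValueError(f"page {page_no} is missing or truncated")
            buf = bytearray(data)
            self._cache[page_no] = buf
        else:
            # Hit: no I/O at all.
            self.hits += 1
        return memoryview(buf)


# Demo against an in-memory "file" of 8 pages, each filled with its page number.
backing = io.BytesIO(b"".join(bytes([n]) * PAGE_SIZE for n in range(8)))
pager = TinyPager(backing)
first = pager.get_page(7)    # miss: read from the backing file
second = pager.get_page(7)   # hit: served from the cache
```

The caller receives a memoryview either way, so nothing above the Pager needs to know which path was taken.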

This is the abstraction we will build in the next chapter. For now, the takeaway is this: the Pager exists because the speed gap between RAM and disk is too large to ignore, the OS page cache is too general to rely on, and fixed-size pages give us the alignment and simplicity needed to build everything else on top.