Skip to main content
data systems from the ground up

Indexing: The Price You Pay to Read Fast

8 min read Chapter 4 of 36

Indexing: The Price You Pay to Read Fast

Chapter 1 built a database from an append-only file and a linear scan. The write path was fast. The read path was O(n) in the total number of records and became unusable as the file grew past a few hundred thousand records.

An index solves the read problem by maintaining a secondary data structure that maps lookup keys to their locations in the data file. The index makes reads fast. It also makes every write slower, because the index must be updated alongside the data. It consumes additional storage. It must be maintained during compaction, recovery, and schema changes.

There is no free index. Every index is a bet that the read performance gain justifies the write overhead. The three major index structures, hash indexes, B-Trees, and LSM-trees, make this bet in different ways. Understanding the mechanics of each one explains why PostgreSQL defaults to B-Trees, why RocksDB uses LSM-trees, and when you should choose one over the other.

Hash Index: The Simplest Upgrade

The append-only file from Chapter 1 stores package tracking events. The linear scan reads every record to find the latest status of one package. The first optimization is obvious: maintain an in-memory hash map that maps each package ID to the byte offset of its latest record in the file.

// Concept: hash index as an in-memory offset map into the append-only log
// Write: append to file, update hash map
// Read: look up offset in hash map, seek to that position, read one record

class HashIndexedStore {
    private final Map<String, Long> index = new HashMap<>();
    private final FileChannel channel;

    void recordEvent(String packageId, String status, Instant ts) throws IOException {
        long offset = channel.position();
        String record = packageId + "\t" + status + "\t" + ts + "\n";
        channel.write(ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8)));
        index.put(packageId, offset);  // O(1) index update
    }

    String getLatestStatus(String packageId) throws IOException {
        Long offset = index.get(packageId);  // O(1) lookup
        if (offset == null) return null;
        channel.position(offset);
        // Read the record at that offset
        ByteBuffer buf = ByteBuffer.allocate(256);
        channel.read(buf);
        buf.flip();
        String line = new String(buf.array(), 0, buf.limit(), StandardCharsets.UTF_8)
                .split("\n")[0];
        return line.split("\t")[1];
    }
}

Reads are now O(1). A 10 million record file that took 1.8 seconds to scan now returns in microseconds. The write path adds one hash map put per record, which is negligible.

This is, in simplified form, how Bitcask (the storage engine behind early versions of Riak) works. It is fast, simple, and has two hard limitations.

The index must fit in memory. If the logistics platform tracks 5 million unique packages, the hash map holds 5 million entries. Each entry is a String key (average 12 bytes) plus a Long offset (8 bytes) plus HashMap overhead (roughly 48 bytes per entry). Total: approximately 340 MB. This is fine for 5 million packages. It is not fine for 500 million.

Range queries are impossible. The hash map can answer “what is the status of PKG-40291?” but not “which packages were delivered between 2pm and 3pm?” Hash maps have no ordering. Every range query degrades to a full scan.

B-Tree structure showing root node, internal nodes, and leaf nodes with keys and page references

The B-Tree maintains keys in sorted order across disk pages. Each internal node contains keys and pointers to child pages. Leaf nodes contain keys and pointers to the actual data rows. A lookup for a specific key traverses from root to leaf, reading one page per tree level. For a table with 10 million rows, this is typically 3-4 page reads, compared to the linear scan’s 10 million record reads.

B-Trees: The Disk-Native Sorted Index

B-Trees solve both limitations of the hash index. They store the index on disk (no memory constraint) and maintain keys in sorted order (range queries are efficient). PostgreSQL, MySQL/InnoDB, SQLite, and most relational databases use B-Trees as their default index structure.

A B-Tree organizes data into fixed-size pages (typically 8KB in PostgreSQL, 16KB in InnoDB). Each page is either an internal node containing keys and pointers to child pages, or a leaf node containing keys and pointers to data rows.

A lookup traverses the tree from root to leaf. At each level, a binary search within the page finds which child pointer to follow. The tree is balanced: every leaf is at the same depth. For a table with $n$ rows and a branching factor of $b$ (keys per page), the depth is $\log_b(n)$.

-- Concept: PostgreSQL B-Tree index creation and its observable cost
-- The package_events table has 10 million rows

CREATE INDEX idx_package_events_package_id
ON package_events (package_id);

-- Index creation output:
-- CREATE INDEX
-- Time: 28456.321 ms (28 seconds)

-- The 28 seconds is the price. During index creation, PostgreSQL:
-- 1. Scans every row in the table (sequential scan)
-- 2. Sorts all 10 million package_id values
-- 3. Writes the sorted values into B-Tree pages
-- 4. Flushes the pages to disk

-- After index creation:
EXPLAIN ANALYZE
SELECT status, timestamp
FROM package_events
WHERE package_id = 'PKG-40291'
ORDER BY timestamp DESC
LIMIT 1;

-- Index Scan using idx_package_events_package_id on package_events
--   Index Cond: (package_id = 'PKG-40291')
-- Planning Time: 0.12 ms
-- Execution Time: 0.08 ms

-- From 3842ms (Chapter 1) to 0.08ms. That is the payoff.

The payoff is dramatic: from 3.8 seconds to 0.08 milliseconds. The price is the 28 seconds of index creation, the ongoing write overhead (every INSERT must update both the table and the index), and the storage cost (the index itself is a separate data structure on disk).

Write Amplification in B-Trees

Every write to a B-Tree indexed table requires at least two writes: one to the data page and one to the index page. If the index page is full, it splits into two pages, and the parent page must be updated, which can cascade up the tree.

PostgreSQL’s B-Tree implementation writes each page modification to the WAL before modifying the page. A single INSERT into a table with 3 indexes generates: 1 WAL record for the heap (table) insert, plus 1 WAL record per index insert, plus potential WAL records for page splits. This is write amplification: one logical write becomes multiple physical writes.

LSM-Trees: Write-Optimized at the Cost of Read Complexity

LSM-trees (Log-Structured Merge-trees) take a different approach. Instead of maintaining a sorted structure on disk and updating it in place, they buffer writes in memory and periodically flush sorted runs to disk. RocksDB, LevelDB, Cassandra, and ScyllaDB use LSM-trees.

The write path:

  1. Write arrives. Append it to the WAL (crash recovery, covered in Chapter 3).
  2. Insert the key-value pair into the memtable, an in-memory sorted structure (typically a red-black tree or skip list).
  3. When the memtable exceeds a size threshold (default 64MB in RocksDB), flush it to disk as a sorted, immutable file called an SSTable (Sorted String Table).
  4. SSTables accumulate on disk. A background compaction process merges SSTables, discarding obsolete entries and producing larger sorted runs.

The read path:

  1. Check the memtable (in memory, fast).
  2. Check each SSTable on disk, starting with the most recent. Each SSTable has a Bloom filter that can quickly answer “is this key definitely not in this file?” with no false negatives.
  3. If the Bloom filter says “maybe,” read the SSTable’s index block and binary search for the key.
# Concept: observing RocksDB's LSM-tree structure using ldb tool
# After inserting 5 million package events:

ldb --db=/var/data/tracking-rocksdb manifest_dump

# Output shows levels and SSTable files:
# --- level 0 --- version# 42
# 000038.sst: size 67108864, keys 524288, seq 4718592..5242879
# 000039.sst: size 67108864, keys 524288, seq 5242880..5767167
# --- level 1 ---
# 000035.sst: size 268435456, keys 2097152, seq 0..2097151
# --- level 2 ---
# 000030.sst: size 268435456, keys 2097152, seq 0..2097151
# 000031.sst: size 268435456, keys 2097152, seq 2097152..4194303

# Level 0: recently flushed memtables, may have overlapping key ranges
# Level 1+: compacted, non-overlapping key ranges within each level

The write path is fast because it only writes to memory and occasionally flushes a sorted run to disk, both sequential operations. The read path is slower than a B-Tree for point lookups, because it may need to check multiple SSTables. The compaction process adds background I/O and CPU load.

The Decision Rule

Choose B-Trees (PostgreSQL, MySQL/InnoDB) when your workload is read-heavy and requires fast point lookups and range scans. The write amplification from page splits and index updates is acceptable when reads outnumber writes by 10:1 or more.

Choose LSM-trees (RocksDB, Cassandra) when your workload is write-heavy. The buffered write path and sequential disk I/O make LSM-trees faster for high write throughput. Accept the higher read amplification (checking multiple SSTables) and the background CPU cost of compaction.

The logistics platform’s package tracking service is write-heavy during business hours (thousands of events per minute) and read-heavy on the dashboard (hundreds of status queries per minute). The write/read ratio determines the choice. If the tracking ingest pipeline is the bottleneck, an LSM-tree engine is worth evaluating. If the dashboard query latency is the problem, a B-Tree index on PostgreSQL is the direct fix.

Neither index type eliminates the append-only log. The B-Tree sits alongside a WAL. The LSM-tree is built on top of one. Chapter 3 explains why the log persists in both architectures and what durability guarantees it provides.