Skip to main content
data systems mechanics invariants in distributed architectures

B-Trees: The Read-Optimized Mutable Standard

4 min read Chapter 3 of 28
Summary

This section establishes the fundamental dichotomy between mutable,...

This section establishes the fundamental dichotomy between mutable, page-oriented storage (B-Trees) and immutable, log-structured storage (LSM-Trees). It explains the core mechanics of update-in-place versus out-of-place appends, directly linking these to their resulting I/O patterns: random writes for B-Trees and sequential writes for LSM-Trees. The analysis details the fragmentation trade-off—internal and external for B-Trees versus storage amplification from compaction for LSM-Trees—and underscores the critical role of the Write-Ahead Log (WAL) in enabling durable in-place updates. A comparative table synthesizes these characteristics, and contrasting pseudocode illustrates the procedural differences. The narrative frames these as immutable trade-offs, optimized for read-heavy versus write-heavy workloads respectively.

B-Trees: The Read-Optimized Mutable Standard

B-Trees are the canonical example of a mutable, page-oriented storage structure. They maintain a balanced, searchable tree where each key resides in exactly one node, ensuring low read amplification. Updates occur in-place, with the database managing a buffer pool of fixed-size pages cached in RAM. A B-Tree update involves reading the target page into the buffer pool, modifying it in memory, and eventually writing the dirty page back to its exact same location on disk.

The update-in-place mechanism in B-Trees creates random write I/O patterns, as the modified page can be anywhere on disk. Random writes are significantly slower than sequential writes on magnetic hard disk drives (HDDs), making B-Trees less optimal for write-heavy workloads. However, for read-heavy workloads, B-Trees often have lower latency because the key is found in a single, predictable location.

Fragmentation in B-Trees

Fragmentation in B-Trees occurs internally (within partially filled pages) and externally (free space gaps from deleted pages). Internal fragmentation wastes space inside pages, while external fragmentation makes it difficult to allocate large contiguous regions, though this is less critical for SSDs. The use of a Write-Ahead Log (WAL) is critical for B-Trees to guarantee durability despite update-in-place mechanics.

Log-Structured Storage: A Contrasting Approach

Log-structured storage, as used in LSM-Trees, avoids update-in-place by appending all new data (including updates and deletes) to the end of a log file. This approach converts random application writes into sequential disk writes during memtable flushes and compaction processes. LSM-Trees do not suffer from update-in-place fragmentation but have storage amplification due to data duplication across SSTable levels during compaction.

Comparative Analysis

The following table contrasts the core mechanics, I/O patterns, and trade-offs between page-oriented (B-Tree) and log-structured (LSM-Tree) storage models.

CharacteristicPage-Oriented Storage (B-Tree)Log-Structured Storage (LSM-Tree)
Update ModelIn-place updateOut-of-place append (immutable)
Primary Write I/O PatternRandom (modify page at location X)Sequential (append to log/SSTable)
Primary Read I/O PatternRandom (seek to page location)Random (seek within SSTable files)
FragmentationInternal (partially full pages) & External (free space gaps)None from updates; space overhead from duplicates during compaction
Crash Recovery MechanismWrite-Ahead Log (WAL) mandatorySeparate WAL for memtable only; SSTables are immutable once written
AmplificationWrite Amplification: Moderate-High (page splits/merges). Read Amplification: Low. Storage Amplification: Low.Write Amplification: High (compaction). Read Amplification: Moderate-High (multiple levels). Storage Amplification: Moderate-High (temporary duplicates).
Optimized WorkloadRead-heavy, point reads, workloads needing low latencyWrite-heavy, bulk inserts, append-intensive workloads
Example SystemsTraditional RDBMS (PostgreSQL, MySQL/InnoDB), B-Tree file systemsKey-Value Stores (RocksDB, LevelDB, Cassandra, ScyllaDB)

Code Illustrations

The following pseudocode illustrates the B-Tree insert operation with Write-Ahead Logging, contrasting with the LSM-Tree write path.

function btree_insert_with_wal(BTree* tree, Key key, Value value, WAL* log) {
    // Find the target leaf page
    Page* leaf_page = find_leaf_page(tree, key);
    
    // Prepare log record
    WALRecord rec = create_wal_record(INSERT_OP, leaf_page->page_id, key, value);
    
    // Write log record to stable storage
    log_append_and_flush(log, rec);
    
    // Apply change in memory
    page_insert_key_value(leaf_page, key, value);
    
    // Handle page overflow
    if (page_is_overfull(leaf_page)) {
        btree_split_leaf_with_wal(tree, leaf_page, log);
    }
}

function lsm_write(LSMTree* tree, Key key, Value value) {
    // Optional: Write to a separate WAL for memtable durability
    // Update the in-memory memtable
    memtable_insert(tree->active_memtable, key, value);
    
    // Flush memtable to SSTable when full
    if (memtable_size(tree->active_memtable) > threshold) {
        flush_memtable_to_sstable(tree->active_memtable);
    }
}

The provided code snippets demonstrate the fundamental difference in update mechanics between B-Trees and LSM-Trees, highlighting the trade-offs in I/O patterns, fragmentation, and amplification factors.

Sources

This section has referenced various concepts and technologies related to B-Trees and LSM-Trees. For further reading and implementation details, please refer to the cited sources and supplementary materials.