LSM-Trees, SSTables, and Why RocksDB Chose Differently
LSM-Trees, SSTables, and Why RocksDB Chose Differently
The Black Box
RocksDB is an embedded key-value store. You put a key-value pair, and it is stored. You get a key, and the value comes back. The write is fast. The read is fast enough. Under heavy write load, RocksDB outperforms B-Tree databases. Under read-heavy point lookups, B-Trees are faster. These are not magic properties. They are direct consequences of how LSM-trees organize data on disk.
The Mechanism
An LSM-tree has two regions: an in-memory buffer called the memtable, and a hierarchy of immutable sorted files on disk called SSTables (Sorted String Tables).
The Write Path
- A write arrives:
PUT("PKG-40291", "{status: DELIVERED, ts: 2024-11-15T14:22:00Z}"). - The key-value pair is appended to the WAL (crash safety, covered in Chapter 3).
- The pair is inserted into the memtable, a sorted in-memory data structure. RocksDB uses a skip list. Insertions are $O(\log n)$.
- When the memtable reaches its size threshold (default 64MB in RocksDB), it becomes immutable, and a new empty memtable takes its place.
- A background thread flushes the immutable memtable to disk as an SSTable: a sorted, compressed, immutable file.
The write path never reads from disk. It never updates existing files. It only appends to the WAL and inserts into an in-memory structure. This is why LSM-trees handle high write throughput.
The SSTable Format
Each SSTable is divided into blocks (default 4KB in RocksDB):
- Data blocks: sorted key-value pairs, optionally compressed with Snappy, LZ4, or Zstd.
- Index block: maps the first key of each data block to its offset. A lookup binary-searches the index block, then reads the single data block containing the target key.
- Bloom filter block: a probabilistic data structure that answers “is this key in this SSTable?” with no false negatives and a configurable false positive rate (default 1% with 10 bits per key).
The Read Path
To read PKG-40291:
- Check the active memtable. If found, return immediately.
- Check the immutable memtable (if one exists, waiting to be flushed). If found, return.
- For each SSTable, starting with the most recently created: a. Check the Bloom filter. If it says “definitely not here,” skip this file. b. If the Bloom filter says “maybe here,” binary search the index block for the target key’s data block. c. Read and decompress the data block. Search for the key within the block.
- Return the first match found (most recent SSTable wins).
The worst case for a read is checking the Bloom filter of every SSTable before finding the key in the oldest one. This is read amplification.
// Concept: Bloom filter probability
// With 10 bits per key and 7 hash functions (RocksDB default),
// false positive rate is approximately 0.8%
//
// For 100 SSTables with 1M keys each:
// - Bloom filter check: ~100 * 0.008 = 0.8 unnecessary disk reads on average
// - Without Bloom filter: up to 100 disk reads per lookup
//
// Bloom filter memory cost: 10 bits * 100M keys = 125 MB
// This is the memory price of avoiding 99.2% of unnecessary disk reads.
Compaction: The Background Tax
SSTables accumulate as memtables flush. Without compaction, the number of SSTables grows indefinitely, and read performance degrades because more files must be checked.
Compaction merges multiple SSTables into fewer, larger ones, discarding deleted keys and old versions of updated keys. Two strategies dominate.
Size-tiered compaction (Cassandra default): SSTables of similar size are grouped into tiers. When a tier accumulates enough files, they are merged into a single file in the next tier. Write amplification is low (each record is rewritten $O(\log n)$ times across tiers), but space amplification is high because obsolete entries persist across multiple tiers until merged.
Leveled compaction (RocksDB default): SSTables are organized into levels. Level 0 contains recently flushed memtables with overlapping key ranges. Levels 1 and above contain non-overlapping SSTables. When level $L$ exceeds its size limit, one SSTable from level $L$ is merged with all overlapping SSTables in level $L+1$. Write amplification is higher (each record may be rewritten up to $L \times \text{fanout}$ times), but read amplification is lower because each level has non-overlapping key ranges, so at most one SSTable per level needs to be checked.
# Concept: RocksDB compaction statistics via LOG file
# After running for 24 hours with the logistics platform's write load:
grep "compaction_stats" /var/data/tracking-rocksdb/LOG | tail -20
# Level Files Size Score Read(GB) Write(GB) Rn(GB) Rnp1(GB) W-Amp
# L0 3 192MB 0.8 0.0 0.6 0.0 0.0 1.0
# L1 10 640MB 1.2 1.8 1.7 0.6 1.2 2.8
# L2 98 6.1GB 1.0 8.4 8.2 1.7 6.5 4.8
# L3 980 61GB 0.9 0.0 0.0 0.0 0.0 0.0
# Sum 1091 68GB - 10.2 10.5 - - -
#
# W-Amp (write amplification per level):
# L0: 1.0 (flush only)
# L1: 2.8 (each byte rewritten ~3 times during compaction)
# L2: 4.8 (each byte rewritten ~5 times)
# Total write amplification across all levels: ~8.6x
The Observable Consequence
The three-way tradeoff in LSM-trees:
| Metric | Leveled Compaction | Size-Tiered Compaction |
|---|---|---|
| Write amplification | Higher (8-15x) | Lower (3-5x) |
| Read amplification | Lower (1 file per level) | Higher (multiple files per tier) |
| Space amplification | Lower (10-15%) | Higher (50-100%) |
For the logistics platform’s package tracking store with 50 million events:
- Leveled compaction (RocksDB default): ingestion rate is limited by compaction throughput. If compaction cannot keep up, Level 0 accumulates files and reads slow down. Suitable when the read latency SLA is strict (sub-millisecond point lookups).
- Size-tiered compaction: ingestion rate is higher because writes are rewritten fewer times, but the storage footprint is larger and reads must check more files. Suitable when write throughput is the bottleneck and read latency is more relaxed.
The Decision Rule
If your workload has a write/read ratio above 5:1 and storage cost is not the primary constraint, size-tiered compaction maximizes write throughput. If reads must be fast and predictable (sub-millisecond p99 for point lookups), leveled compaction pays the write amplification cost to keep reads bounded.
If your workload is read-heavy with occasional bulk imports, a B-Tree (PostgreSQL) is the simpler choice. The B-Tree’s write amplification from page splits is lower than the LSM-tree’s compaction overhead for moderate write rates, and point lookups are a single tree traversal rather than a multi-level search.
The append-only log from Chapter 1 is visible inside both designs. The B-Tree’s WAL is an append-only log of page modifications. The LSM-tree’s WAL is an append-only log of key-value writes. The memtable flush produces a sorted version of the log. Compaction merges sorted logs. The fundamental pattern, append first and sort later, traces directly back to the file from Chapter 1.