Compaction Strategies and Their Observable Cost
The Black Box
RocksDB handles thousands of writes per second without complaint. Then, seemingly at random, write latency spikes from microseconds to hundreds of milliseconds. The application has not changed. The hardware is not failing. RocksDB is compacting, and the compaction is competing with the write path for I/O and CPU.
The Mechanism
Compaction in an LSM-tree serves the same purpose as checkpoints in PostgreSQL: it bounds the growth of accumulated state. Without PostgreSQL checkpoints, the WAL grows forever. Without LSM-tree compaction, the number of SSTables grows forever, and reads degrade because more files must be searched.
Compaction merges multiple SSTables into fewer, larger ones. During a merge:
- Read all key-value pairs from the input SSTables.
- Merge-sort them by key.
- For duplicate keys, keep only the most recent version.
- Write the merged, deduplicated result to a new SSTable.
- Delete the input SSTables.
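The merge steps above can be sketched in a few lines of Java. This is a toy model under stated assumptions, not RocksDB's implementation: the class and method names are hypothetical, each SSTable is represented as an in-memory sorted map, and the inputs are passed oldest first so that later tables win. A real engine streams the inputs through a k-way merge instead of holding everything in memory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of a compaction merge; not how RocksDB implements it. */
final class CompactionMergeSketch {

    /**
     * Merges SSTables given oldest first. Each table maps key -> value.
     * The TreeMap keeps keys sorted, standing in for the merge-sort step.
     */
    static SortedMap<String, String> compact(List<SortedMap<String, String>> tablesOldestFirst) {
        SortedMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> table : tablesOldestFirst) {
            // A newer table overwrites older versions of the same key.
            merged.putAll(table);
        }
        return merged; // the single output "SSTable"; the inputs can now be deleted
    }

    public static void main(String[] args) {
        SortedMap<String, String> older = new TreeMap<>();
        older.put("PKG-001", "SCANNED");
        older.put("PKG-002", "SCANNED");
        SortedMap<String, String> newer = new TreeMap<>();
        newer.put("PKG-001", "DELIVERED");

        // prints {PKG-001=DELIVERED, PKG-002=SCANNED}
        System.out.println(compact(Arrays.asList(older, newer)));
    }
}
```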
The I/O cost is the sum of all input SSTables read plus the output SSTable written. For a leveled compaction that merges one L1 SSTable (64MB) with 10 overlapping L2 SSTables (640MB total), the compaction reads 704MB and writes up to 704MB (less when obsolete versions are dropped). That is roughly 1.4GB of I/O for a single compaction operation.
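The arithmetic can be written down as a small estimator. This is a sketch with hypothetical names, and it assumes the worst case in which no entries are dropped, so the output is as large as the input.

```java
/** Back-of-the-envelope I/O estimate for one leveled compaction. */
final class CompactionIoEstimate {

    /** Megabytes read: the upper-level file plus every overlapping lower-level file. */
    static long readMb(long upperFileMb, int overlappingFiles, long lowerFileMb) {
        return upperFileMb + overlappingFiles * lowerFileMb;
    }

    /** Upper bound on total I/O: assumes nothing is dropped, so bytes written equal bytes read. */
    static long totalIoMb(long upperFileMb, int overlappingFiles, long lowerFileMb) {
        long read = readMb(upperFileMb, overlappingFiles, lowerFileMb);
        long written = read;
        return read + written;
    }

    public static void main(String[] args) {
        System.out.println(readMb(64, 10, 64));    // 704
        System.out.println(totalIoMb(64, 10, 64)); // 1408
    }
}
```

Only 64MB of that traffic is new data moving down a level; the rest is existing L2 data being rewritten.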
Write Stalls
RocksDB limits the number of Level 0 SSTables (default: level0_slowdown_writes_trigger = 20, level0_stop_writes_trigger = 36). When Level 0 files accumulate because compaction cannot keep up:
- At 20 L0 files: RocksDB throttles writes, artificially adding latency to each write.
- At 36 L0 files: RocksDB stops accepting writes entirely until compaction clears the backlog.
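The two thresholds amount to a three-state classification of the L0 file count. The sketch below models only that rule, using the default trigger values quoted above; the class and enum names are hypothetical, and RocksDB can also stall writes for other reasons that are not modeled here.

```java
/** Models the L0 file-count thresholds only; not RocksDB's write controller. */
final class WriteStallSketch {

    enum WriteState { NORMAL, THROTTLED, STOPPED }

    static final int SLOWDOWN_TRIGGER = 20; // level0_slowdown_writes_trigger default
    static final int STOP_TRIGGER = 36;     // level0_stop_writes_trigger default

    /** Maps the current number of Level 0 files to the state the writer observes. */
    static WriteState classify(int level0Files) {
        if (level0Files >= STOP_TRIGGER) {
            return WriteState.STOPPED;
        }
        if (level0Files >= SLOWDOWN_TRIGGER) {
            return WriteState.THROTTLED;
        }
        return WriteState.NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(classify(12)); // NORMAL
        System.out.println(classify(22)); // THROTTLED
        System.out.println(classify(36)); // STOPPED
    }
}
```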
# Concept: detecting write stalls in RocksDB logs
grep -i "stall" /var/data/tracking-rocksdb/LOG
# 2024-11-15T14:22:38.123456 [WARN] Stalling writes because we have 22 level-0 files
# rate_limiter: 16 MB/s
# 2024-11-15T14:22:41.456789 [WARN] Stopping writes because we have 36 level-0 files
# 2024-11-15T14:22:58.789012 [INFO] Resuming writes after compaction cleared level-0 to 12 files
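From the first slowdown warning to the resume message, the stall lasted roughly 20 seconds, about 17 of them as a full stop. During the slowdown, writes to the logistics platform’s package tracking store were delayed; during the stop, every write either blocked or timed out. From the application’s perspective, the database stopped.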
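The durations can be read straight off the log timestamps. The helper below is a minimal sketch with hypothetical names; it assumes each line starts with an ISO-8601 timestamp followed by a space, as in the illustrative lines above. Real RocksDB LOG files may format timestamps differently, in which case the parsing step would need to change.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Computes elapsed time between two log lines that start with an ISO-8601 timestamp. */
final class StallDurationSketch {

    /** Parses the leading timestamp of a log line (everything before the first space). */
    static LocalDateTime timestampOf(String logLine) {
        String token = logLine.substring(0, logLine.indexOf(' '));
        return LocalDateTime.parse(token);
    }

    /** Milliseconds elapsed between two log lines, truncated to whole milliseconds. */
    static long millisBetween(String startLine, String endLine) {
        return Duration.between(timestampOf(startLine), timestampOf(endLine)).toMillis();
    }

    public static void main(String[] args) {
        String slowdown = "2024-11-15T14:22:38.123456 [WARN] Stalling writes because we have 22 level-0 files";
        String stop = "2024-11-15T14:22:41.456789 [WARN] Stopping writes because we have 36 level-0 files";
        String resume = "2024-11-15T14:22:58.789012 [INFO] Resuming writes after compaction cleared level-0 to 12 files";

        System.out.println(millisBetween(slowdown, resume)); // 20665
        System.out.println(millisBetween(stop, resume));     // 17332
    }
}
```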
The Observable Consequence
Compaction has three costs that the application observes:
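I/O bandwidth. Compaction reads and writes SSTables. On a machine with a single SSD shared between the database and the application, compaction I/O competes with read queries. A compaction that moves 1.4GB of combined read and write I/O at 400MB/s takes 3.5 seconds. For that window, queries share the device with the compaction, so query I/O throughput drops, by roughly half if the two loads split the bandwidth evenly.
CPU. Compaction decompresses input blocks, merge-sorts keys, and recompresses output blocks. With a fast codec such as Snappy (RocksDB's default when it is compiled in) or LZ4, compression overhead is low. With Zstd (better ratio), CPU usage during compaction is noticeably higher; 30-50% is a reasonable working figure, but the real number depends on the data and the compression level.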
Space. During compaction, both the input and output SSTables exist on disk simultaneously. The temporary space requirement is approximately equal to the size of the input SSTables. A system using 100GB of data needs at least 10-15GB of free space for compaction to proceed.
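Two of these costs reduce to simple arithmetic. The sketch below uses hypothetical names and assumes a sustained bandwidth figure and an output no larger than its inputs; it is an estimate, not a measurement.

```java
/** Rough time and space estimates for a single compaction. */
final class CompactionCostSketch {

    /** Seconds needed to move the given I/O volume at a sustained device bandwidth. */
    static double durationSeconds(double totalIoMb, double bandwidthMbPerSec) {
        return totalIoMb / bandwidthMbPerSec;
    }

    /** Peak extra disk space: the output is at most as large as the inputs it replaces. */
    static long temporarySpaceMb(long... inputFileSizesMb) {
        long total = 0;
        for (long size : inputFileSizesMb) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(durationSeconds(1400, 400)); // 3.5
        System.out.println(temporarySpaceMb(64, 640));  // 704
    }
}
```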
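// Concept: RocksDB rate limiter to bound compaction I/O impact
// Prevent compaction from saturating the disk and starving reads
import org.rocksdb.Options;
import org.rocksdb.RateLimiter;
import org.rocksdb.RocksDB;

RocksDB.loadLibrary(); // load the native library before creating RocksDB objects
Options options = new Options();
// Limit background flush and compaction writes to 100 MB/s.
// Assuming the NVMe drive sustains about 3200 MB/s, this leaves
// roughly 3100 MB/s for application reads and writes.
RateLimiter rateLimiter = new RateLimiter(
    100 * 1024 * 1024, // rateBytesPerSecond: 100 MB/s
    100_000,           // refillPeriodMicros: 100,000 microseconds = 100 ms
    10                 // fairness factor
);
options.setRateLimiter(rateLimiter);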
Kafka Log Compaction
Kafka uses the term “compaction” differently. Kafka log compaction retains only the latest message for each key in a topic, discarding older messages with the same key.
This is not the same as LSM-tree compaction. LSM-tree compaction merges sorted files and discards obsolete entries within the storage engine. Kafka log compaction is a topic-level retention policy that removes old messages from the partition log.
# Concept: Kafka log compaction configuration
# The package-tracking-compact topic retains only the latest event per package ID
# Topic configs are altered with kafka-configs.sh; localhost:9092 is a placeholder broker address
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
--entity-type topics --entity-name package-tracking-compact \
--add-config cleanup.policy=compact,min.cleanable.dirty.ratio=0.5,delete.retention.ms=86400000
# cleanup.policy=compact: keep only latest value per key
# min.cleanable.dirty.ratio=0.5: compact once at least 50% of the log is "dirty" (not yet cleaned)
# delete.retention.ms: keep delete tombstones for 24 hours before removal
# Before compaction:
# offset 0: PKG-001 -> SCANNED
# offset 1: PKG-002 -> SCANNED
# offset 2: PKG-001 -> IN_TRANSIT
# offset 3: PKG-001 -> DELIVERED
# After compaction:
# offset 1: PKG-002 -> SCANNED
# offset 3: PKG-001 -> DELIVERED
The use case for Kafka log compaction is maintaining a changelog. The logistics platform’s package-tracking-compact topic acts as a materialized view of the latest status per package. New consumers can read the compacted topic to bootstrap their state without replaying the entire event history.
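The before and after listing above can be reproduced with a two-pass sketch: first record the latest offset per key, then copy only the messages that match it. This is a toy model with hypothetical class names, not Kafka's log cleaner; it ignores segments, the active segment that is never compacted, and delete tombstones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of log compaction on one partition. */
final class LogCompactionSketch {

    static final class Message {
        final long offset;
        final String key;
        final String value;

        Message(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "offset " + offset + ": " + key + " -> " + value;
        }
    }

    /** Keeps only the highest-offset message per key; surviving offsets are unchanged. */
    static List<Message> compact(List<Message> log) {
        // Pass 1: record the latest offset seen for each key.
        Map<String, Long> latestOffset = new HashMap<>();
        for (Message m : log) {
            latestOffset.merge(m.key, m.offset, Math::max);
        }
        // Pass 2: copy only the messages that are the latest for their key.
        List<Message> compacted = new ArrayList<>();
        for (Message m : log) {
            if (latestOffset.get(m.key) == m.offset) {
                compacted.add(m);
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<Message> log = Arrays.asList(
            new Message(0, "PKG-001", "SCANNED"),
            new Message(1, "PKG-002", "SCANNED"),
            new Message(2, "PKG-001", "IN_TRANSIT"),
            new Message(3, "PKG-001", "DELIVERED"));
        // prints the two surviving messages, at offsets 1 and 3
        compact(log).forEach(System.out::println);
    }
}
```

Note that offsets are preserved rather than renumbered, which is why a consumer reading the compacted topic sees gaps.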
The Decision Rule
If RocksDB write stalls appear in your logs, either increase level0_slowdown_writes_trigger to tolerate more L0 files (accepting higher read latency), increase the compaction rate limiter to let compaction finish faster (accepting I/O contention), or provision faster storage.
If Kafka partition size is growing unboundedly and consumers only need the latest value per key, enable log compaction on the topic. If consumers need the full event history (audit trail, event sourcing), use cleanup.policy=delete with a time-based retention instead.
The fundamental parallel: PostgreSQL checkpoints, RocksDB compaction, and Kafka log compaction all exist for the same reason. The append-only write path from Chapter 1 creates unbounded state. Some background process must periodically reconcile, merge, or discard old data to keep the system operable. The mechanism differs. The motivation is identical.