Checkpoint Tuning and Write-Ahead Log Sizing
The Symptom
After tuning the WiredTiger cache and eviction thresholds in CH13-S1, the 60-second latency spikes are reduced but not eliminated. The remaining spikes correlate with disk I/O bursts visible in iostat:
Device r/s w/s rMB/s wMB/s await %util
nvme0n1 120.0 2400.0 8.5 280.0 2.1 85.0
During the checkpoint window, write throughput spikes to 280 MB/s and disk utilization hits 85%. The 2,400 writes per second issued during the checkpoint compete with the 120 reads per second for the same device.
The Cause
The default checkpoint interval is 60 seconds. At 2,000 writes/sec with 340-byte documents plus index updates, approximately 80 MB of dirty data accumulates per checkpoint interval. The checkpoint writes all 80 MB in a burst lasting roughly 2 seconds. On a single NVMe drive with 500 MB/s sequential write throughput, 80 MB would take only 0.16 seconds if it were written sequentially. It is not: the writes are scattered across multiple B-tree data files and index files, producing a semi-random I/O pattern that takes far longer to complete.
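The arithmetic behind these figures can be sketched as follows. The `indexAmplification` factor of 2 is an assumption chosen so the estimate lands near the 80 MB figure (340-byte documents alone account for about 41 MB per minute at this write rate), and both function names are illustrative.

```javascript
// Rough estimate of dirty data accumulated between checkpoints.
// indexAmplification is an assumed multiplier for index updates, not a measured value.
function estimateDirtyMB(writesPerSec, docBytes, intervalSec, indexAmplification = 2) {
  const bytes = writesPerSec * docBytes * intervalSec * indexAmplification;
  return bytes / 1e6; // decimal megabytes
}

// Best-case flush time if the dirty data could be written purely sequentially.
function sequentialFlushSeconds(dirtyMB, seqWriteMBps) {
  return dirtyMB / seqWriteMBps;
}

const dirtyMB = estimateDirtyMB(2000, 340, 60);
const bestCase = sequentialFlushSeconds(dirtyMB, 500);
console.log(dirtyMB.toFixed(1) + " MB, " + bestCase.toFixed(2) + " s if sequential");
```

The gap between this best-case figure and the observed checkpoint duration is a rough measure of how much the scattered I/O pattern costs.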
The journal (write-ahead log) also contributes. WiredTiger records every write operation in the journal before the change reaches the data files at the next checkpoint. Journal files grow between checkpoints and are trimmed after a successful checkpoint. The default journalCompressor is snappy, and the default commitIntervalMs is 100 ms; a write that requests j:true forces an immediate journal sync instead of waiting for the next interval.
The Benchmark
Compare checkpoint behavior at different intervals:
// Monitor checkpoint duration during a k6 load test at 2,000 writes/sec.
// Read serverStatus once so all four counters come from the same snapshot.
const txn = db.serverStatus().wiredTiger.transaction;
printjson({
  running: txn["transaction checkpoint currently running"],
  lastMs: txn["transaction checkpoint most recent time (msecs)"],
  maxMs: txn["transaction checkpoint max time (msecs)"],
  totalMs: txn["transaction checkpoint total time (msecs)"]
});
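One way to condense these counters into a single number per run is the fraction of wall-clock time spent checkpointing, computed from two samples of the total-time counter. This is a minimal sketch: `checkpointDutyCycle` is a hypothetical helper, the sample values are made up, and the two sample objects stand in for `db.serverStatus().wiredTiger.transaction` read at the start and end of a measurement window.

```javascript
const TOTAL_KEY = "transaction checkpoint total time (msecs)";

// Hypothetical helper: fraction of the sampling window spent inside checkpoints.
// Assumes the counters are plain numbers; adapt the conversion if your shell
// returns a 64-bit integer wrapper type instead.
function checkpointDutyCycle(txnBefore, txnAfter, elapsedMs) {
  if (elapsedMs <= 0) throw new RangeError("elapsedMs must be positive");
  const spentMs = Number(txnAfter[TOTAL_KEY]) - Number(txnBefore[TOTAL_KEY]);
  return spentMs / elapsedMs;
}

// Made-up samples: one 1.8 s checkpoint inside a 60 s window.
const sampleBefore = { [TOTAL_KEY]: 10000 };
const sampleAfter = { [TOTAL_KEY]: 11800 };
console.log(checkpointDutyCycle(sampleBefore, sampleAfter, 60000));
```

Comparing this ratio across intervals shows whether shorter checkpoints reduce the total time spent checkpointing or only the length of each burst.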
Results at different checkpoint intervals:
| Checkpoint interval | Dirty data per checkpoint | Checkpoint duration | Write p99 during checkpoint | Recovery time |
|---|---|---|---|---|
| 30s | 40 MB | 0.8s | 45ms | 30s max |
| 60s (default) | 80 MB | 1.8s | 85ms | 60s max |
| 120s | 160 MB | 3.5s | 150ms | 120s max |
| 300s | 400 MB | 8.2s | 280ms | 300s max |
The Fix
For the telemetry platform’s write rate, reduce the checkpoint interval to 30 seconds:
# mongod.conf
storage:
  wiredTiger:
    engineConfig:
      configString: "checkpoint=(wait=30)"
This halves the dirty data accumulated per checkpoint, roughly halving the I/O burst and its latency impact. The trade-off is that checkpoints occur twice as often: the total volume of data written stays about the same, but each checkpoint carries a fixed metadata cost that is now paid twice as frequently.
Tune the journal commit interval to balance durability and throughput:
# mongod.conf
storage:
  journal:
    commitIntervalMs: 100  # Default: 100 ms; a j:true write syncs the journal immediately
For the telemetry platform where individual readings are not critical (a few lost readings are acceptable), keep the default 100ms. This means up to 100ms of writes can be lost on a crash. The journal flushes every 100ms, grouping all writes in that interval into a single disk sync.
For financial data or audit logs, lower commitIntervalMs (for example to 10) to shrink the loss window, and require journal acknowledgement on the writes that must not be lost:
// Critical writes: journal acknowledged
collection.withWriteConcern(WriteConcern.JOURNALED)
.insertOne(auditDocument);
The Proof
After reducing checkpoint interval to 30 seconds:
| Metric | 60s checkpoint | 30s checkpoint |
|---|---|---|
| Dirty data per checkpoint | 80 MB | 40 MB |
| Checkpoint duration | 1.8s | 0.8s |
| Write p99 during checkpoint | 85ms | 45ms |
| Write p99 between checkpoints | 15ms | 15ms |
| Total checkpoint I/O per hour | 4.8 GB | 4.8 GB |
| Checkpoints per hour | 60 | 120 |
| Max recovery time | 60s | 30s |
Total checkpoint I/O per hour is the same (4.8 GB). The work is the same; it is just distributed in smaller batches.
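The hourly totals in the table follow directly from the interval and the per-checkpoint volume. A small sketch of that arithmetic, using decimal units (1 GB = 1000 MB) as the table does; `checkpointIoPerHour` is an illustrative name.

```javascript
// Checkpoints per hour and total checkpoint I/O for a given interval.
function checkpointIoPerHour(intervalSec, dirtyMBPerCheckpoint) {
  const checkpointsPerHour = 3600 / intervalSec;
  const totalGB = (checkpointsPerHour * dirtyMBPerCheckpoint) / 1000;
  return { checkpointsPerHour, totalGB };
}

console.log(checkpointIoPerHour(60, 80)); // { checkpointsPerHour: 60, totalGB: 4.8 }
console.log(checkpointIoPerHour(30, 40)); // { checkpointsPerHour: 120, totalGB: 4.8 }
```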
The Trade-off
Shorter checkpoint intervals mean faster recovery after an unclean shutdown: MongoDB only needs to replay journal entries since the last checkpoint. At 30 seconds, recovery replays at most 30 seconds of writes. At 300 seconds, it replays 5 minutes.
But shorter intervals increase the metadata overhead. Each checkpoint rewrites the root page of every B-tree (every collection and index) that changed since the previous checkpoint. In the worst case, with 50 collections and 200 indexes all receiving writes, that is 250 root page writes per checkpoint. At 120 checkpoints per hour (30s interval), that is 30,000 root page writes per hour. On SSD, this is negligible. On HDD, the seek overhead accumulates.
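The root page count scales linearly with both the number of trees and the checkpoint frequency. A quick sketch, assuming every collection and index is written on every checkpoint; `rootPageWritesPerHour` is an illustrative name.

```javascript
// Upper bound on root page writes per hour: one per B-tree per checkpoint.
function rootPageWritesPerHour(collections, indexes, intervalSec) {
  const trees = collections + indexes;
  const checkpointsPerHour = 3600 / intervalSec;
  return trees * checkpointsPerHour;
}

console.log(rootPageWritesPerHour(50, 200, 30)); // 30000
console.log(rootPageWritesPerHour(50, 200, 60)); // 15000
```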
Journal sizing also matters. WiredTiger pre-allocates journal files in 100 MB chunks. With 2,000 writes/sec, journal throughput is approximately 2 MB/sec (after snappy compression). Each 100 MB journal file fills in 50 seconds. Journal files older than the last checkpoint are deleted. With a 30-second checkpoint interval, only 1-2 journal files exist at any time (100-200 MB). With a 300-second interval, 6-7 files exist (600-700 MB). On storage-constrained deployments, longer checkpoint intervals consume more journal disk space.
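The journal footprint estimate can be reproduced with a few lines of arithmetic. This sketch assumes a steady journal write rate and that one extra file may be live when a checkpoint window straddles a file boundary; `journalFootprint` is a hypothetical helper, not a MongoDB API.

```javascript
// Estimate how many journal files are retained between checkpoints.
function journalFootprint(intervalSec, journalMBps, fileSizeMB = 100) {
  const retainedMB = intervalSec * journalMBps;
  // At least one file always exists, even when little has been written.
  const minFiles = Math.max(1, Math.ceil(retainedMB / fileSizeMB));
  const maxFiles = minFiles + 1; // allow for a window that spans a file boundary
  return {
    minFiles,
    maxFiles,
    minMB: minFiles * fileSizeMB,
    maxMB: maxFiles * fileSizeMB,
  };
}

console.log(journalFootprint(30, 2));  // 1-2 files, 100-200 MB
console.log(journalFootprint(300, 2)); // 6-7 files, 600-700 MB
```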