Checkpoint Tuning and Write-Ahead Log Sizing

The Symptom

After tuning the WiredTiger cache and eviction thresholds in CH13-S1, the 60-second latency spikes are reduced but not eliminated. The remaining spikes correlate with disk I/O bursts visible in iostat:

Device            r/s      w/s    rMB/s    wMB/s  await  %util
nvme0n1         120.0  2400.0      8.5    280.0    2.1   85.0

During the checkpoint window, write throughput spikes to 280 MB/s and disk utilization hits 85%. The 2,400 write operations per second during checkpoint compete with the 120 read operations.

The Cause

The default checkpoint interval is 60 seconds. At 2,000 writes/sec with 340-byte documents plus index updates, approximately 80 MB of dirty data accumulates per checkpoint interval. The checkpoint writes all 80 MB in a burst of roughly 2 seconds. On a single NVMe drive with 500 MB/s sequential write throughput, 80 MB would take only 0.16 seconds if it were written sequentially. But the writes are not sequential: they are scattered across multiple B-tree data files and index files, producing semi-random I/O that takes roughly ten times longer.
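The arithmetic behind these figures can be sketched as follows. This is a rough estimate, not a measurement: the index amplification factor of 2 is an assumption chosen to land near the 80 MB figure (raw document bytes alone come to about 41 MB), and the function names are illustrative.

```javascript
// Back-of-the-envelope estimate of dirty data per checkpoint.
// ASSUMPTION: indexAmplification is a guessed multiplier for index updates,
// not a value reported by MongoDB.
function estimateDirtyMB(writesPerSec, docBytes, intervalSec, indexAmplification) {
  const rawBytes = writesPerSec * docBytes * intervalSec;
  return (rawBytes * indexAmplification) / 1e6;
}

// Time to write dataMB if the drive could stream it sequentially.
function sequentialWriteSeconds(dataMB, throughputMBps) {
  return dataMB / throughputMBps;
}

const rawMB = estimateDirtyMB(2000, 340, 60, 1);   // documents only
const dirtyMB = estimateDirtyMB(2000, 340, 60, 2); // documents plus index updates (assumed 2x)
const idealSec = sequentialWriteSeconds(80, 500);  // best case on the 500 MB/s drive

console.log({ rawMB, dirtyMB, idealSec });
```

The gap between the best-case sequential time and the observed burst duration is the cost of the scattered write pattern.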

The journal (write-ahead log) also contributes. WiredTiger records every write operation in the journal before a checkpoint persists it to the data files. The journal files grow between checkpoints and are trimmed after a successful checkpoint. The default journalCompressor is snappy, and the default commitIntervalMs is 100ms; a write with j:true write concern triggers an immediate journal sync instead of waiting for the interval.

The Benchmark

Compare checkpoint behavior at different intervals:

// Monitor checkpoint duration and I/O
// Run this in the shell during a k6 load test at 2,000 writes/sec

// Checkpoint metrics: read one serverStatus() snapshot so the values are mutually consistent
// (var so the snippet can be re-run in the same shell session)
var txn = db.serverStatus().wiredTiger.transaction;
printjson({
  running: txn["transaction checkpoint currently running"],
  mostRecentMs: txn["transaction checkpoint most recent time (msecs)"],
  maxMs: txn["transaction checkpoint max time (msecs)"],
  totalMs: txn["transaction checkpoint total time (msecs)"],
});
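To collect one duration per checkpoint rather than reading the counters by hand, the same metrics can be sampled in a loop. The following is a minimal sketch: the helper names are hypothetical, and it assumes that the cumulative "total time" counter grows whenever a checkpoint completes (a checkpoint shorter than 1 ms would be missed).

```javascript
// Pull the checkpoint counters out of a serverStatus() result.
// Number() normalizes values the shell may return as 64-bit Long objects.
function checkpointSnapshot(status) {
  const txn = status.wiredTiger.transaction;
  return {
    running: Number(txn["transaction checkpoint currently running"]) === 1,
    lastMs: Number(txn["transaction checkpoint most recent time (msecs)"]),
    maxMs: Number(txn["transaction checkpoint max time (msecs)"]),
    totalMs: Number(txn["transaction checkpoint total time (msecs)"]),
  };
}

// A checkpoint finished between two samples if the cumulative time grew.
function completedBetween(prev, curr) {
  return curr.totalMs > prev.totalMs;
}

// Hypothetical polling loop for the shell (1-second sampling):
//   let prev = checkpointSnapshot(db.serverStatus());
//   while (true) {
//     sleep(1000);
//     const curr = checkpointSnapshot(db.serverStatus());
//     if (completedBetween(prev, curr)) print(`checkpoint took ${curr.lastMs} ms`);
//     prev = curr;
//   }
```

With a sampling period much shorter than the checkpoint interval, at most one checkpoint completes between samples, so the "most recent time" value belongs to the checkpoint that was just detected.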

Results at different checkpoint intervals:

Checkpoint interval	Dirty data per checkpoint	Checkpoint duration	Write p99 during checkpoint	Max journal replay window
30s	40 MB	0.8s	45ms	30s of writes
60s (default)	80 MB	1.8s	85ms	60s of writes
120s	160 MB	3.5s	150ms	120s of writes
300s	400 MB	8.2s	280ms	300s of writes

The Fix

For the telemetry platform’s write rate, reduce the checkpoint interval to 30 seconds:

# mongod.conf
storage:
  syncPeriodSecs: 30    # Checkpoint interval in seconds (default: 60)

This halves the dirty data accumulated per checkpoint, roughly halving the I/O burst and its latency impact. The trade-off is that checkpoints occur twice as often: the total volume of data written stays about the same, but it arrives in smaller, less disruptive bursts, at the cost of some extra per-checkpoint metadata overhead.

Tune the journal commit interval to balance durability and throughput:

# mongod.conf
storage:
  journal:
    commitIntervalMs: 100    # Default: 100ms; j:true writes sync the journal immediately

For the telemetry platform where individual readings are not critical (a few lost readings are acceptable), keep the default 100ms. This means up to 100ms of writes can be lost on a crash. The journal flushes every 100ms, grouping all writes in that interval into a single disk sync.
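The size of that loss window in documents can be estimated with a one-line calculation. This sketch assumes a steady write rate and writes that are not acknowledged with j:true; the function name is illustrative.

```javascript
// Upper bound on writes that sit only in memory between two journal syncs.
// ASSUMPTION: constant write rate, no j:true writes forcing an early sync.
function maxWritesAtRisk(writesPerSec, commitIntervalMs) {
  return Math.ceil((writesPerSec * commitIntervalMs) / 1000);
}

console.log(maxWritesAtRisk(2000, 100)); // default interval
console.log(maxWritesAtRisk(2000, 10));  // tighter interval
```

At 2,000 writes/sec the default interval puts at most 200 readings at risk, which is the figure to weigh against the cost of more frequent disk syncs.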

For financial data or audit logs, set commitIntervalMs: 10 to shrink the loss window, and acknowledge critical writes only after they reach the journal:

// Critical writes: journal acknowledged
collection.withWriteConcern(WriteConcern.JOURNALED)
    .insertOne(auditDocument);

The Proof

After reducing checkpoint interval to 30 seconds:

Metric	60s checkpoint	30s checkpoint
Dirty data per checkpoint	80 MB	40 MB
Checkpoint duration	1.8s	0.8s
Write p99 during checkpoint	85ms	45ms
Write p99 between checkpoints	15ms	15ms
Total checkpoint I/O per hour	4.8 GB	4.8 GB
Checkpoints per hour	60	120
Max journal replay window	60s of writes	30s of writes

Total checkpoint I/O per hour is the same (4.8 GB). The work is the same; it is just distributed in smaller batches.
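The constant hourly total follows directly from the arithmetic. The sketch below assumes that dirty data grows linearly with the interval, using the 80 MB per 60 seconds measured above; in practice, repeated updates to the same pages may be coalesced, so longer intervals can write somewhat less than this model predicts.

```javascript
// Dirty data accumulated over the default 60-second interval (from the benchmark).
const DIRTY_MB_PER_60S = 80;

// ASSUMPTION: dirty data scales linearly with the checkpoint interval.
function checkpointProfile(intervalSec) {
  const dirtyMB = (DIRTY_MB_PER_60S * intervalSec) / 60;
  const checkpointsPerHour = 3600 / intervalSec;
  const totalGBPerHour = (dirtyMB * checkpointsPerHour) / 1000;
  return { intervalSec, dirtyMB, checkpointsPerHour, totalGBPerHour };
}

for (const intervalSec of [30, 60, 120, 300]) {
  console.log(checkpointProfile(intervalSec));
}
```

The interval cancels out of the product of burst size and checkpoint count, which is why only the burst size and frequency change, not the hourly volume.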

The Trade-off

Shorter checkpoint intervals mean faster recovery after an unclean shutdown: MongoDB only needs to replay journal entries since the last checkpoint. At 30 seconds, recovery replays at most 30 seconds of writes. At 300 seconds, it replays 5 minutes.

But shorter intervals increase the metadata overhead. Each checkpoint updates the root page of every B-tree (every collection and index). With 50 collections and 200 indexes, that is 250 root page writes per checkpoint. At 120 checkpoints per hour (30s interval), that is 30,000 root page writes per hour. On SSD, this is negligible. On HDD, the seek overhead accumulates.
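The metadata cost can be estimated the same way. This sketch gives an upper bound, assuming every collection and index is touched by each checkpoint; the function name is illustrative.

```javascript
// Upper bound on root page writes per hour.
// ASSUMPTION: every B-tree (collection or index) is rewritten at each checkpoint.
function rootPageWritesPerHour(collections, indexes, intervalSec) {
  const trees = collections + indexes;
  const checkpointsPerHour = 3600 / intervalSec;
  return trees * checkpointsPerHour;
}

for (const intervalSec of [30, 60, 300]) {
  console.log(intervalSec, rootPageWritesPerHour(50, 200, intervalSec));
}
```

The count grows in inverse proportion to the interval, so halving the interval doubles this overhead.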

Journal sizing also matters. WiredTiger pre-allocates journal files in 100 MB chunks. With 2,000 writes/sec, journal throughput is approximately 2 MB/sec (after snappy compression). Each 100 MB journal file fills in 50 seconds. Journal files older than the last checkpoint are deleted. With a 30-second checkpoint interval, only 1-2 journal files exist at any time (100-200 MB). With a 300-second interval, 6-7 files exist (600-700 MB). On storage-constrained deployments, longer checkpoint intervals consume more journal disk space.
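The file counts above can be reproduced with a small model. It assumes a constant journal rate and counts one extra file for the one currently being written; both the model and the function name are illustrative, not a description of WiredTiger internals.

```javascript
// Journal file size used in the text above.
const JOURNAL_FILE_MB = 100;

// ASSUMPTION: the journal written since the last checkpoint must be retained,
// plus possibly one more file that is currently being filled.
function journalFootprint(checkpointIntervalSec, journalMBPerSec) {
  const retainedMB = checkpointIntervalSec * journalMBPerSec;
  const minFiles = Math.ceil(retainedMB / JOURNAL_FILE_MB);
  const maxFiles = minFiles + 1;
  return {
    fileFillSec: JOURNAL_FILE_MB / journalMBPerSec,
    minFiles,
    maxFiles,
    maxMB: maxFiles * JOURNAL_FILE_MB,
  };
}

console.log(journalFootprint(30, 2));
console.log(journalFootprint(300, 2));
```

Plugging in a different journal rate or checkpoint interval gives a quick disk budget for the journal directory.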