Diagnosing Cache Pressure and Application Thread Eviction
The Symptom
The telemetry platform’s p99 write latency shows periodic spikes to 200ms every 60 seconds. The spikes last 2-5 seconds. Between spikes, p99 is a stable 15ms. The pattern is clock-like in its regularity.
The Cause
The 60-second periodicity matches the checkpoint interval. During each checkpoint, WiredTiger writes dirty pages to disk. The I/O burst from checkpointing competes with normal operations. If the dirty data volume is large enough, the eviction system cannot keep up, and application threads are drafted into eviction duty.
Checking the metrics:
// Capture WiredTiger cache metrics at the start and end of a 5-minute window (mongosh)
var start = db.serverStatus().wiredTiger.cache;
sleep(300000); // 5 minutes, in milliseconds
var end = db.serverStatus().wiredTiger.cache;
print("App thread evictions: " + (end["pages evicted by application threads"] - start["pages evicted by application threads"]));
print("Dirty bytes range: " + start["tracked dirty bytes in the cache"] + " -> " + end["tracked dirty bytes in the cache"]);
var pct = (100 * end["bytes currently in the cache"] / end["maximum bytes configured"]).toFixed(1);
print("Cache bytes: " + end["bytes currently in the cache"] + " / " + end["maximum bytes configured"] + " (" + pct + "%)");
Output:
App thread evictions: 2300
Dirty bytes range: 45000000 -> 850000000
Cache bytes: 14800000000 / 15500000000 (95.5%)
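Cache utilization at 95.5% means the cache has crossed the eviction_trigger threshold (95% by default), the point at which application threads are drafted into eviction. 2,300 app thread evictions in 5 minutes (300 seconds) averages about 7.7 per second, and each one is eviction work done by a thread that should be serving a request.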
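The arithmetic behind that reading can be sketched as a small helper. This is a minimal illustration, not part of the platform's tooling: `summarizeCachePressure` is a hypothetical name, and the starting eviction counter of 0 is an assumption made so that the difference equals the 2,300 reported above. The statistic names are the ones used in the capture script.

```javascript
// Summarize two snapshots of serverStatus().wiredTiger.cache taken elapsedSec apart.
// (Hypothetical helper; works in mongosh or Node.)
function summarizeCachePressure(start, end, elapsedSec) {
  const evicted = end["pages evicted by application threads"] -
                  start["pages evicted by application threads"];
  const max = end["maximum bytes configured"];
  return {
    appEvictionsPerSec: evicted / elapsedSec,
    cacheUtilizationPct: (end["bytes currently in the cache"] / max) * 100,
    dirtyPct: (end["tracked dirty bytes in the cache"] / max) * 100,
  };
}

// Values from the output above; the start counter of 0 is assumed.
const summary = summarizeCachePressure(
  { "pages evicted by application threads": 0 },
  {
    "pages evicted by application threads": 2300,
    "tracked dirty bytes in the cache": 850000000,
    "bytes currently in the cache": 14800000000,
    "maximum bytes configured": 15500000000,
  },
  300
);
console.log(summary.appEvictionsPerSec.toFixed(2));  // 7.67
console.log(summary.cacheUtilizationPct.toFixed(1)); // 95.5
console.log(summary.dirtyPct.toFixed(1));            // 5.5
```

The dirty ratio is worth computing alongside total utilization: 850 MB of dirty data is about 5.5% of a 15.5 GB cache, which is above WiredTiger's default eviction_dirty_target of 5%.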
The Benchmark
// k6 test measuring latency correlation with checkpoints
import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Custom trend metric; the second argument marks the values as times
const writeLatency = new Trend('write_latency', true);

export const options = {
  scenarios: {
    steady_writes: {
      executor: 'constant-arrival-rate',
      rate: 2000,
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 100,
      maxVUs: 200,
    },
  },
};

export default function() {
  const startTime = Date.now();
  const res = http.post(`${__ENV.BASE_URL}/api/telemetry/ingest`, JSON.stringify({
    sensorId: `sensor-${String(Math.floor(Math.random() * 10000)).padStart(5, '0')}`,
    timestamp: new Date().toISOString(),
    temperature: 20 + Math.random() * 15,
    humidity: 40 + Math.random() * 30,
  }), { headers: { 'Content-Type': 'application/json' } });
  writeLatency.add(Date.now() - startTime);
}
Results with 15.5 GB WiredTiger cache and a working set of 18 GB:
| Time window | p50 | p95 | p99 | App thread evictions/sec |
|---|---|---|---|---|
| 0-10s (post-checkpoint) | 3ms | 8ms | 18ms | 0 |
| 10-40s (normal) | 3ms | 9ms | 20ms | 0.5 |
| 40-55s (dirty accumulation) | 4ms | 15ms | 55ms | 3.2 |
| 55-65s (checkpoint + eviction) | 8ms | 45ms | 200ms | 12.8 |
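A table like this one comes from grouping individual latency samples by where they fall in the checkpoint cycle. The sketch below shows one way to do that grouping. It assumes per-request samples are available as `{ tSec, latencyMs }` objects, with `tSec` measured in seconds from the start of a checkpoint. That sample shape, the fixed 10-second buckets, and the function names are illustrative assumptions, not something the k6 script produces on its own.

```javascript
// Nearest-rank percentile of a numeric array (p in 0..100).
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Group samples by their position inside the checkpoint cycle and
// report p50/p99 per bucket, ordered by bucket start.
function latencyByCyclePhase(samples, cycleSec = 60, bucketSec = 10) {
  const buckets = new Map();
  for (const { tSec, latencyMs } of samples) {
    const phase = ((tSec % cycleSec) + cycleSec) % cycleSec; // fold onto one cycle
    const start = Math.floor(phase / bucketSec) * bucketSec;
    if (!buckets.has(start)) buckets.set(start, []);
    buckets.get(start).push(latencyMs);
  }
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([start, values]) => ({
      window: `${start}-${start + bucketSec}s`,
      count: values.length,
      p50: percentile(values, 50),
      p99: percentile(values, 99),
    }));
}
```

Folding every sample onto a single 60-second cycle is what makes the pattern visible: a spike that recurs at the same phase accumulates in one bucket instead of being averaged away across the whole run.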
The Fix
Two adjustments:
1. Size the cache to fit the working set.
The working set is the data actively accessed by queries. For the telemetry platform, this is the last 24 hours of readings plus all indexes. Calculate it:
// Working set estimation: last 24 hours of readings (uncompressed size) plus all indexes
var stats = db.readings.stats();
var recentDocs = db.readings.countDocuments({ ts: { $gte: new Date(Date.now() - 86400000) } });
var readingsLast24h = stats.avgObjSize * recentDocs;
var totalIndexSize = stats.totalIndexSize;
print("Working set: " + ((readingsLast24h + totalIndexSize) / (1024*1024*1024)).toFixed(1) + " GB");
If the working set is 18 GB and the cache is 15.5 GB, increase the cache. On a 48 GB server:
# mongod.conf
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 24
2. Tune eviction thresholds for write-heavy workloads.
# mongod.conf - adjusted eviction thresholds
storage:
  wiredTiger:
    engineConfig:
      configString: "eviction_dirty_target=2,eviction_dirty_trigger=10,eviction=(threads_min=4,threads_max=8)"
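Lowering eviction_dirty_target from 5% to 2% starts background eviction of dirty pages earlier, so less dirty data is left for each checkpoint to write and the I/O is spread over time instead of arriving in a burst. Increasing threads_min from 1 to 4 provides more background eviction capacity. The same string also sets eviction_dirty_trigger to 10% (WiredTiger's default is 20%), so application threads are pulled into dirty-page eviction at a lower dirty ratio if the background threads fall behind.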
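The percentages are easier to reason about as absolute sizes. The following is a minimal sketch with a hypothetical helper, `evictionThresholdsGB`. The 5% and 20% figures used for comparison are WiredTiger's documented defaults for eviction_dirty_target and eviction_dirty_trigger, and the 24 GB cache size comes from the first adjustment.

```javascript
// Convert dirty-eviction thresholds from percent of cache to GB.
// (Hypothetical helper for illustration.)
function evictionThresholdsGB(cacheGB, { dirtyTargetPct, dirtyTriggerPct }) {
  return {
    dirtyTargetGB: (cacheGB * dirtyTargetPct) / 100,
    dirtyTriggerGB: (cacheGB * dirtyTriggerPct) / 100,
  };
}

const defaults = evictionThresholdsGB(24, { dirtyTargetPct: 5, dirtyTriggerPct: 20 });
const tuned = evictionThresholdsGB(24, { dirtyTargetPct: 2, dirtyTriggerPct: 10 });

console.log(defaults); // { dirtyTargetGB: 1.2, dirtyTriggerGB: 4.8 }
console.log(tuned);    // { dirtyTargetGB: 0.48, dirtyTriggerGB: 2.4 }
```

With the tuned values, background eviction begins once roughly 0.48 GB of the 24 GB cache is dirty instead of waiting for 1.2 GB.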
The Proof
After increasing cache to 24 GB and tuning eviction:
| Metric | Before (15.5 GB cache) | After (24 GB, tuned eviction) |
|---|---|---|
| Cache utilization | 95.5% | 75% |
| App thread evictions/sec (peak) | 12.8 | 0.1 |
| Write p99 during checkpoint | 200ms | 25ms |
| Write p99 between checkpoints | 20ms | 15ms |
| Dirty bytes at checkpoint time | 850 MB | 120 MB |
The Trade-off
Allocating 24 GB to WiredTiger cache leaves 24 GB for the operating system, filesystem cache, connections, and applications. If the server runs other processes (monitoring agents, log collectors), available memory drops further. On containerized deployments, the WiredTiger cache must be sized explicitly to stay within the container’s memory limit (covered in CH22).
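The memory budget can be checked before changing the configuration. This is a rough sketch with hypothetical helper names. The default-cache formula, the larger of 50% of (RAM minus 1 GB) or 256 MB, is MongoDB's documented default. Everything else is plain arithmetic on the figures used in this section.

```javascript
// MongoDB's documented default WiredTiger cache size:
// the larger of 50% of (RAM - 1 GB) or 256 MB.
function defaultCacheGB(ramGB) {
  return Math.max(0.5 * (ramGB - 1), 0.25);
}

// Compare a proposed cache size against the working set and total RAM.
// (Hypothetical helper for illustration.)
function cacheSizingReport(ramGB, cacheGB, workingSetGB) {
  return {
    defaultCacheGB: defaultCacheGB(ramGB),
    fitsWorkingSet: cacheGB >= workingSetGB,
    headroomGB: cacheGB - workingSetGB,
    leftForOsGB: ramGB - cacheGB,
  };
}

console.log(cacheSizingReport(48, 24, 18));
// { defaultCacheGB: 23.5, fitsWorkingSet: true, headroomGB: 6, leftForOsGB: 24 }
```

On a 48 GB server the default would already be 23.5 GB, so setting 24 GB explicitly is only a small step above it. The value of the check is in the `leftForOsGB` figure, which shrinks further once other processes on the host are counted.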
Lowering eviction_dirty_target to 2% means background eviction runs more frequently, consuming CPU cycles. On a 4-core server, continuous background eviction may compete with query processing. On a 16-core server with ample CPU, the impact is negligible.