Storage Infrastructure: Disk, Filesystem, and RAID
MongoDB’s performance is ultimately bounded by storage I/O. The WiredTiger cache absorbs read hot spots, but writes (checkpoint flushes, journal commits, compaction) always hit disk. The choice of disk technology, filesystem, and kernel I/O settings determines the floor below which latency cannot go.
Disk Technology Comparison
| Metric | HDD (7200 RPM) | SATA SSD | NVMe SSD |
|---|---|---|---|
| Random read IOPS | 150 | 50,000 | 500,000+ |
| Random write IOPS | 150 | 30,000 | 200,000+ |
| Sequential read | 200 MB/s | 550 MB/s | 3,500 MB/s |
| Sequential write | 200 MB/s | 520 MB/s | 3,000 MB/s |
| Latency (random 4K read) | 8-12ms | 0.1ms | 0.02ms |
For the telemetry platform at 50,000 writes/second, each write generates approximately 2-3 I/O operations (data + index + journal). Total I/O requirement: 100,000-150,000 IOPS.
- HDD: About 150 IOPS sustained per drive, so it cannot get near 150,000 IOPS. Would need roughly 1,000 drives in RAID.
- SATA SSD: 30,000 write IOPS per drive. Would need 5 drives at the 150,000 IOPS upper bound.
- NVMe: 200,000+ write IOPS per drive. A single drive is sufficient.
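The drive counts above are a ceiling division of the total IOPS requirement by the per-drive write IOPS. A minimal sketch of that arithmetic follows; the function name `drives_needed` is hypothetical, and it uses the worst-case figure of 3 I/Os per write from the text.

```shell
# Hypothetical helper: estimate how many drives a write workload needs.
# usage: drives_needed <writes_per_sec> <ios_per_write> <iops_per_drive>
drives_needed() {
  local writes="$1" ios_per_write="$2" per_drive="$3"
  local total=$(( writes * ios_per_write ))
  # Integer ceiling division: round up to whole drives.
  echo $(( (total + per_drive - 1) / per_drive ))
}

# Worst case from the text: 50,000 writes/s x 3 I/Os = 150,000 IOPS.
drives_needed 50000 3 150      # HDD      -> 1000
drives_needed 50000 3 30000    # SATA SSD -> 5
drives_needed 50000 3 200000   # NVMe     -> 1
```

This ignores RAID write penalties and redundancy, so real deployments would likely need more drives than the raw count suggests.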
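# Benchmark disk I/O with fio (run before deploying MongoDB).
# --direct=1 bypasses the page cache; --ioengine=libaio makes --iodepth effective.
# --directory assumes the target volume is mounted at /data/db.
fio --name=random-write --rw=randwrite --bs=4k --size=4G \
--directory=/data/db --direct=1 --ioengine=libaio \
--numjobs=16 --iodepth=32 --runtime=60 --group_reporting
# Illustrative output for NVMe:
# write: IOPS=280k, BW=1094MiB/s, avg latency=1.8ms
# Illustrative output for gp3 (3000 IOPS):
# write: IOPS=3000, BW=11.7MiB/s, avg latency=171ms
# Average latency here is roughly in-flight I/Os (16 x 32 = 512) divided by IOPS.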
Filesystem Selection
MongoDB recommends XFS for WiredTiger data volumes on Linux. The performance difference against ext4 is measurable:
| Workload | XFS | ext4 | Difference |
|---|---|---|---|
| Insert throughput (50K docs/s) | 52,000/s | 45,000/s | +16% |
| Checkpoint duration (1.5 GB flush) | 4.2s | 5.8s | -28% |
| Compaction speed | 180 MB/s | 140 MB/s | +29% |
| fallocate (journal pre-allocation) | 0.01s | 0.4s | 40x faster |
XFS advantages for MongoDB:
- Extent-based allocation: Large contiguous writes (checkpoints) are more efficient.
- Concurrent I/O: XFS handles parallel I/O from multiple WiredTiger threads better.
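- fallocate support: Journal file pre-allocation is near-instant on XFS (0.01s) vs about 0.4s on ext4.
- No double journaling: XFS journals metadata only, leaving data durability to the WiredTiger journal. Mount with noatime,noexec,nodev for best performance; noatime avoids an inode update on every read.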
# Format and mount XFS for MongoDB (mkfs.xfs erases existing data on the device)
mkfs.xfs /dev/nvme1n1
mkdir -p /data/db
mount -o noatime,noexec,nodev /dev/nvme1n1 /data/db
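After mounting, it can be worth confirming that the data directory really is on XFS with the intended options. The sketch below is one way to do that; `has_mount_opts` and `check_mount` are hypothetical helper names, and it assumes `findmnt` from util-linux is installed and that /data/db is the dbPath.

```shell
# Hypothetical helper: check that a comma-separated mount-option string
# contains every option passed after it.
# usage: has_mount_opts "<options>" opt1 [opt2 ...]
has_mount_opts() {
  local opts=",$1,"
  shift
  local want
  for want in "$@"; do
    case "$opts" in
      *",$want,"*) ;;                          # option present, keep checking
      *) echo "missing: $want"; return 1 ;;
    esac
  done
  echo "ok"
}

# Hypothetical wrapper: report the filesystem type and verify the options
# for the mount that contains the given path (assumes findmnt is available).
check_mount() {
  local path="$1" fstype opts
  fstype=$(findmnt -no FSTYPE --target "$path") || return 1
  opts=$(findmnt -no OPTIONS --target "$path") || return 1
  echo "filesystem: $fstype"
  has_mount_opts "$opts" noatime noexec nodev
}

# Example: check_mount /data/db
```

Matching on `,option,` rather than a bare substring keeps `relatime` from being mistaken for `noatime`.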
Readahead Tuning
Linux reads ahead of the current position to pre-fetch data that will likely be needed next. The default readahead is 128 KB (256 sectors). For MongoDB, this is too high for random read workloads and wastes I/O bandwidth.
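# Check current readahead (reported in 512-byte sectors)
blockdev --getra /dev/nvme1n1
# 256 (sectors) = 128 KB
# Set readahead to 16 KB (32 sectors) for MongoDB; this does not survive a reboot
blockdev --setra 32 /dev/nvme1n1
# Make persistent via udev rule (limited to whole block devices, not controllers or partitions)
echo 'ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme*", ATTR{queue/read_ahead_kb}="16"' \
> /etc/udev/rules.d/99-mongodb.rules
# Apply the rule without rebooting
udevadm control --reload-rules && udevadm trigger --subsystem-match=block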
MongoDB’s WiredTiger manages its own read patterns. Large readahead wastes I/O on data that WiredTiger will not use (because it reads specific pages, not sequential ranges). Setting readahead to 8-32 sectors (4-16 KB) reduces wasted I/O.
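The two interfaces use different units: blockdev works in 512-byte sectors, while sysfs and the udev rule work in KB. The sketch below shows the conversion and one way to list the current value for every device; both function names are hypothetical, and `show_readahead` assumes the usual Linux sysfs layout under /sys/block.

```shell
# Convert a readahead value in 512-byte sectors (blockdev --getra) to KB.
ra_sectors_to_kb() {
  echo $(( $1 * 512 / 1024 ))
}

# Print the readahead setting for every block device that exposes a
# queue directory in sysfs. Prints nothing if no such files are readable.
show_readahead() {
  local f dev
  for f in /sys/block/*/queue/read_ahead_kb; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}     # strip the leading path
    dev=${dev%%/*}           # keep only the device name
    echo "$dev: $(cat "$f") KB"
  done
}

ra_sectors_to_kb 256   # Linux default -> 128
ra_sectors_to_kb 32    # value set with blockdev --setra -> 16
```

Running `show_readahead` after a reboot is a quick way to confirm that the udev rule was applied.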