Storage Infrastructure

MongoDB’s performance is ultimately bounded by storage I/O. The WiredTiger cache absorbs read hot spots, but writes (checkpoint flushes, journal commits, compaction) always hit disk. The choice of disk technology, filesystem, and kernel I/O settings determines the floor below which latency cannot go.

Storage stack diagram. Shows application -> MongoDB -> WiredTiger -> filesystem (XFS/ext4) -> I/O scheduler -> disk hardware (HDD/SSD/NVMe). Marks latency at each layer. Shows random vs sequential I/O patterns at each layer. Highlights that checkpoint writes are sequential and journal writes are sequential but latency-sensitive.

Disk Technology Comparison

Metric	HDD (7200 RPM)	SATA SSD	NVMe SSD
Random read IOPS	150	50,000	500,000+
Random write IOPS	150	30,000	200,000+
Sequential read	200 MB/s	550 MB/s	3,500 MB/s
Sequential write	200 MB/s	520 MB/s	3,000 MB/s
Latency (random 4K read)	8-12ms	0.1ms	0.02ms

For the telemetry platform at 50,000 writes/second, each write generates approximately 2-3 I/O operations (data + index + journal). Total I/O requirement: 50,000 × 2-3 = 100,000-150,000 IOPS. The drive counts below are sized against the 150,000 IOPS upper bound.

HDD: Cannot sustain 150,000 IOPS. Maximum sustained: 150 IOPS per drive. Would need 1,000 drives in RAID.
SATA SSD: 30,000 write IOPS per drive. Would need 5 drives.
NVMe: 200,000+ write IOPS per drive. A single drive is sufficient.
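
The drive counts above are a ceiling division of the required IOPS by each drive's sustained write IOPS. A minimal sketch of that arithmetic, assuming the upper bound of 3 I/O operations per write; `drives_needed` is a hypothetical helper name, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of drives needed to reach a target IOPS,
# rounded up (ceiling division using integer arithmetic).
drives_needed() {
  local required_iops=$1 per_drive_iops=$2
  echo $(( (required_iops + per_drive_iops - 1) / per_drive_iops ))
}

writes_per_sec=50000
ios_per_write=3   # assumed upper bound: data + index + journal
required=$(( writes_per_sec * ios_per_write ))

echo "required IOPS:   $required"
echo "HDD drives:      $(drives_needed "$required" 150)"
echo "SATA SSD drives: $(drives_needed "$required" 30000)"
echo "NVMe drives:     $(drives_needed "$required" 200000)"
```

These are raw-capacity counts. They leave no headroom for bursts and ignore any redundancy overhead, so a real deployment would likely provision more.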

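# Benchmark disk I/O with fio (run before deploying MongoDB).
# Run from a directory on the target volume; fio creates its test files there.
# --ioengine=libaio with --direct=1 bypasses the page cache, so --iodepth
# takes effect and the device is measured rather than RAM.
fio --name=random-write --rw=randwrite --bs=4k --size=4G \
    --ioengine=libaio --direct=1 \
    --numjobs=16 --iodepth=32 --runtime=60 --group_reporting

# Illustrative output for NVMe (16 jobs x 32 deep = 512 I/Os in flight):
# write: IOPS=280k, BW=1094MiB/s, avg latency=1.8ms

# Illustrative output for gp3 (3000 IOPS); requests queue behind the IOPS cap:
# write: IOPS=3000, BW=11.7MiB/s, avg latency=170ms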
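
When reading fio latency figures, it helps to remember that at steady state the average completion latency is roughly the number of I/Os in flight divided by the IOPS achieved (Little's law). The sketch below applies that relation as a sanity check. It assumes an asynchronous I/O engine, so that all numjobs × iodepth requests are really outstanding, and `avg_latency_us` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate average completion latency in microseconds
# from Little's law: latency = in-flight I/Os / IOPS.
avg_latency_us() {
  local numjobs=$1 iodepth=$2 iops=$3
  echo $(( numjobs * iodepth * 1000000 / iops ))
}

avg_latency_us 16 32 280000   # 512 in flight at 280k IOPS: about 1.8 ms
avg_latency_us 16 32 3000     # 512 in flight at 3,000 IOPS: about 170 ms
avg_latency_us 1 1 3000       # one request at a time at 3,000 IOPS: about 0.33 ms
```

A reported latency far below this estimate usually means the queue depth is not being honored or the page cache is absorbing the writes.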

Filesystem Selection

MongoDB recommends XFS. The performance difference against ext4 is measurable:

Workload	XFS	ext4	Difference
Insert throughput (50K docs/s)	52,000/s	45,000/s	+16%
Checkpoint duration (1.5 GB flush)	4.2s	5.8s	-28%
Compaction speed	180 MB/s	140 MB/s	+29%
fallocate (journal pre-allocation)	0.01s	0.4s	40x faster

XFS advantages for MongoDB:

Extent-based allocation: Large contiguous writes (checkpoints) are more efficient.
Concurrent I/O: XFS handles parallel I/O from multiple WiredTiger threads better.
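fallocate support: Journal file pre-allocation is near-instant on XFS (0.01 s) vs. 0.4 s on ext4 in the measurements above.
Mount options: Mount with noatime to skip the access-time metadata update on every read. noexec and nodev harden the volume; they are security settings rather than performance settings.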

# Format and mount XFS for MongoDB (mkfs.xfs destroys any existing data on the device)
mkfs.xfs /dev/nvme1n1
mkdir -p /data/db
mount -o noatime,noexec,nodev /dev/nvme1n1 /data/db
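
Before starting mongod it can be worth confirming that the data volume really is XFS and mounted with noatime. The following is a minimal sketch of such a check; `check_mongo_mount` is a hypothetical helper that takes the filesystem type and the comma-separated mount options as plain strings:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the filesystem type is xfs and the
# mount options include noatime.
# Arguments: <fstype> <comma-separated mount options>
check_mongo_mount() {
  local fstype=$1 options=$2
  if [ "$fstype" != "xfs" ]; then
    echo "unexpected filesystem: $fstype" >&2
    return 1
  fi
  # Wrap in commas so "noatime" only matches as a whole option.
  case ",$options," in
    *,noatime,*) ;;
    *) echo "noatime not set" >&2; return 1 ;;
  esac
  echo "ok"
}

# Example with literal values, as they might appear for an XFS mount.
check_mongo_mount "xfs" "rw,noatime,nodev,noexec,attr2,inode64"
```

On a live host the two arguments can be taken from findmnt, for example `check_mongo_mount "$(findmnt -no FSTYPE -T /data/db)" "$(findmnt -no OPTIONS -T /data/db)"`.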

readahead Tuning

Linux reads ahead of the current position to pre-fetch data that will likely be needed next. The default readahead is 128 KB (256 sectors of 512 bytes). For MongoDB's random-read workloads this is too high: each read pulls in neighboring data that is rarely used, which wastes I/O bandwidth.

# Check current readahead
blockdev --getra /dev/nvme1n1
# 256 (sectors) = 128 KB

# Set readahead to 16 KB (32 sectors) for MongoDB
blockdev --setra 32 /dev/nvme1n1

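# Make persistent via udev rule (match whole NVMe disks only, not controllers or partitions)
echo 'ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme*", ENV{DEVTYPE}=="disk", ATTR{queue/read_ahead_kb}="16"' \
  > /etc/udev/rules.d/99-mongodb.rules
# Apply the rule without rebooting
udevadm control --reload-rules && udevadm trigger --subsystem-match=block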

MongoDB’s WiredTiger manages its own read patterns. Large readahead wastes I/O on data that WiredTiger will not use (because it reads specific pages, not sequential ranges). Setting readahead to 8-32 KB reduces wasted I/O.
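
The readahead value appears in two units: blockdev works in 512-byte sectors, while the sysfs attribute read_ahead_kb works in KB. The sketch below converts between them and reports the current setting for any NVMe devices it finds; `sectors_to_kb` and `kb_to_sectors` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: convert between 512-byte sectors (blockdev)
# and KB (sysfs read_ahead_kb).
sectors_to_kb() { echo $(( $1 * 512 / 1024 )); }
kb_to_sectors() { echo $(( $1 * 1024 / 512 )); }

sectors_to_kb 256   # the 256-sector default, in KB
sectors_to_kb 32    # the value set for MongoDB, in KB
kb_to_sectors 16    # 16 KB expressed in sectors

# Report the current readahead for every NVMe device present.
# Prints nothing on hosts without NVMe devices.
for f in /sys/block/nvme*/queue/read_ahead_kb; do
  [ -r "$f" ] || continue
  dev=${f#/sys/block/}
  dev=${dev%%/*}
  echo "$dev: $(cat "$f") KB"
done
```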