unbound mongodb at scale

Storage Infrastructure: Disk, Filesystem, and RAID

Chapter 67 of 72

MongoDB’s performance is ultimately bounded by storage I/O. The WiredTiger cache absorbs read hot spots, but writes (checkpoint flushes, journal commits, compaction) always hit disk. The choice of disk technology, filesystem, and kernel I/O settings determines the floor below which latency cannot go.

Figure: Storage stack diagram. Shows application -> MongoDB -> WiredTiger -> filesystem (XFS/ext4) -> I/O scheduler -> disk hardware (HDD/SSD/NVMe), with latency and random vs sequential I/O patterns marked at each layer. Checkpoint writes are sequential; journal writes are sequential but latency-sensitive.

Disk Technology Comparison

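| Metric                   | HDD (7200 RPM) | SATA SSD | NVMe SSD   |
|--------------------------|----------------|----------|------------|
| Random read IOPS         | 150            | 50,000   | 500,000+   |
| Random write IOPS        | 150            | 30,000   | 200,000+   |
| Sequential read          | 200 MB/s       | 550 MB/s | 3,500 MB/s |
| Sequential write         | 200 MB/s       | 520 MB/s | 3,000 MB/s |
| Latency (random 4K read) | 8-12 ms        | 0.1 ms   | 0.02 ms    |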

For the telemetry platform at 50,000 writes/second, each write generates approximately 2-3 I/O operations (data + index + journal). Total I/O requirement: 100,000-150,000 IOPS.

  • HDD: Cannot sustain 150,000 IOPS. Maximum sustained: 150 IOPS per drive. Would need 1,000 drives in RAID.
  • SATA SSD: 30,000 write IOPS per drive. Would need 5 drives.
  • NVMe: 200,000+ write IOPS per drive. A single drive is sufficient.
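
The drive counts above follow from one ceiling division. A minimal sketch, assuming the worst-case figure of 3 I/O operations per write and the per-drive write IOPS from the comparison table; the function name `drives_needed` is hypothetical and used only for illustration:

```shell
# Sketch: drives needed to sustain a target write rate.
# Inputs are the illustrative figures from this section, not measurements.

# drives_needed WRITES_PER_SEC IO_PER_WRITE IOPS_PER_DRIVE
# Prints ceil(WRITES_PER_SEC * IO_PER_WRITE / IOPS_PER_DRIVE).
drives_needed() {
  local writes_per_sec=$1
  local io_per_write=$2
  local iops_per_drive=$3
  local required=$(( writes_per_sec * io_per_write ))
  # Integer ceiling division: add (divisor - 1) before dividing.
  echo $(( (required + iops_per_drive - 1) / iops_per_drive ))
}

drives_needed 50000 3 150      # HDD      -> 1000
drives_needed 50000 3 30000    # SATA SSD -> 5
drives_needed 50000 3 200000   # NVMe     -> 1
```

These counts cover raw IOPS only. They leave no headroom for RAID write overhead, checkpoint bursts, or a failed drive.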
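
# Benchmark disk I/O with fio (run before deploying MongoDB).
# --ioengine=libaio makes --iodepth take effect; --direct=1 bypasses the page cache.
fio --name=random-write --rw=randwrite --bs=4k --size=4G \
    --ioengine=libaio --direct=1 \
    --numjobs=16 --iodepth=32 --runtime=60 --group_reporting

# Average latency is roughly outstanding I/Os (16 jobs x 32 deep = 512) / IOPS.

# Expected output for NVMe:
# write: IOPS=280k, BW=1094MiB/s, avg latency=1.8ms

# Expected output for gp3 (3000 IOPS):
# write: IOPS=3000, BW=11.7MiB/s, avg latency=171ms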
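
Two quick consistency checks help when reading fio output: bandwidth should equal IOPS times block size, and average latency should be close to the number of outstanding I/Os divided by IOPS (Little's law). A minimal sketch follows; the helper names `iops_to_mibps` and `avg_latency_ms` are hypothetical. The latency estimate assumes the outstanding I/O count really is numjobs x iodepth, which holds only when fio runs with an asynchronous I/O engine.

```shell
# Sketch: sanity-check fio numbers. Helper names are illustrative only.

# iops_to_mibps IOPS BLOCK_KIB -> bandwidth in MiB/s
iops_to_mibps() {
  awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.2f\n", iops * bs / 1024 }'
}

# avg_latency_ms IOPS OUTSTANDING -> mean latency in ms (Little's law)
avg_latency_ms() {
  awk -v iops="$1" -v q="$2" 'BEGIN { printf "%.2f\n", q / iops * 1000 }'
}

iops_to_mibps 280000 4      # 1093.75 (NVMe at 4 KiB blocks)
iops_to_mibps 3000 4        # 11.72   (gp3 at 4 KiB blocks)
avg_latency_ms 280000 512   # 1.83    (16 jobs x 32 queue depth)
```

If a reported latency is far from this estimate, the effective queue depth is probably not what the command line requested.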

Filesystem Selection

MongoDB recommends XFS for WiredTiger data files on Linux. The performance difference against ext4 is measurable:

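| Workload                           | XFS      | ext4     | Difference |
|------------------------------------|----------|----------|------------|
| Insert throughput (50K docs/s)     | 52,000/s | 45,000/s | +15%       |
| Checkpoint duration (1.5 GB flush) | 4.2 s    | 5.8 s    | -28%       |
| Compaction speed                   | 180 MB/s | 140 MB/s | +29%       |
| fallocate (journal pre-allocation) | 0.01 s   | 0.4 s    | 40x faster |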

XFS advantages for MongoDB:

  • Extent-based allocation: Large contiguous writes (checkpoints) are more efficient.
  • Concurrent I/O: XFS handles parallel I/O from multiple WiredTiger threads better.
  • fallocate support: Journal file pre-allocation is near-instant on XFS (0.01 s) versus 0.4 s on ext4.
  • Mount options: Mount with noatime so that reads do not trigger access-time metadata writes. noexec and nodev are hardening options and do not affect throughput.

# Format and mount XFS for MongoDB (mkfs.xfs erases any data on the device)
mkfs.xfs /dev/nvme1n1
mkdir -p /data/db
mount -o noatime,noexec,nodev /dev/nvme1n1 /data/db
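
After mounting, it is worth confirming that the data path really is on XFS with noatime. A minimal sketch, assuming the /data/db mount point used above; the function names `has_mount_option` and `check_dbpath_mount` are hypothetical:

```shell
# Sketch: verify filesystem type and mount options for the MongoDB data path.

# has_mount_option OPTIONS OPTION
# Succeeds if the comma-separated OPTIONS string contains OPTION exactly
# (so "relatime" does not count as "noatime").
has_mount_option() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# check_dbpath_mount FSTYPE OPTIONS
# Prints one warning per problem found; prints nothing if all is well.
check_dbpath_mount() {
  local fstype=$1
  local options=$2
  [ "$fstype" = "xfs" ] || echo "warning: filesystem is $fstype, not xfs"
  has_mount_option "$options" noatime || echo "warning: noatime is not set"
}

# Against a live mount (findmnt ships with util-linux):
#   check_dbpath_mount "$(findmnt -no FSTYPE --target /data/db)" \
#                      "$(findmnt -no OPTIONS --target /data/db)"

# Example with literal values: prints both warnings.
check_dbpath_mount ext4 rw,relatime
```

Parsing the option string separately from the `findmnt` call keeps the check testable without a real mount.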

readahead Tuning

Linux reads ahead of the current position to pre-fetch data that will likely be needed next. The default readahead is 128 KB (256 sectors). For MongoDB, this is too high for random read workloads and wastes I/O bandwidth.

# Check current readahead
blockdev --getra /dev/nvme1n1
# 256 (sectors) = 128 KB

# Set readahead to 16 KB (32 sectors) for MongoDB
blockdev --setra 32 /dev/nvme1n1

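# Make persistent via udev rule (whole NVMe disks only, not partitions), then reload
echo 'ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme*", ATTR{queue/read_ahead_kb}="16"' \
  > /etc/udev/rules.d/99-mongodb.rules
udevadm control --reload-rules && udevadm trigger --subsystem-match=block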

MongoDB’s WiredTiger manages its own read patterns. Large readahead wastes I/O on data that WiredTiger will not use (because it reads specific pages, not sequential ranges). Setting readahead to 8-32 KB reduces wasted I/O.
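
Because `blockdev` works in 512-byte sectors while sysfs and the udev rule work in KB, the two units are easy to mix up. The sketch below converts between them and lists each block device's current readahead. The function names and the 32 KB threshold default are assumptions for illustration; `report_readahead` takes an optional sysfs root only so it can be tried against a scratch directory.

```shell
# Sketch: readahead unit conversion and a per-device report.

# sectors_to_kb SECTORS -> KB (blockdev --getra reports 512-byte sectors)
sectors_to_kb() { echo $(( $1 * 512 / 1024 )); }

# kb_to_sectors KB -> value to pass to `blockdev --setra`
kb_to_sectors() { echo $(( $1 * 1024 / 512 )); }

# report_readahead [SYSFS_ROOT] [MAX_KB]
# Prints each block device's readahead and flags values above MAX_KB.
report_readahead() {
  local root=${1:-/sys}
  local max_kb=${2:-32}
  local f dev kb
  for f in "$root"/block/*/queue/read_ahead_kb; do
    [ -r "$f" ] || continue
    dev=${f#"$root"/block/}   # strip the leading path
    dev=${dev%%/*}            # keep only the device name
    kb=$(cat "$f")
    if [ "$kb" -gt "$max_kb" ]; then
      echo "$dev: $kb KB (above $max_kb KB)"
    else
      echo "$dev: $kb KB"
    fi
  done
}

sectors_to_kb 256   # 128
kb_to_sectors 16    # 32
report_readahead    # output depends on the machine
```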