Sequential vs Random I/O: Measuring Your Storage

The main chapter showed a 51x difference in fsync latency between local NVMe and EBS gp3. That number came from pg_test_fsync, a PostgreSQL-specific tool. This section goes wider: how to measure every dimension of storage performance using fio, how to interpret the results, what queue depth does to your numbers, why read-ahead helps some workloads and hurts others, and how to build a benchmark suite that matches your actual application’s I/O pattern.

The fio Mental Model

fio generates I/O workloads with precise control over every parameter that affects performance. Think of it as a synthetic application that reads and writes in patterns you specify. The critical parameters fall into four categories:

Access pattern (--rw): sequential read, sequential write, random read, random write, or mixed. This determines whether consecutive I/O operations target adjacent blocks or scattered blocks.

Block size (--bs): the size of each individual I/O operation. PostgreSQL uses 8KB pages. Qdrant uses 4KB pages for HNSW graph nodes. Backup tools use 1MB or larger. The block size must match your application’s actual I/O size or the benchmark is measuring something irrelevant.

Queue depth (--iodepth): how many I/O operations are in-flight simultaneously. This is the single most influential parameter on NVMe performance and the one most often set incorrectly.

I/O engine (--ioengine): the system call interface used to submit I/O. libaio (Linux native async I/O) is the standard for benchmarks because it supports true asynchronous submission. io_uring is the modern replacement with lower overhead. sync uses blocking reads/writes and is only useful for measuring single-threaded serial access.

# The anatomy of a fio command. A comment cannot follow a line-continuation
# backslash in the shell, so the flags are annotated below the command.
fio \
  --name=test-name \
  --rw=randread \
  --bs=8k \
  --size=4G \
  --numjobs=1 \
  --iodepth=32 \
  --direct=1 \
  --ioengine=libaio \
  --runtime=60 \
  --time_based \
  --group_reporting \
  --filename=/dev/nvme0n1p4
#   --name             Label for the test
#   --rw               Access pattern
#   --bs               Block size (match your application)
#   --size             Total file size to operate on
#   --numjobs          Number of parallel workers
#   --iodepth          Queue depth per worker
#   --direct=1         Bypass page cache (measure device, not RAM)
#   --ioengine         Async I/O engine
#   --runtime          Run for 60 seconds
#   --time_based       Keep running until runtime, ignore size
#   --group_reporting  Aggregate results across jobs
#   --filename         Target device or file

The --direct=1 flag is essential for device benchmarks. Without it, fio reads and writes through the kernel page cache. A 4GB file on a machine with 64GB RAM fits entirely in cache. You end up benchmarking RAM speed (10 GB/s), not storage speed. With --direct=1, every I/O goes to the device.
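The same parameters can also be kept in a small script, where each one carries its own comment. The sketch below is only an illustration: the build_fio_command helper and the dictionary layout are hypothetical, while the flags are the ones from the command above.

```python
import shlex


def build_fio_command(name, params, flags=()):
    """Turn a dict of fio options into an argv list (insertion order is kept)."""
    argv = ["fio", f"--name={name}"]
    argv += [f"--{key}={value}" for key, value in params.items()]
    argv += [f"--{flag}" for flag in flags]  # options that take no value
    return argv


params = {
    "rw": "randread",             # access pattern
    "bs": "8k",                   # block size: one PostgreSQL page
    "size": "4G",                 # file size to operate on
    "numjobs": 1,                 # parallel workers
    "iodepth": 32,                # queue depth per worker
    "direct": 1,                  # bypass the page cache
    "ioengine": "libaio",         # async I/O engine
    "runtime": 60,                # seconds
    "filename": "/dev/nvme0n1p4", # target device or file
}

argv = build_fio_command("test-name", params, flags=("time_based", "group_reporting"))
print(shlex.join(argv))  # a single shell-safe command line
```

Passing argv to subprocess.run would execute the benchmark; printing it first makes it easy to review what is about to run.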

Building a Database-Relevant Benchmark Suite

Generic storage benchmarks (CrystalDiskMark, hdparm) run tests that do not match database workload patterns. A database benchmark suite needs four specific tests:

Test 1: WAL Write Pattern

The WAL writer appends 8KB records sequentially and calls fdatasync after each write group. This is the most latency-sensitive I/O in the system.

# WAL write simulation: sequential 8KB writes, fdatasync after every write.
# --iodepth=1 because the WAL writer is single-threaded.
fio --name=wal-write \
    --rw=write \
    --bs=8k \
    --size=2G \
    --numjobs=1 \
    --iodepth=1 \
    --direct=1 \
    --fdatasync=1 \
    --ioengine=libaio \
    --runtime=60 --time_based \
    --filename=/mnt/db-volume/fio-test

# Key output fields:
#   write: IOPS=28571, BW=223MiB/s
#   lat (usec): min=28, max=142, avg=35, stdev=8.2
#   fsync: avg=34.8 usec

Queue depth 1 is mandatory. The WAL writer calls fdatasync and waits for it to return before writing the next record. There is no parallelism. This test reveals the true serial fsync cost that the database will experience.
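Because the writes are strictly serial, the latency of one write-and-sync cycle sets a ceiling on how many the device can complete per second. A minimal sketch of that arithmetic follows; the function name is made up for illustration and the 35 us input is the average latency from the sample output above.

```python
def max_serial_ops_per_sec(latency_us):
    """Upper bound on operations per second when each one waits for the previous."""
    return 1_000_000 / latency_us


# 35 us per write+sync cycle, as in the sample output above.
print(round(max_serial_ops_per_sec(35)))  # -> 28571
```

The result matches the IOPS line in the sample output, which is what a queue depth of 1 implies: throughput is simply the inverse of latency.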

Test 2: Index Lookup Pattern

B-tree index traversals perform random 8KB reads. A typical lookup traverses 3-4 levels of the tree, reading one page per level. Multiple concurrent queries produce concurrent random reads.

# Index lookup simulation: random 8KB reads at database concurrency.
# --numjobs=8 models 8 concurrent backend processes;
# --iodepth=4 models each backend prefetching up to 4 pages.
fio --name=index-lookup \
    --rw=randread \
    --bs=8k \
    --size=4G \
    --numjobs=8 \
    --iodepth=4 \
    --direct=1 \
    --ioengine=libaio \
    --runtime=60 --time_based \
    --filename=/mnt/db-volume/fio-test

# Key output fields:
#   read: IOPS=342000, BW=2672MiB/s
#   lat (usec): min=18, max=892, avg=42, stdev=31.4
#   clat percentiles:
#     50.00th=[ 38], 95.00th=[ 68], 99.00th=[ 108], 99.99th=[ 490]

The numjobs and iodepth should approximate your database’s concurrent I/O. PostgreSQL with 8 active backends doing index scans produces roughly this pattern. The P99 latency matters more than the average because tail latency in index lookups translates directly to query tail latency.
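One way to see why the tail matters: a lookup that reads several pages is slow if any one of those reads is slow. The sketch below estimates that probability under the simplifying assumption that read latencies are independent; the function name is hypothetical.

```python
def prob_any_read_in_tail(reads_per_lookup, percentile=99.0):
    """Probability that at least one read is slower than the given percentile,
    assuming the reads are independent of each other."""
    p_fast = percentile / 100.0
    return 1.0 - p_fast ** reads_per_lookup


# A 4-level B-tree traversal: 4 page reads per lookup.
print(f"{prob_any_read_in_tail(4):.1%}")
# A query that performs 10 such lookups: 40 page reads.
print(f"{prob_any_read_in_tail(40):.1%}")
```

Under this assumption a P99 storage latency is hit by about 4% of four-read lookups, and by roughly a third of queries that perform 40 reads, so the storage tail shows up well below the query's own P99.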

Test 3: Sequential Scan Pattern

Full table scans and VACUUM read large contiguous regions. The block size should be larger (256KB-1MB) because the database reads ahead in chunks during sequential scans.

# Sequential scan simulation
fio --name=seq-scan \
    --rw=read \
    --bs=256k \
    --size=8G \
    --numjobs=2 \
    --iodepth=16 \
    --direct=1 \
    --ioengine=libaio \
    --runtime=60 --time_based \
    --filename=/mnt/db-volume/fio-test

# Key output fields:
#   read: IOPS=13600, BW=3400MiB/s
#   lat (usec): min=580, max=4200, avg=1180
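IOPS and bandwidth in fio output are two views of the same measurement, linked by the block size. A small sketch of the conversion (the helper name is an assumption for illustration) that can be used to sanity-check any result line:

```python
def bandwidth_mib_s(iops, block_kib):
    """Bandwidth in MiB/s implied by an IOPS figure and a block size in KiB."""
    return iops * block_kib / 1024


print(bandwidth_mib_s(13_600, 256))        # sequential scan test, 256KB blocks
print(round(bandwidth_mib_s(342_000, 8)))  # index lookup test, 8KB blocks
print(round(bandwidth_mib_s(28_571, 8)))   # WAL write test, 8KB blocks
```

If the two numbers in a result do not agree this way, the block size in the job is not what you think it is.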

Test 4: Mixed Read/Write Pattern

Real database workloads mix reads and writes. The ratio depends on your application. The content platform runs approximately 70% reads (article serving) and 30% writes (analytics ingestion + WAL).

# Mixed workload simulation
fio --name=mixed \
    --rw=randrw \
    --rwmixread=70 \
    --bs=8k \
    --size=4G \
    --numjobs=8 \
    --iodepth=4 \
    --direct=1 \
    --ioengine=libaio \
    --runtime=60 --time_based \
    --filename=/mnt/db-volume/fio-test

# Key output fields:
#   read:  IOPS=238000, BW=1859MiB/s, avg lat=48us
#   write: IOPS=102000, BW=797MiB/s, avg lat=62us

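The 70/30 split between read and write IOPS in this output is set by --rwmixread=70, not by the device. The device's asymmetry shows up in latency instead: writes average 62us against 48us for reads. Under sustained random load, flash writes typically cost more than reads because the FTL must find free NAND pages, erase blocks when necessary, and update mapping tables. This asymmetry is inherent to flash storage.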

Queue Depth Deep Dive

Queue depth is the most misunderstood storage parameter. Running fio at QD=1 and concluding that your NVMe drive is slow is like testing a 16-lane highway by sending one car at a time.

NVMe drives achieve high IOPS through internal parallelism. A modern NVMe SSD contains 4-8 NAND channels, each connected to multiple dies. Each die has multiple planes that can operate independently. A single I/O operation uses one plane on one die on one channel. The other planes, dies, and channels sit idle.

NVMe internal parallelism (Samsung 970 EVO Plus):

Channels: 4
Dies per channel: 4
Planes per die: 2
Total parallel units: 32

At QD=1:  1 of 32 units active  =  3.1% utilization
At QD=8:  8 of 32 units active  = 25.0% utilization
At QD=32: 32 of 32 units active = 100% utilization (theoretical)

This is why IOPS scales linearly with queue depth until the device saturates. Each additional in-flight I/O can target a different internal parallel unit.
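The utilization figures above follow from simple counting. The sketch below reproduces them under the best-case assumption that every in-flight I/O lands on a different parallel unit; the function names are illustrative.

```python
def parallel_units(channels, dies_per_channel, planes_per_die):
    """Number of independently operating units inside the drive."""
    return channels * dies_per_channel * planes_per_die


def best_case_utilization_pct(queue_depth, units):
    """Share of units kept busy if each in-flight I/O hits a distinct unit."""
    return min(queue_depth, units) / units * 100


units = parallel_units(channels=4, dies_per_channel=4, planes_per_die=2)
for qd in (1, 8, 32, 64):
    print(f"QD={qd:<3} {best_case_utilization_pct(qd, units):5.1f}%")
```

Real workloads collide on the same unit some of the time, so measured utilization at a given queue depth sits below this best case.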

Queue depth scaling test (4KB random reads, NVMe 970 EVO Plus):

QD     IOPS        Avg Lat    P99 Lat    Bandwidth     Device Util
--     ------      -------    -------    ----------    -----------
1      12,800       78 us     142 us       50 MiB/s        4%
2      25,600       78 us     148 us      100 MiB/s        8%
4      51,200       78 us     155 us      200 MiB/s       15%
8     102,400       78 us     168 us      400 MiB/s       30%
16    204,800       78 us     185 us      800 MiB/s       60%
32    340,000       94 us     248 us    1,328 MiB/s       89%
64    370,000      173 us     512 us    1,445 MiB/s       97%
128   380,000      337 us   1,024 us    1,484 MiB/s       99%
256   382,000      670 us   2,048 us    1,492 MiB/s      100%

Three phases are visible:

Linear scaling (QD 1-16). IOPS doubles with each doubling of queue depth. Average latency stays constant because each I/O gets its own internal parallel unit. The device has spare capacity.

Saturation onset (QD 16-64). IOPS growth slows. Average latency starts rising because some I/O operations must queue behind others for the same internal unit. The device is approaching capacity.

Full saturation (QD 64+). IOPS plateaus. Latency increases linearly with queue depth because every new I/O waits in a queue. Adding more parallelism now only adds latency.
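The three phases are an instance of Little's law: with the queue kept full, average latency equals queue depth divided by IOPS. A short sketch that recomputes the Avg Lat column of the table from its QD and IOPS columns (the function and variable names are illustrative):

```python
def avg_latency_us(queue_depth, iops):
    """Little's law: average time in the system = items in flight / throughput."""
    return queue_depth / iops * 1_000_000


# (queue depth, IOPS, average latency in us) from the scaling table above.
table = [
    (1, 12_800, 78), (2, 25_600, 78), (4, 51_200, 78),
    (8, 102_400, 78), (16, 204_800, 78), (32, 340_000, 94),
    (64, 370_000, 173), (128, 380_000, 337), (256, 382_000, 670),
]

for qd, iops, reported in table:
    print(f"QD={qd:<3} computed={avg_latency_us(qd, iops):6.1f} us  reported={reported} us")
```

Once IOPS stops growing, the formula leaves latency as the only thing that can: doubling the queue depth at a fixed IOPS ceiling doubles the average latency.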

PostgreSQL and Queue Depth

PostgreSQL generates I/O queue depth through three mechanisms:

Backend concurrency. Each active backend process issues I/O independently. With max_connections=100 and 20 backends actively doing disk reads, the effective queue depth is up to 20.

effective_io_concurrency. During bitmap heap scans, the executor issues asynchronous prefetch requests ahead of the scan position. The default is 1. For NVMe, setting this to 200 allows the executor to have up to 200 prefetch requests in flight.

Parallel workers. Each worker in a parallel query generates its own I/O stream. Three parallel workers doing a parallel sequential scan produce QD=3 for that query alone.

# postgresql.conf

# SLOW: PostgreSQL defaults, NVMe storage underutilized
effective_io_concurrency = 1          # 1 prefetch at a time
maintenance_io_concurrency = 10       # VACUUM/CREATE INDEX

# FAST: tuned for NVMe
effective_io_concurrency = 200        # Match NVMe internal parallelism
maintenance_io_concurrency = 200      # Faster VACUUM and index builds

VACUUM on articles table (12GB, 2.4M rows):

effective_io_concurrency=1:     VACUUM completed in 48.2 seconds
effective_io_concurrency=200:   VACUUM completed in 8.4 seconds

5.7x faster, same I/O volume, more parallelism.

Read-Ahead Mechanics

The kernel read-ahead algorithm lives in mm/readahead.c. It tracks sequential access per file descriptor. When it detects that an application is reading pages in order, it submits asynchronous read requests for pages ahead of the current position.

Read-ahead state machine:

1. Initial read: App reads page N.
   Kernel notes the access but does not prefetch.

2. Second sequential read: App reads page N+1.
   Kernel detects a sequential pattern.
   Prefetches pages N+2 through N+5 (initial window = 4 pages).

3. Continued sequential reads: App reads N+2, N+3.
   Kernel doubles the prefetch window to 8 pages.
   Prefetches pages N+6 through N+13.

4. Window growth continues until the read_ahead_kb limit.
   Default max: 128KB = 32 pages of 4KB.
   On NVMe with high bandwidth: can increase to 2MB+.

5. Random access detected: App reads page M (far from N).
   Kernel resets the prefetch window.
   No prefetching until a new sequential pattern emerges.
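The steps above can be sketched as a toy model. This is a deliberate simplification and not the kernel's algorithm: it doubles the window on every sequential access rather than as the application consumes the prefetched pages, and the function name and defaults are assumptions taken from the numbers in the list.

```python
def simulate_readahead(accesses, initial_window=4, max_window=32):
    """Return the prefetch window (in pages) issued after each page access."""
    windows = []
    window = 0
    prev = None
    for page in accesses:
        if prev is not None and page == prev + 1:
            # Sequential access: open the window, then double it up to the cap.
            window = initial_window if window == 0 else min(window * 2, max_window)
        else:
            # First access or a jump elsewhere: reset, no prefetch.
            window = 0
        windows.append(window)
        prev = page
    return windows


# Six sequential pages, then a jump to a distant page and one more sequential read.
print(simulate_readahead([100, 101, 102, 103, 104, 105, 5000, 5001]))
# -> [0, 4, 8, 16, 32, 32, 0, 4]
```

The jump to page 5000 throws away a 32-page window, which is the cost the next section examines.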

The read-ahead setting (read with blockdev --getra, changed with blockdev --setra) controls the maximum window size in 512-byte sectors. The default of 256 sectors equals 128KB. This works well for sequential workloads but has side effects on random workloads.

Read-Ahead and Random I/O Interference

When the kernel issues a read-ahead for 128KB but the application only needs 4KB, the excess 124KB wastes three resources:

Device bandwidth. The 124KB of unwanted data occupies NVMe queue slots and internal bandwidth. On a device doing 340,000 4KB random IOPS, adding 31 unwanted pages to every read would, in the worst case, multiply the bytes transferred by 32.

Page cache memory. The prefetched pages enter the page cache, potentially evicting hot pages that will be needed again. On a machine with 64GB RAM running a 50GB database, every page evicted from cache means a future cache miss.

CPU. The completion interrupts for unwanted I/O consume CPU cycles for no benefit.

In practice, the kernel’s read-ahead heuristic is conservative with random patterns. It only triggers after detecting two consecutive sequential accesses to the same file descriptor. But database workloads can accidentally trigger it. A B-tree leaf page scan (reading pages 1000, 1001, 1002 within a leaf chain) looks sequential to the kernel, triggering prefetch. If the next access jumps to page 48,000 (a different part of the index), the prefetch was wasted.
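To put a number on the worst case, the sketch below computes the amplification for a full 128KB window on a 4KB read, and the bandwidth that would be needed if every one of 340,000 reads per second triggered it. The helper names are illustrative, and the scenario is an upper bound rather than something the kernel's heuristic would actually do.

```python
def readahead_amplification(window_kib, needed_kib):
    """Return (pages transferred per wanted page, wasted KiB per read)."""
    pages_per_read = window_kib // needed_kib
    wasted_kib = window_kib - needed_kib
    return pages_per_read, wasted_kib


def worst_case_bandwidth_mib_s(iops, window_kib):
    """Bandwidth required if every read pulled in a full read-ahead window."""
    return iops * window_kib / 1024


pages, wasted = readahead_amplification(window_kib=128, needed_kib=4)
print(pages, wasted)                              # -> 32 124
print(worst_case_bandwidth_mib_s(340_000, 128))   # -> 42500.0
```

No single NVMe drive delivers anything close to 42,500 MiB/s, so in this scenario the device would saturate on prefetch traffic long before reaching its rated random IOPS.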

# Measure read-ahead impact on mixed database workload.
# Run as root. Dropping the page cache before each run is a suggested
# precaution so that both runs start cold.

# High read-ahead (default)
blockdev --setra 256 /dev/nvme0n1
sync; echo 3 > /proc/sys/vm/drop_caches
fio --name=mixed --rw=randrw --rwmixread=70 --bs=8k --size=4G \
    --numjobs=8 --iodepth=4 --direct=0 --ioengine=libaio --runtime=60 \
    --filename=/mnt/db-volume/fio-test

# Result: read IOPS=198,000  write IOPS=85,000

# Low read-ahead (tuned for random)
blockdev --setra 16 /dev/nvme0n1   # 16 sectors = 8KB = 1 database page
sync; echo 3 > /proc/sys/vm/drop_caches
fio --name=mixed --rw=randrw --rwmixread=70 --bs=8k --size=4G \
    --numjobs=8 --iodepth=4 --direct=0 --ioengine=libaio --runtime=60 \
    --filename=/mnt/db-volume/fio-test

# Result: read IOPS=224,000  write IOPS=96,000
# Improvement: +13% read IOPS, +13% write IOPS

The 13% improvement comes from eliminating wasted prefetch I/O on a workload that the kernel cannot predict. Note that --direct=0 is used here (page cache enabled) because read-ahead only affects buffered I/O. With --direct=1, the page cache is bypassed entirely and read-ahead has no effect.

Tuning Read-Ahead Per Volume

Different volumes in the same system need different read-ahead settings. The content platform uses three volumes:

# Database volume: primarily random I/O
blockdev --setra 16 /dev/nvme0n1     # 8KB, one database page

# WAL volume: purely sequential
blockdev --setra 512 /dev/nvme1n1    # 256KB, let the kernel prefetch

# Backup volume: large sequential reads/writes
blockdev --setra 4096 /dev/nvme2n1   # 2MB, maximum prefetch for bulk transfer

Make these settings persistent via udev rules:

# /etc/udev/rules.d/60-disk-readahead.rules
ACTION=="add|change", KERNEL=="nvme0n1", ATTR{bdi/read_ahead_kb}="8"
ACTION=="add|change", KERNEL=="nvme1n1", ATTR{bdi/read_ahead_kb}="256"
ACTION=="add|change", KERNEL=="nvme2n1", ATTR{bdi/read_ahead_kb}="2048"
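blockdev works in 512-byte sectors while the udev attribute is in kilobytes, so it is easy to end up with rules that do not match the commands they are meant to persist. A minimal sketch of the conversion, checked against the three volumes above (the function name and the table layout are illustrative):

```python
SECTOR_BYTES = 512


def setra_to_kib(sectors):
    """Convert a blockdev --setra value (512-byte sectors) to read_ahead_kb."""
    return sectors * SECTOR_BYTES // 1024


# device: (blockdev --setra value, read_ahead_kb in the udev rule)
volumes = {
    "nvme0n1": (16, 8),       # database volume
    "nvme1n1": (512, 256),    # WAL volume
    "nvme2n1": (4096, 2048),  # backup volume
}

for device, (sectors, rule_kib) in volumes.items():
    status = "ok" if setra_to_kib(sectors) == rule_kib else "MISMATCH"
    print(f"{device}: --setra {sectors} = {setra_to_kib(sectors)} KiB, rule says {rule_kib} KiB: {status}")
```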

Interpreting fio Output

fio output contains more information than most people extract. The critical fields and what they mean:

rand-read: (groupid=0, jobs=8): err= 0: pid=12345
  read: IOPS=342k, BW=2672MiB/s (2802MB/s)(157GiB/60001msec)

    # IOPS: operations per second (what databases care about)
    # BW: bandwidth in MiB/s (what sequential scans care about)

    slat (nsec): min=1200, max=45000, avg=2100, stdev=820
    # Submission latency: time to submit I/O to kernel
    # Should be <10us. If higher, ioengine overhead or CPU contention.

    clat (usec): min=18, max=892, avg=42.1, stdev=31.4
    # Completion latency: time from submission to device completion
    # This is the device latency. The number you compare across devices.

    lat (usec): min=19, max=895, avg=44.2, stdev=31.8
    # Total latency: slat + clat. What the application sees.

    clat percentiles (usec):
     |  1.00th=[   24],  5.00th=[   28], 10.00th=[   30],
     | 20.00th=[   33], 30.00th=[   35], 40.00th=[   38],
     | 50.00th=[   40], 60.00th=[   42], 70.00th=[   45],
     | 80.00th=[   50], 90.00th=[   62], 95.00th=[   78],
     | 99.00th=[  112], 99.50th=[  148], 99.90th=[  392],
     | 99.95th=[  510], 99.99th=[  780]
    # Latency distribution. P99 and P99.9 reveal tail behavior.
    # A big gap between P99 and P99.99 indicates GC or scheduling jitter.

   bw (  KiB/s): min=2480000, max=2892000, per=100.00%, avg=2736128
    # Bandwidth over time. Large variance means the device is inconsistent.
    # Check min vs avg. If min < 50% of avg, the device has stalls.

   iops        : min=310000, max=361500, avg=342016
    # IOPS over time. Same variance analysis as bandwidth.

  cpu          : usr=8.42%, sys=18.64%, ctx=20508432, majf=0, minf=96
    # CPU usage. sys% > 30% means kernel I/O stack overhead is significant.
    # Consider io_uring to reduce syscall overhead.

The most common mistake is reporting only average latency. A storage device with 40us average and 112us P99 behaves very differently from one with 40us average and 5ms P99. Always report the percentile distribution, especially P99 and P99.9, because database query latency percentiles are bounded below by storage latency percentiles.
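Reporting percentiles is easier when fio writes machine-readable output (--output-format=json). The sketch below pulls the headline numbers out of that output. It assumes the JSON layout used by fio 3.x, where each job has read and write sections with a clat_ns.percentile map keyed by strings such as "99.000000"; check the field names against your fio version. The sample document is hand-written to mirror the annotated output above, not captured from a real run.

```python
import json

SAMPLE = """
{
  "jobs": [
    {
      "jobname": "rand-read",
      "read": {
        "iops": 342016.0,
        "bw": 2736128,
        "clat_ns": {
          "mean": 42100.0,
          "percentile": {
            "50.000000": 40000,
            "99.000000": 112000,
            "99.900000": 392000,
            "99.990000": 780000
          }
        }
      }
    }
  ]
}
"""


def summarize(job, direction="read"):
    """Extract IOPS, bandwidth and completion-latency percentiles for one job."""
    stats = job[direction]
    pct = stats["clat_ns"]["percentile"]
    return {
        "iops": stats["iops"],
        "bw_mib_s": stats["bw"] / 1024,            # fio reports bw in KiB/s
        "avg_us": stats["clat_ns"]["mean"] / 1000,
        "p50_us": pct["50.000000"] / 1000,
        "p99_us": pct["99.000000"] / 1000,
        "p99.9_us": pct["99.900000"] / 1000,
        "p99.99_us": pct["99.990000"] / 1000,
    }


summary = summarize(json.loads(SAMPLE)["jobs"][0])
for key, value in summary.items():
    print(f"{key:>10}: {value}")
```

Storing the whole summary for every run, rather than one average, is what makes later comparisons between devices or settings meaningful.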

Benchmark Anti-Patterns

Five mistakes that produce misleading fio results:

Testing through the page cache. Omitting --direct=1 means the first run populates cache and subsequent runs read from RAM. The results show 10 GB/s “disk” throughput. Use --direct=1 for device benchmarks. Use --direct=0 only when you specifically want to measure the combined cache+device performance.

Insufficient runtime. Running fio for 5 seconds misses SSD garbage collection pauses, EBS credit depletion, and thermal throttling. Use --runtime=300 (5 minutes) minimum for production-relevant numbers. On EBS volume types that use burst credits (gp2, for example), a full credit bucket sustains burst performance for about 30 minutes; run for at least 60 minutes to see baseline performance.

Wrong block size. Testing a database volume with 1MB blocks shows sequential throughput, which is irrelevant for index lookups. Match the block size to your application: 8KB for PostgreSQL, 4KB for most key-value stores, 16KB for MySQL/InnoDB.

Queue depth mismatch. Testing at QD=256 shows peak device capability but not what your application will achieve. A single PostgreSQL backend doing synchronous reads operates at QD=1. Test at the queue depth your application actually generates.

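Testing a filesystem instead of a device. Filesystem metadata operations (extent allocation, journal writes) add overhead, so a filesystem-only test hides what the device itself can do. Test both: raw device (--filename=/dev/nvme0n1) for device capability, and filesystem (--filename=/mnt/data/fio-test) for what the application will see. The gap reveals filesystem overhead. Write tests against a raw device destroy whatever is stored on it, so run them only on a device that holds no data you need.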

Filesystem overhead (ext4 vs XFS vs raw, NVMe, 4KB random writes):

Target              IOPS        Avg Latency    Overhead
-----------------   --------    -----------    --------
Raw device          310,000       103 us          0%
XFS                 285,000       112 us         8.1%
ext4                272,000       118 us        12.3%
ext4 with journal   248,000       129 us        20.0%

XFS outperforms ext4 for random writes because its extent-based
allocator requires fewer metadata updates per write. Both add
overhead vs raw device due to journal writes and inode updates.
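The Overhead column is the IOPS lost relative to the raw device. A short sketch of that calculation, using the figures from the table (the function name is illustrative), which you can reuse with your own raw and filesystem results:

```python
def overhead_pct(raw_iops, measured_iops):
    """IOPS lost relative to the raw device, as a percentage."""
    return (raw_iops - measured_iops) / raw_iops * 100


RAW_IOPS = 310_000
results = {"XFS": 285_000, "ext4": 272_000, "ext4 with journal": 248_000}

for target, iops in results.items():
    print(f"{target:<18} {overhead_pct(RAW_IOPS, iops):5.1f}%")
```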

Run all four benchmark tests, record the results in a spreadsheet, and compare them to the main chapter’s reference numbers. The delta between your measured performance and the device’s specification sheet tells you how much overhead your software stack adds. That overhead is your optimization target.