fast by design

Disk I/O Performance: Sequential vs Random, fsync Cost, and the Storage Choice That Determines Your Ceiling

15 min read Chapter 82 of 90

The content platform’s PostgreSQL database handles 4,200 writes per second during peak ingestion. Each write appends to the write-ahead log (WAL), which calls fsync to guarantee durability. The database runs on a general-purpose EBS volume (gp3) attached to an EC2 instance. Average write latency as reported by pg_stat_wal is 1.8ms. That number seems reasonable until you benchmark what the hardware can actually do.

A local NVMe drive on the same instance type completes an fsync in 35 microseconds. The gp3 volume takes 1.8 milliseconds. The database spends 51x longer waiting for storage confirmation on every single WAL write. At 4,200 writes per second, the cumulative wait is 7.56 seconds of blocking per second of wall clock time. The database compensates by batching commits, but single-transaction latency pays the full price.
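The arithmetic behind those two figures can be checked in a few lines. This is a sketch that uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the text above; names are illustrative.
NVME_FSYNC_S = 35e-6      # local NVMe fsync: 35 microseconds
GP3_FSYNC_S = 1.8e-3      # gp3 fsync: 1.8 milliseconds
WRITES_PER_SEC = 4_200    # peak WAL writes per second

slowdown = GP3_FSYNC_S / NVME_FSYNC_S
blocking_per_sec = WRITES_PER_SEC * GP3_FSYNC_S

print(f"gp3 is {slowdown:.0f}x slower per fsync")
print(f"{blocking_per_sec:.2f} s of fsync wait per wall-clock second")
```

More than one second of wait per second is only possible because the waits overlap: several sessions block on the same flush at once.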

This is the storage ceiling. No amount of query optimization, connection pooling, or caching eliminates it. The storage device under the database sets a hard floor on write latency, and every layer above inherits that floor.

The Two Dimensions of Storage Performance

Storage performance has two independent dimensions that people conflate: throughput and IOPS.

Throughput measures bytes per second for large sequential transfers. A gp3 volume delivers 125 MB/s baseline throughput. A local NVMe drive delivers 3,500 MB/s. Throughput matters for sequential scans, backup restoration, and bulk data loading.

IOPS measures operations per second for small random accesses. A gp3 volume delivers 3,000 baseline IOPS. A local NVMe drive delivers 800,000 IOPS. IOPS matters for database index lookups, random page reads, and WAL writes.

Storage performance dimensions for the content platform:

Workload type          Metric that matters        Why
-----------------      -------------------        ---
WAL writes             IOPS + fsync latency       Each commit = 1 fsync, 4-8KB write
Index lookups          Random read IOPS           B-tree traversal = 3-4 random 8KB reads
Full-text search       Sequential throughput      Scanning posting lists = large sequential reads
Backup (pg_dump)       Sequential throughput      Streaming table data = sequential reads
VACUUM                 Mixed IOPS + throughput    Reading dead tuples (random) + writing (sequential)
Qdrant vector search   Random read IOPS           HNSW graph traversal = many random reads

Most database workloads are IOPS-bound, not throughput-bound. A query that reads 4 index pages does 4 random 8KB reads. Total data: 32KB. At 3,000 IOPS, those 4 reads take 1.3ms. At 800,000 IOPS, they take 5 microseconds. The throughput of the device is irrelevant because the transfer size is tiny. The bottleneck is the number of operations the device can process per second.
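The estimate above treats the device's IOPS rating as a serial service rate. A minimal sketch of that simplification, with a hypothetical helper name:

```python
def serial_read_time_s(num_reads: int, iops: float) -> float:
    """Time for num_reads issued back to back at a device limit of `iops`."""
    return num_reads / iops

# IOPS figures are the ones quoted in the text above.
for name, iops in [("gp3 baseline", 3_000), ("local NVMe", 800_000)]:
    t = serial_read_time_s(4, iops)
    print(f"{name}: {t * 1e6:.0f} us for 4 reads (32KB total)")
```

Real latency also depends on queue depth and per-request service time, so treat this as an order-of-magnitude estimate rather than a prediction.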

Sequential vs Random: The Physics

On spinning disks (HDDs), the gap between sequential and random performance is enormous. Sequential reads deliver 150-200 MB/s. Random 4KB reads deliver about 100 IOPS, or 0.4 MB/s. That is a 400x difference, caused by physical seek time (moving the read head) and rotational latency (waiting for the platter to spin to the right sector).
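The roughly 400x figure follows from converting random IOPS to bandwidth. A small sketch using decimal units (1 MB = 1000 KB); the helper name is illustrative:

```python
def mb_per_s(iops: float, block_kb: float) -> float:
    """Bandwidth implied by an IOPS rate at a given block size."""
    return iops * block_kb / 1000

random_mb_s = mb_per_s(100, 4)  # 100 random 4KB reads per second
for seq_mb_s in (150, 200):     # sequential range quoted above
    ratio = seq_mb_s / random_mb_s
    print(f"{seq_mb_s} MB/s sequential vs {random_mb_s} MB/s random: {ratio:.0f}x")
```

The two ends of the sequential range bracket the 400x figure quoted in the text.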

SSDs eliminated mechanical movement. Random reads on a SATA SSD deliver 50,000-90,000 IOPS. But the gap between sequential and random did not disappear. It shrank from 400x to less than 2x in the 4KB, queue-depth-32 measurements below. The remaining gap comes from three sources:

Flash translation layer (FTL) overhead. The SSD controller maintains a mapping table from logical block addresses to physical NAND locations. Random writes scatter across the mapping table, requiring more FTL lookups per operation than sequential writes that map to contiguous physical blocks.

Read-ahead inefficiency. The kernel’s block layer prefetches data when it detects sequential patterns. For sequential reads, the prefetch hits. For random reads, any prefetched data is wasted bandwidth. The default read-ahead window is 128KB (32 pages of 4KB). When a buffered random 4KB read triggers a full 128KB prefetch, 124KB of it is thrown away.

Internal parallelism. NVMe drives contain multiple NAND channels and dies. Sequential I/O naturally stripes across channels, achieving maximum internal parallelism. Random I/O may hit the same channel repeatedly, serializing access.

Measured IOPS by access pattern (fio, iodepth=32, 4KB blocks):

Device                Sequential Read    Random Read    Ratio    Sequential Write    Random Write    Ratio
--------------------  ----------------   -----------    -----    ----------------    ------------    -----
HDD (7200 RPM)             180 IOPS       100 IOPS     1.8x         170 IOPS          95 IOPS      1.8x
SATA SSD (860 EVO)      93,000 IOPS    52,000 IOPS     1.8x      48,000 IOPS      32,000 IOPS      1.5x
NVMe (970 EVO Plus)    520,000 IOPS   340,000 IOPS     1.5x     480,000 IOPS     310,000 IOPS      1.5x
NVMe (Intel P5800X)    900,000 IOPS   800,000 IOPS     1.1x     850,000 IOPS     780,000 IOPS      1.1x
EBS gp3 (baseline)       3,000 IOPS     3,000 IOPS     1.0x       3,000 IOPS       3,000 IOPS      1.0x
EBS io2 (provisioned)   64,000 IOPS    64,000 IOPS     1.0x      64,000 IOPS      64,000 IOPS      1.0x

[Figure: IOPS comparison by storage type]

Notice that EBS volumes show identical sequential and random IOPS. This is because the storage is network-attached. The bottleneck is not the physical media but the network path and the EBS service’s token bucket rate limiter. Whether the access pattern is sequential or random is irrelevant when every I/O traverses a network round trip.

Queue Depth: The Hidden Multiplier

A single-threaded application issuing one I/O at a time sees a fraction of what a storage device can deliver. NVMe drives are designed for parallel command processing. The NVMe specification allows up to 65,535 I/O queues, each holding up to 65,536 commands.

Queue depth measures how many I/O operations are in-flight simultaneously. At queue depth 1, the application waits for each operation to complete before submitting the next. The device sits idle during the software processing between submissions.

NVMe 970 EVO Plus, 4KB random reads by queue depth:

Queue Depth    IOPS        Latency (avg)    Bandwidth
-----------    -------     -------------    ---------
QD=1            12,800        78 us          50 MB/s
QD=4            51,200        78 us         200 MB/s
QD=16          204,800        78 us         800 MB/s
QD=32          340,000        94 us        1328 MB/s
QD=64          370,000       173 us        1445 MB/s
QD=128         380,000       337 us        1484 MB/s

Observations:
  - IOPS scale linearly with QD through QD=16, then flatten as the device saturates (~QD=32)
  - Latency stays flat until saturation, then rises
  - QD=1 delivers 3.4% of peak IOPS
  - QD=32 delivers 89% of peak IOPS
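The table is consistent with Little's law: sustained IOPS is approximately queue depth divided by average latency. A sketch that recomputes the IOPS column from the other two (the function name is illustrative):

```python
def iops_from_queue_depth(queue_depth: int, avg_latency_us: float) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    return queue_depth / (avg_latency_us * 1e-6)

# (queue depth, average latency in microseconds) from the table above.
rows = [(1, 78), (4, 78), (16, 78), (32, 94), (64, 173), (128, 337)]
for qd, latency_us in rows:
    print(f"QD={qd:<4} ~{iops_from_queue_depth(qd, latency_us):>9,.0f} IOPS")
```

Once latency starts rising as fast as queue depth, extra outstanding requests only wait in line; they no longer add throughput.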

PostgreSQL achieves queue depth greater than 1 through effective_io_concurrency. The default is 1 in PostgreSQL 17 and earlier. For NVMe storage, setting it to 200 tells the executor it may keep up to 200 prefetch requests in flight during bitmap heap scans. The parallel workers in a parallel query each contribute their own I/O, further increasing effective queue depth.

-- SLOW: default io concurrency on NVMe storage
SET effective_io_concurrency = 1;

-- Bitmap heap scan on articles table (12GB, 2.4M rows)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM articles WHERE category_id IN (3, 7, 12)
  AND published_at > '2025-01-01';

-- Bitmap Heap Scan on articles
--   Rows Removed by Filter: 180,432
--   Buffers: shared hit=28402 read=14208
--   I/O Timings: read=892.4ms
--   Planning Time: 0.8ms
--   Execution Time: 1284.2ms

-- FAST: tuned io concurrency for NVMe
SET effective_io_concurrency = 200;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM articles WHERE category_id IN (3, 7, 12)
  AND published_at > '2025-01-01';

-- Bitmap Heap Scan on articles
--   Rows Removed by Filter: 180,432
--   Buffers: shared hit=28402 read=14208
--   I/O Timings: read=148.6ms
--   Planning Time: 0.8ms
--   Execution Time: 442.8ms

Same query, same data, same number of disk reads. The only difference: the database issued prefetch requests in parallel instead of serially. I/O time dropped from 892ms to 149ms, a 6x improvement. Total execution time dropped from 1284ms to 443ms.

The fsync Tax

fsync forces the operating system to flush a file’s in-memory buffers to persistent storage and wait for the device to confirm the data is on stable media. Without fsync, the OS may hold written data in the page cache for seconds or longer before writing it back. A power loss in that window would lose the data.
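To see the cost directly, a small probe can time a write followed by fsync. This is a rough sketch, not a replacement for pg_test_fsync or fio; the function name, the 8KB payload, and the round count are arbitrary choices.

```python
import os
import statistics
import tempfile
import time

def time_fsync(path: str, payload: bytes = b"x" * 8192, rounds: int = 20) -> float:
    """Return the median seconds taken by one write+fsync of `payload` to `path`."""
    samples = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(rounds):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this file to the device and wait
            samples.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return statistics.median(samples)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        median = time_fsync(os.path.join(d, "probe.dat"))
        print(f"median write+fsync: {median * 1e6:.0f} us")
```

The temporary directory may live on tmpfs, where fsync is nearly free. Point the path at a file on the volume that will hold the WAL to get a meaningful number.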

Databases call fsync on every WAL write (or every commit group in group commit mode) to guarantee durability. This is not optional for ACID compliance. The fsync latency becomes a direct component of commit latency.

fsync latency by storage type (measured with pg_test_fsync):

Storage Type                 fdatasync    fsync    open_sync
--------------------------   ---------    -----    ---------
Local NVMe (Intel P5800X)      18 us      22 us      20 us
Local NVMe (Samsung 970)       35 us      42 us      38 us
Local SATA SSD (860 EVO)      180 us     210 us     195 us
EBS gp3 (baseline)           1,800 us   2,100 us   1,950 us
EBS io2 (64K IOPS)             450 us     520 us     480 us
EBS io2 (Block Express)        200 us     240 us     220 us

Impact on PostgreSQL commit throughput (single-threaded):
  Local NVMe:     28,571 commits/sec  (1 / 0.000035)
  EBS gp3:           556 commits/sec  (1 / 0.001800)
  Ratio:           51x difference
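The single-threaded ceiling is the reciprocal of the sync latency. A sketch applying that to the fdatasync column of the table above (the dictionary and function names are illustrative):

```python
# fdatasync latencies in microseconds, copied from the table above.
FDATASYNC_US = {
    "Local NVMe (Intel P5800X)": 18,
    "Local NVMe (Samsung 970)": 35,
    "Local SATA SSD (860 EVO)": 180,
    "EBS gp3 (baseline)": 1_800,
    "EBS io2 (64K IOPS)": 450,
    "EBS io2 (Block Express)": 200,
}

def max_serial_commits_per_sec(sync_latency_us: float) -> float:
    """Upper bound when each commit waits for its own flush, one at a time."""
    return 1_000_000 / sync_latency_us

for name, latency_us in FDATASYNC_US.items():
    print(f"{name:<28} {max_serial_commits_per_sec(latency_us):>9,.0f} commits/sec")
```

This bound applies to one session committing serially. Concurrent sessions can exceed it only by sharing flushes.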

The content platform inserts analytics events in batches. Each batch is a single transaction with 50 rows. On gp3, each batch commit blocks for 1.8ms waiting for fsync. On local NVMe, it blocks for 35 microseconds. The 50 rows complete in the same time regardless. The fsync wait dominates.

Group Commit: Amortizing the fsync

PostgreSQL’s group commit mechanism batches multiple concurrent transactions into a single WAL flush. The commit_delay parameter adds a short wait after the first transaction signals readiness to commit, allowing other transactions to join the batch. A single fsync then covers all transactions in the group.

Group commit effect on throughput (16 concurrent connections, gp3 storage):

commit_delay     Commits/sec    WAL writes/sec    Avg commit latency
------------     -----------    --------------    ------------------
0 (disabled)        4,200          4,200              3.2ms
100 us              8,800          1,100              4.1ms
500 us             11,200            420              5.8ms
2000 us            12,400            180              9.2ms

Analysis:
  - Without group commit: 4,200 fsyncs/sec, each costing 1.8ms
  - With commit_delay=500us: 420 fsyncs/sec, ~27 commits per fsync
  - Throughput improved 2.7x, but latency increased 1.8x
  - Trade-off: throughput for latency
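The ratios in the analysis can be derived from the table row by row. A sketch, with the rows copied from the table and a hypothetical helper name:

```python
# (commit_delay_us, commits_per_sec, wal_flushes_per_sec, avg_commit_latency_ms)
ROWS = [
    (0, 4_200, 4_200, 3.2),
    (100, 8_800, 1_100, 4.1),
    (500, 11_200, 420, 5.8),
    (2000, 12_400, 180, 9.2),
]

def summarize(rows):
    """Yield (delay, commits per flush, throughput ratio, latency ratio) vs the first row."""
    _, base_commits, _, base_latency = rows[0]
    for delay, commits, flushes, latency in rows:
        yield delay, commits / flushes, commits / base_commits, latency / base_latency

for delay, per_flush, throughput, latency in summarize(ROWS):
    print(f"commit_delay={delay:>4}us  {per_flush:5.1f} commits/flush  "
          f"{throughput:.2f}x throughput  {latency:.2f}x latency")
```

Reading down the output shows diminishing returns: each increase in commit_delay buys less throughput and costs more latency.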

Group commit helps throughput but increases individual transaction latency. For the content platform, the analytics ingestion pipeline benefits from group commit because it prioritizes throughput. The article-serving queries do not benefit because they are single-transaction reads followed by a single-transaction cache update.

Read-Ahead: Helping Sequential, Hurting Random

The Linux kernel’s read-ahead mechanism detects sequential read patterns and prefetches data before the application requests it. This converts synchronous reads into asynchronous prefetches, hiding I/O latency behind computation.

# Check current read-ahead setting (in 512-byte sectors)
blockdev --getra /dev/nvme0n1
# 256  (= 128KB)

# Sequential workload: read-ahead helps
# (buffered I/O; O_DIRECT reads bypass the page cache and kernel read-ahead)
# fio --name=seq --rw=read --bs=4k --size=1G --direct=0
# Without read-ahead (0):   Sequential read: 52,000 IOPS
# With read-ahead (256):    Sequential read: 93,000 IOPS  (+79%)

# Random workload: read-ahead wastes bandwidth
# fio --name=rand --rw=randread --bs=4k --size=1G --direct=0
# Without read-ahead (0):   Random read: 340,000 IOPS
# With read-ahead (256):    Random read: 310,000 IOPS  (-8.8%)

For database volumes whose reads are primarily random (index lookups, heap fetches), reducing read-ahead can improve performance. PostgreSQL issues its own prefetch requests for bitmap heap scans through effective_io_concurrency, but sequential scans still depend on kernel read-ahead, so measure large scans before and after lowering it.

# Reduce read-ahead for database volumes
blockdev --setra 32 /dev/nvme0n1    # 16KB instead of 128KB

# Keep higher read-ahead for volumes with sequential workloads
blockdev --setra 2048 /dev/nvme1n1  # 1MB for backup/archival volume
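blockdev counts read-ahead in 512-byte sectors, which is easy to misread as kilobytes. A small conversion sketch (the helper name is illustrative):

```python
SECTOR_BYTES = 512  # blockdev --getra / --setra count 512-byte sectors

def readahead_kib(sectors: int) -> float:
    """Convert a blockdev read-ahead value to KiB."""
    return sectors * SECTOR_BYTES / 1024

for sectors in (32, 256, 2048):
    print(f"--setra {sectors:<5} -> {readahead_kib(sectors):g} KiB")
```

The same setting is exposed in KiB at /sys/block/<device>/queue/read_ahead_kb, which avoids the unit conversion.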

Measuring What You Have: The fio Baseline

Before tuning anything, establish what your storage actually delivers. fio (Flexible I/O Tester) is the standard tool. These four tests give you the essential numbers:

# Test 1: Sequential read throughput
fio --name=seq-read --rw=read --bs=1M --size=4G --numjobs=1 \
    --iodepth=32 --direct=1 --ioengine=libaio --runtime=30 --time_based

# Test 2: Sequential write throughput
fio --name=seq-write --rw=write --bs=1M --size=4G --numjobs=1 \
    --iodepth=32 --direct=1 --ioengine=libaio --runtime=30 --time_based

# Test 3: Random read IOPS (database index pattern)
fio --name=rand-read --rw=randread --bs=4k --size=4G --numjobs=1 \
    --iodepth=32 --direct=1 --ioengine=libaio --runtime=30 --time_based

# Test 4: 8KB write IOPS with fsync (WAL pattern: sequential appends)
fio --name=wal-write --rw=write --bs=8k --size=1G --numjobs=1 \
    --iodepth=1 --direct=1 --fsync=1 --ioengine=libaio --runtime=30 --time_based

Test 4 is the most important for database workloads. It writes 8KB blocks (matching PostgreSQL’s default WAL block size) with fsync=1, meaning every write is followed by an fsync. The iodepth is 1 because a single WAL writer serializes fsync calls. This test directly predicts your single-threaded WAL write throughput.
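To keep baselines comparable over time, the results can be captured in machine-readable form by adding --output-format=json to each fio command. The sketch below assumes the JSON has a top-level "jobs" list whose entries carry "read" and "write" objects with "iops" and "bw" (KiB/s) fields; check that against the output of your fio version. The function name and the sample document are made up for illustration.

```python
import json

def summarize_fio(json_text: str) -> list[dict]:
    """Extract per-job IOPS and bandwidth for each direction that did I/O."""
    report = json.loads(json_text)
    out = []
    for job in report.get("jobs", []):
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            if stats.get("iops", 0) > 0:
                out.append({
                    "job": job.get("jobname"),
                    "direction": direction,
                    "iops": stats["iops"],
                    "bw_mib_s": stats.get("bw", 0) / 1024,  # assumed KiB/s
                })
    return out

# Hand-written sample shaped like fio output; the numbers are not measurements.
sample = ('{"jobs": [{"jobname": "wal-write", '
          '"read": {"iops": 0, "bw": 0}, '
          '"write": {"iops": 556.0, "bw": 4448}}]}')
print(summarize_fio(sample))
```

Storing these summaries next to the instance type and volume configuration makes later regressions easy to spot.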

Content platform fio results across storage options:

Test                    Local NVMe      gp3         io2 (64K)    io2 Block Express
----                    ----------      ---         ---------    -----------------
Seq read (MB/s)            3,480        125            1,000           4,000
Seq write (MB/s)           3,200        125            1,000           4,000
Rand read IOPS           340,000      3,000           64,000         256,000
WAL write IOPS            28,571        556            2,222          5,000
(8KB, fsync=1, QD=1)

Price ($/month, 500GB)       $0*       $40             $640           $1,280

* Included with instance, no additional charge

The WAL write test reveals the true ceiling. On gp3, the maximum single-threaded commit rate is 556/sec. Everything built on top of this database inherits that limit. Moving to local NVMe gives 51x headroom. Moving to io2 Block Express gives 9x headroom at 32x the cost.

The I/O Scheduler: Choosing the Right Algorithm

Linux offers multiple I/O schedulers. The choice affects latency distribution and throughput for different workload patterns.

Available I/O schedulers:

none     No reordering. Requests go directly to device.
         Best for NVMe (device has its own scheduler).
         Lowest latency, highest IOPS.

mq-deadline   Deadline-based. Guarantees no request starves beyond a timeout.
              Best for SATA SSDs. Prevents read starvation during heavy writes.

bfq      Budget Fair Queueing. Provides fairness between processes.
         Best for shared environments. Higher CPU overhead.
         Not recommended for high-IOPS workloads.

kyber    Lightweight two-level scheduler with latency targets.
         Balances read and write latency.
         Good for cloud block storage (EBS).

# Check current scheduler
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq

# Change scheduler for NVMe database volume (run as root; does not persist across reboots)
echo none > /sys/block/nvme0n1/queue/scheduler

# Change scheduler for SATA SSD
echo mq-deadline > /sys/block/sda/queue/scheduler

For the content platform’s NVMe database volume, none is correct. The NVMe controller has 4 ARM cores running its own scheduling algorithm. Adding a kernel scheduler on top adds latency without benefit. Measured difference: none delivers 3% lower average latency and 11% lower P99 latency compared to mq-deadline on NVMe.
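On a host with several volumes it helps to audit every device at once. The sketch below parses the bracketed entry in each sysfs scheduler file; the function names are illustrative, and devices whose file cannot be read or has no bracketed entry are skipped.

```python
from pathlib import Path

def active_scheduler(line: str) -> str:
    """Return the bracketed entry from a sysfs scheduler line."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {line!r}")

def report_schedulers(sys_block: Path = Path("/sys/block")) -> dict[str, str]:
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for sched_file in sorted(sys_block.glob("*/queue/scheduler")):
        try:
            # .../<device>/queue/scheduler -> the device name is third from the end
            result[sched_file.parts[-3]] = active_scheduler(sched_file.read_text())
        except (OSError, ValueError):
            continue
    return result

print(active_scheduler("[none] mq-deadline kyber bfq"))
```

Calling report_schedulers() on a Linux host returns a dictionary keyed by device name; on systems without /sys/block it returns an empty dictionary.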

Storage Type Decision Matrix

The content platform has five storage workloads. Each has different requirements:

Workload                Primary metric        Access pattern    Durability    Choice
--------------------    ------------------    --------------    ----------    ------
PostgreSQL WAL          fsync latency         Sequential write  Critical      Local NVMe
PostgreSQL data         Random read IOPS      Random read       Important     Local NVMe
Qdrant vector index     Random read IOPS      Random read       Rebuildable   Local NVMe
Static content (dist/)  Seq read throughput   Sequential read   Replaceable   gp3 (cheap)
Backups                 Seq write throughput  Sequential write  Archival      S3 Standard

The decision hinges on one question: does this workload call fsync in the hot path? If yes, it needs the lowest-latency storage available. If no, network-attached storage is fine because the kernel page cache absorbs most reads and the application tolerates write buffering.

PostgreSQL WAL calls fsync in the hot path. Every commit blocks on fsync. Local NVMe is the only option measured here that keeps the fsync wait under 100 microseconds.

Qdrant’s vector index is memory-mapped. Reads go through the page cache. The storage device matters only for cold starts (loading the index from disk) and index rebuilds. For steady-state reads, the page cache hit rate exceeds 99%. The content platform uses local NVMe for initial load speed, but gp3 would work if cold start time is acceptable.

The Proof: Before and After Storage Migration

The content platform migrated PostgreSQL from gp3 to a local NVMe instance store. The migration required switching to an instance type with local storage (i3en.xlarge instead of m5.xlarge), which changed the cost profile.

Before (gp3, 500GB, 3000 IOPS baseline):

  WAL write latency (fsync):     1.8ms average, 4.2ms P99
  Single-row INSERT latency:     2.4ms average, 5.8ms P99
  Batch INSERT (50 rows):        3.1ms average, 7.2ms P99
  Analytics ingestion rate:      4,200 rows/sec (saturated)
  Article query (index lookup):  1.2ms average, 3.4ms P99

After (local NVMe i3en.xlarge):

  WAL write latency (fsync):     0.035ms average, 0.082ms P99
  Single-row INSERT latency:     0.18ms average, 0.42ms P99
  Batch INSERT (50 rows):        0.31ms average, 0.68ms P99
  Analytics ingestion rate:      28,000 rows/sec (CPU-limited now)
  Article query (index lookup):  0.14ms average, 0.38ms P99

Cost change:
  gp3 volume: $40/month
  i3en.xlarge vs m5.xlarge: +$92/month
  Net increase: $52/month for 6.7x ingestion throughput and 8.6x lower query latency
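The cost and speedup figures follow directly from the before and after listings. A sketch of the arithmetic, with illustrative variable names:

```python
# Figures copied from the before/after listings above.
GP3_VOLUME_USD = 40       # monthly gp3 cost that goes away
INSTANCE_DELTA_USD = 92   # i3en.xlarge minus m5.xlarge, per month

net_increase = INSTANCE_DELTA_USD - GP3_VOLUME_USD
ingest_speedup = 28_000 / 4_200   # rows/sec after vs before
query_speedup = 1.2 / 0.14        # average index lookup, ms before vs after

print(f"net increase: ${net_increase}/month")
print(f"ingestion: {ingest_speedup:.1f}x, query latency: {query_speedup:.1f}x lower")
```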

The ingestion pipeline went from storage-bound to CPU-bound. The ceiling moved from the disk to the processor. Query latency dropped because index page reads that miss the cache now complete in microseconds instead of milliseconds. The page cache hit rate itself depends on memory and working-set size, not on the device; what changed is the cost of each miss.

The Trade-off: Durability vs Performance

Local NVMe instance stores are ephemeral. When the EC2 instance stops, hibernates, or terminates, or when the underlying disk fails, the data is lost; only a reboot preserves it. This requires architectural changes:

  1. Streaming replication. A synchronous standby on a separate instance (also with local NVMe) receives every WAL record before the primary confirms the commit. Data survives single-instance failure.

  2. Continuous archiving. WAL segments ship to S3 every 60 seconds via archive_command. Point-in-time recovery is possible to within the last archived segment.

  3. Automated snapshots. A cron job runs pg_basebackup to S3 every 6 hours. Full recovery takes 12 minutes (base backup restore + WAL replay).

The durability guarantee changes from “data survives disk failure” (EBS replicates within its Availability Zone) to “data survives instance failure via replication.” This is a weaker guarantee that requires more operational machinery. For the content platform, the 8.6x latency improvement justifies the complexity. For a system with less performance pressure, EBS with provisioned IOPS (io2) offers a middle ground: lower latency than gp3, higher durability than instance store.

The storage device under your database is the performance ceiling. Everything above it can only get slower, never faster. Measure it with fio. Know the fsync cost. Make the storage choice deliberately, because it is the one decision you cannot optimize around later.