Skip to main content
fast by design

Storage Selection for Database Workloads

13 min read Chapter 84 of 90

Storage Selection for Database Workloads

The main chapter showed that moving PostgreSQL from gp3 to local NVMe dropped single-row INSERT latency from 2.4ms to 0.18ms. This section covers the implementation: separating WAL onto its own disk, placing tablespaces by access pattern, configuring cloud storage volumes for specific latency targets, and measuring every change to verify it delivers the expected improvement.

WAL on a Separate Disk

PostgreSQL’s WAL writer and the query executor compete for the same disk when they share a volume. The WAL writer needs sequential write bandwidth with minimal fsync latency. The query executor needs random read IOPS for index lookups. On a single volume, these two patterns interfere.

The WAL writer holds the WAL insertion lock while writing. If the underlying disk is busy servicing a random read for a query, the WAL write stalls. Every transaction waiting to commit stalls behind it. A queue forms. Under load, this mutual interference adds 200-400 microseconds to every commit.

Shared volume baseline (PostgreSQL 16, local NVMe, 16 concurrent clients):

pgbench -c 16 -j 4 -T 300 -P 5

  tps = 42,800
  avg latency = 0.37ms
  P99 latency = 1.24ms

  WAL write latency (pg_stat_wal):
    avg = 0.041ms
    max = 0.892ms
    wal_sync_time = 14.2s over 300s run

  iostat (nvme0n1, 1-second samples):
    avg read IOPS  = 128,000
    avg write IOPS = 38,000
    await (avg)    = 0.082ms
    await (max)    = 1.240ms

Separating WAL onto its own NVMe device eliminates the interference:

# Create WAL directory on dedicated NVMe
mkfs.xfs /dev/nvme1n1
mount -o noatime,discard /dev/nvme1n1 /mnt/wal

# Move WAL to separate volume (requires restart)
pg_ctl stop -D /var/lib/postgresql/16/main
mv /var/lib/postgresql/16/main/pg_wal /mnt/wal/pg_wal
ln -s /mnt/wal/pg_wal /var/lib/postgresql/16/main/pg_wal
pg_ctl start -D /var/lib/postgresql/16/main
Separate WAL volume (same test, WAL on nvme1n1, data on nvme0n1):

pgbench -c 16 -j 4 -T 300 -P 5

  tps = 58,200          (+36%)
  avg latency = 0.27ms  (-27%)
  P99 latency = 0.68ms  (-45%)

  WAL write latency (pg_stat_wal):
    avg = 0.028ms       (-32%)
    max = 0.184ms       (-79%)
    wal_sync_time = 8.4s over 300s run  (-41%)

  iostat (nvme1n1, WAL volume):
    avg write IOPS = 38,000
    await (avg)    = 0.026ms
    await (max)    = 0.142ms
    r/s            = 0       (no reads, WAL is write-only in normal operation)

The WAL volume shows zero read IOPS. WAL is append-only during normal operation. Reads from WAL happen only during crash recovery and replication. A dedicated WAL volume processes only sequential writes with fsync, the exact workload that NVMe excels at.

The P99 latency dropped from 1.24ms to 0.68ms. The improvement comes from eliminating the worst case: a WAL fsync waiting behind a burst of random data reads. With separate volumes, the WAL fsync never waits behind anything.

XFS vs ext4 for WAL

The WAL volume filesystem matters because the filesystem journal adds its own fsync overhead on top of the database’s fsync.

WAL volume filesystem comparison (same NVMe, pgbench -c 16):

Filesystem    Mount options                     TPS      WAL fsync avg
----------    --------------------------------  ------   -------------
XFS           noatime,discard                   58,200     0.028ms
XFS           noatime,discard,nobarrier         61,400     0.022ms  *unsafe*
ext4          noatime,discard                   54,800     0.034ms
ext4          noatime,discard,data=writeback    57,100     0.029ms
ext4          noatime,discard,nobarrier         59,800     0.024ms  *unsafe*

*nobarrier disables filesystem write barriers.
 This is unsafe: a power loss can corrupt the filesystem journal.
 Only use with battery-backed write cache (BBWC) or when the
 database's own WAL provides sufficient recovery guarantees.

XFS with default mount options outperforms ext4 by 6.2%. The difference comes from XFS’s metadata allocation strategy. XFS uses delayed allocation and extent-based addressing, requiring fewer metadata I/O operations per data write. For a WAL volume with purely sequential appends, this advantage is small but consistent.

Tablespace Placement

PostgreSQL tablespaces allow placing individual tables and indexes on specific storage volumes. The strategy: put hot, frequently-accessed data on fast storage. Put cold, rarely-accessed data on cheap storage.

The content platform’s PostgreSQL database has clear hot and cold zones:

Table/Index               Size     Reads/sec   Write/sec   Temperature
------------------------  -------  ---------   ---------   -----------
articles (table)          12 GB      4,200       120        Hot
articles_pkey (index)      1.2 GB    4,200         0        Hot
articles_cat_pub (index)   800 MB    3,100         0        Hot
analytics_events (table)  48 GB        20       4,200       Write-hot
analytics_events_ts (idx)  6 GB        20       4,200       Write-hot
analytics_daily (table)    2 GB       800         24        Warm
subscriptions (table)      50 MB       10          2        Cold
old_articles (table)      28 GB         3          0        Cold
-- Create tablespaces on different volumes
CREATE TABLESPACE fast_data LOCATION '/mnt/nvme-data/pg_tblspc';
CREATE TABLESPACE fast_wal  LOCATION '/mnt/nvme-wal/pg_tblspc';
CREATE TABLESPACE bulk_data LOCATION '/mnt/gp3/pg_tblspc';

-- Move hot tables and indexes to fast NVMe
ALTER TABLE articles SET TABLESPACE fast_data;
ALTER INDEX articles_pkey SET TABLESPACE fast_data;
ALTER INDEX articles_cat_pub SET TABLESPACE fast_data;

-- Keep write-heavy analytics on fast storage (needs fsync performance)
ALTER TABLE analytics_events SET TABLESPACE fast_data;
ALTER INDEX analytics_events_ts SET TABLESPACE fast_data;

-- Move cold data to cheap gp3 storage
ALTER TABLE old_articles SET TABLESPACE bulk_data;
ALTER TABLE subscriptions SET TABLESPACE bulk_data;

Moving a table to a different tablespace requires an ACCESS EXCLUSIVE lock. The table is inaccessible during the move. For the 28GB old_articles table on NVMe, the move takes about 15 seconds. Schedule this during low-traffic windows.

Measured impact of tablespace placement:

Before (everything on one NVMe volume):
  Article query avg latency:     0.14ms
  Analytics INSERT avg latency:  0.18ms

After (hot on NVMe, cold on gp3):
  Article query avg latency:     0.14ms  (unchanged, still on NVMe)
  Analytics INSERT avg latency:  0.18ms  (unchanged, still on NVMe)
  Cold query (old_articles):     2.8ms   (was 0.22ms, now on gp3)

  NVMe volume space freed:       28 GB
  gp3 cost for old_articles:     $2.24/month (28GB * $0.08/GB)

  Trade-off: 12.7x slower cold queries, $2.24/month savings, 28GB NVMe freed

The trade-off is acceptable because cold queries on old_articles happen 3 times per day. The freed NVMe space extends the runway before the next storage upgrade.

EBS Volume Types: A Cost-Performance Map

AWS EBS offers five volume types. Each occupies a different point on the cost-performance curve:

EBS volume type comparison (500GB, us-east-1, May 2026):

Type        Baseline    Max       Baseline    Max       fsync     $/month
            IOPS        IOPS      Throughput  Tput      latency
--------    --------    ------    ----------  ------    -------   -------
gp3         3,000       16,000    125 MB/s    1,000     1.8ms      $40
gp2         1,500*      3,000*    128 MB/s    250       2.1ms      $50
io1         --          64,000    --          1,000     0.5ms     $665
io2         --          64,000    --          1,000     0.45ms    $665
io2 BE**    --         256,000    --          4,000     0.2ms    $1,280
st1         --            500     --            500     N/A***     $23
sc1         --            250     --            250     N/A***     $13

* gp2 IOPS scale with volume size: 3 IOPS/GB, burst to 3,000
** Block Express, requires Nitro-based R5b/R6i instances
*** HDD types not suitable for fsync-dependent workloads

gp3 Tuning

gp3 is the default choice and the wrong choice for database WAL. But it works for read-heavy workloads where the page cache absorbs most reads and the remaining I/O is sporadic.

gp3 allows independent IOPS and throughput provisioning above the baseline:

# Provision additional IOPS for gp3 (up to 16,000)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 \
    --iops 6000 --throughput 250

# Cost: base $40 + ($0.005 * 3000 additional IOPS) + ($0.04 * 125 additional MB/s)
#      = $40 + $15 + $5 = $60/month

The provisioned IOPS on gp3 do not change fsync latency. They change the maximum sustained IOPS rate. Each individual fsync still takes 1.8ms because the latency is dominated by the network round trip to the EBS backend. Provisioning more IOPS allows more of those 1.8ms operations to happen in parallel.

gp3 IOPS provisioning effect on PostgreSQL (pgbench -c 16):

Provisioned IOPS    Measured TPS    Avg Commit Latency
----------------    ------------    ------------------
3,000 (baseline)       4,200             3.2ms
6,000                  7,800             2.8ms
10,000                11,400             2.4ms
16,000                14,200             2.1ms

Note: commit latency floor is ~1.8ms regardless of IOPS.
TPS scales because group commit amortizes fsync across
more concurrent transactions at higher IOPS.

io2 Block Express for Low-Latency Requirements

io2 Block Express volumes provide sub-millisecond fsync latency with high IOPS capacity. They require Nitro-based instances (R5b, R6i, or newer) that support the EBS Block Express architecture.

io2 Block Express configuration for the content platform:

Volume: 500GB, 64,000 IOPS provisioned
Instance: r6i.xlarge (4 vCPU, 32GB RAM)
Cost: $1,280/month (volume) + $0 (IOPS included up to 64K for io2)

Measured with pgbench -c 16:
  TPS:              38,400
  Avg commit lat:   0.38ms
  P99 commit lat:   0.92ms
  fsync avg:        0.20ms

Comparison:
  vs gp3:    9.1x TPS, 8.4x lower commit latency, 32x cost
  vs NVMe:   0.66x TPS, 2.1x higher commit latency, 14x cost

io2 Block Express sits between gp3 and local NVMe. It costs more than both but provides EBS durability (3-AZ replication) with acceptable latency. This is the right choice when the operational complexity of local instance stores (streaming replication, WAL archiving, automated snapshots) outweighs the cost premium.

Cloud Storage Latency Anatomy

EBS latency has three components that do not exist in local storage:

EBS I/O path (gp3):

Application
  | write() syscall            ~1 us
  v
Kernel block layer
  | NVMe-over-Fabrics submit   ~5 us
  v
EC2 Nitro Card (hardware)
  | Encrypt (AES-256-XTS)      ~3 us
  | Network packet assembly     ~2 us
  v
Network (to EBS backend)
  | Round trip                  ~200-800 us
  v
EBS Storage Server
  | Flash write                 ~30 us
  | Replication (2 AZs)        ~500-1500 us
  | Commit ack                  ~5 us
  v
Network (return)
  | Round trip                  ~200-800 us
  v
Nitro Card
  | Completion interrupt        ~3 us
  v
Kernel
  | Completion processing       ~2 us
  v
Application
  | fsync returns

Total: 950-3120 us (avg ~1800 us for gp3)

The dominant cost is network round trip plus cross-AZ replication. The actual flash write on the EBS server takes 30 microseconds, nearly identical to a local NVMe. The remaining 1,770 microseconds is infrastructure overhead for durability.

EBS Burst Credits

gp3 volumes have a baseline IOPS of 3,000. Unlike gp2, gp3 does not have a burst credit mechanism for IOPS. The 3,000 IOPS is a hard floor and ceiling unless you provision additional IOPS.

However, gp3 does have burst behavior for throughput. The baseline is 125 MB/s, and short bursts can reach 250 MB/s. This matters for sequential scans during backups and VACUUM.

gp3 throughput behavior during pg_dump:

Time (seconds)    Throughput    Note
--------------    ----------   ----
0-30              248 MB/s     Burst (above baseline)
30-60             246 MB/s     Burst continuing
60-90             125 MB/s     Burst exhausted, baseline
90+               125 MB/s     Sustained baseline

pg_dump of 12GB articles table:
  With burst:    ~50 seconds (first 7.5GB at 248, rest at 125)
  At baseline:   ~96 seconds (12GB / 125 MB/s)
  Actual:        ~62 seconds (mixed)

Monitoring EBS Performance

CloudWatch metrics reveal when EBS is the bottleneck:

Key CloudWatch metrics for EBS performance diagnosis:

VolumeReadOps / VolumeWriteOps
  IOPS consumed. Compare to provisioned IOPS.
  If consistently at provisioned limit: provision more or upgrade volume type.

VolumeQueueLength
  Average number of I/O operations waiting.
  Target: < 1 for gp3, < 4 for io2.
  > 4 means the volume is saturated.

VolumeThroughputPercentage (io2 only)
  Percentage of provisioned throughput used.
  > 90% sustained: provision more throughput.

BurstBalance (gp2 only, not gp3)
  Remaining burst credits as percentage.
  If drops to 0: volume throttles to baseline.

VolumeReadLatency / VolumeWriteLatency
  Average per-operation latency.
  gp3 expected: 1-3ms.
  io2 expected: 0.2-1ms.
  > 5ms: network congestion or EBS service degradation.
-- PostgreSQL-side monitoring: check for I/O wait
SELECT
    datname,
    blk_read_time,
    blk_write_time,
    blks_read,
    blks_hit,
    round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 1) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = 'content_platform';

-- If blk_read_time is high relative to query count, storage is the bottleneck
-- If cache_hit_pct < 99%, shared_buffers is undersized or working set exceeds RAM

Filesystem Configuration for Database Volumes

The filesystem between PostgreSQL and the block device adds measurable overhead. The configuration choices:

Mount Options

# Database data volume
mount -o noatime,nodiratime,discard /dev/nvme0n1p1 /mnt/pgdata

# noatime:    Skip updating access time on every read. Without this,
#             every SELECT that reads a page triggers a metadata write.
#             Measured impact: 3-5% IOPS improvement on read-heavy workloads.

# nodiratime: Same as noatime but for directory access times.

# discard:    Enable online TRIM for SSD garbage collection.
#             Alternative: scheduled fstrim via cron (lower runtime overhead).

ext4 Tuning

# Disable ext4 journal for WAL volume (PostgreSQL's WAL provides recovery)
# WARNING: Only safe for dedicated WAL volumes, not data volumes
mkfs.ext4 -O ^has_journal /dev/nvme1n1

# Set stripe width to match RAID or NVMe internal parallelism
mkfs.ext4 -E stride=16,stripe_width=64 /dev/nvme0n1p1

# Measured improvement from journal-less ext4 on WAL volume:
# fsync latency: 34us -> 26us (-24%)
# pgbench TPS: 58,200 -> 62,800 (+7.9%)

XFS Tuning

# XFS with allocation group sizing for parallel writers
mkfs.xfs -d agcount=16 /dev/nvme0n1p1

# Align to NVMe sector size (most modern drives use 4KB)
mkfs.xfs -s size=4096 /dev/nvme0n1p1

# XFS on WAL volume with optimized log
mkfs.xfs -l size=128m,lazy-count=1 /dev/nvme1n1

Putting It Together: The Content Platform Storage Architecture

The final storage layout for the content platform, balancing performance, durability, and cost:

Volume Layout:

/dev/nvme0n1 (local NVMe, 1.9TB, included with i3en.xlarge)
  Partition 1: /mnt/pgdata     (500GB, XFS, noatime,discard)
    - PostgreSQL data directory
    - Tablespace: fast_data (articles, analytics_events, indexes)
    - effective_io_concurrency = 200
    - random_page_cost = 1.1

  Partition 2: /mnt/qdrant     (200GB, XFS, noatime,discard)
    - Qdrant vector database data
    - HNSW index files (memory-mapped)

/dev/nvme1n1 (local NVMe, 1.9TB, second drive on i3en.xlarge)
  Partition 1: /mnt/pgwal      (50GB, XFS, noatime,discard)
    - PostgreSQL WAL (symlinked from pg_wal)
    - Dedicated to sequential writes + fsync
    - No other workload shares this partition

  Partition 2: /mnt/temp       (200GB, XFS, noatime,discard)
    - PostgreSQL temp_tablespaces
    - Sort spill files, hash join overflow
    - temp_file_limit = 10GB per query

/dev/ebs-gp3 (EBS gp3, 500GB, 3000 IOPS baseline)
  /mnt/cold                    (500GB, ext4, noatime)
    - Tablespace: bulk_data (old_articles, subscriptions)
    - PostgreSQL base backups (retained 7 days)
    - Application logs

Durability:
  - Streaming replication to standby (also i3en.xlarge, separate AZ)
  - WAL archiving to S3 every 60 seconds
  - pg_basebackup to S3 every 6 hours
  - Qdrant snapshot to S3 every 12 hours (index is rebuildable)
Performance summary (pgbench -c 16 -j 4 -T 300):

Metric                   gp3 (before)    Optimized layout    Improvement
---------------------    ------------    ----------------    -----------
TPS                          4,200            62,800           15.0x
Avg commit latency           3.2ms            0.22ms           14.5x
P99 commit latency           7.2ms            0.58ms           12.4x
WAL fsync latency            1.8ms            0.026ms          69.2x
Article query P99            3.4ms            0.32ms           10.6x
Analytics INSERT P99         5.8ms            0.48ms           12.1x
Monthly storage cost           $40              $94             2.4x

Cost per 1000 TPS:         $9.52/mo          $1.50/mo          6.4x better

The 15x throughput improvement and 14.5x latency reduction cost $54/month more. The cost per transaction dropped by 6.4x. The storage is no longer the ceiling. CPU at 78% utilization during the benchmark is now the next bottleneck, which is the subject of earlier chapters in this book.

The storage under your database is a multiplier on every other optimization. Query tuning, connection pooling, caching, and indexing all reduce the number of I/O operations. But each remaining operation takes as long as the storage device allows. A 10x reduction in query I/O on gp3 storage (1.8ms fsync) may be slower than the unoptimized query on local NVMe (0.035ms fsync). Fix the storage first. Then optimize the queries. The results compound.