Skip to main content
data systems from the ground up

PostgreSQL Checkpoints and the I/O Storm You Did Not Expect

5 min read Chapter 8 of 36

PostgreSQL Checkpoints and the I/O Storm You Did Not Expect

The Black Box

The logistics platform’s package event ingestion runs at a steady 2,000 writes per minute. Query latency is stable at 2ms. Every 5 minutes, query latency spikes to 50ms for 30 seconds, then returns to normal. The application code has not changed. No queries are running differently. The cause is invisible to application-level monitoring.

The Mechanism

PostgreSQL’s checkpoint process writes every dirty page from the shared buffer pool to disk. A “dirty page” is a data page that was modified in memory but not yet written to the underlying data file.

During normal operation, modifications happen in the buffer pool. The WAL guarantees recoverability. The actual data files on disk may be many seconds behind the buffer pool. The checkpoint brings them in sync.

Two triggers cause a checkpoint:

  1. Time-based: checkpoint_timeout (default: 5 minutes) has elapsed since the last checkpoint completed.
  2. WAL-volume-based: WAL usage since the last checkpoint exceeds max_wal_size (default: 1GB).

When the checkpoint triggers, PostgreSQL must write every dirty page to disk. If the buffer pool is 8GB and 40% of it is dirty, the checkpoint must write 3.2GB to disk.

-- Concept: monitoring checkpoint behavior
-- pg_stat_bgwriter reveals checkpoint frequency and buffer writes

SELECT
    checkpoints_timed,       -- Checkpoints triggered by timeout
    checkpoints_req,         -- Checkpoints triggered by WAL volume
    buffers_checkpoint,      -- Pages written during checkpoints
    buffers_clean,           -- Pages written by background writer
    buffers_backend,         -- Pages written by backends (bad: means bgwriter is behind)
    checkpoint_write_time,   -- Total ms spent writing during checkpoints
    checkpoint_sync_time     -- Total ms spent in fsync during checkpoints
FROM pg_stat_bgwriter;

-- Healthy system:
-- checkpoints_timed: 1440 (one every 5 min over 5 days)
-- checkpoints_req: 3     (rarely triggered by WAL volume)
-- buffers_backend: 128   (backends almost never write directly)

-- Unhealthy system:
-- checkpoints_req: 890   (WAL volume triggers dominate)
-- buffers_backend: 45000 (backends forced to write dirty pages themselves)

When buffers_backend is high, backends (the processes handling your queries) are being forced to write dirty pages to disk because the background writer and checkpoint process cannot keep up. This is the source of the latency spike: your SELECT query has to wait while the backend evicts a dirty page to make room in the buffer pool.

Spread Checkpoints

The checkpoint_completion_target parameter (default: 0.9) controls how aggressively PostgreSQL writes dirty pages during a checkpoint. A value of 0.9 means PostgreSQL spreads the writes over 90% of the checkpoint interval (4.5 minutes out of a 5-minute interval). This converts a burst of I/O into a steady trickle.

Without spread (completion_target = 0.0): all dirty pages are written as fast as possible, saturating disk I/O for seconds and spiking query latency.

With spread (completion_target = 0.9): dirty pages are written at a metered rate, and the I/O impact on query latency is minimal.

The Observable Consequence

The logistics platform configuration before tuning:

shared_buffers = 256MB        # Small buffer pool
checkpoint_timeout = 5min     # Default
max_wal_size = 1GB            # Default
checkpoint_completion_target = 0.5  # Writes in 2.5 minutes

With this configuration, at 2,000 events/minute plus 3 indexes per table, the WAL generates approximately 1.7MB/second. After 5 minutes, approximately 500MB of WAL has accumulated. The checkpoint writes roughly 200MB of dirty pages in 2.5 minutes (80MB/min), which is noticeable but not saturating on NVMe.

The problem occurs during batch imports. When the logistics platform imports 500,000 historical events, the WAL rate spikes to 50MB/second. max_wal_size (1GB) is hit in 20 seconds, triggering a forced checkpoint. The forced checkpoint writes hundreds of megabytes of dirty pages as fast as possible, saturating I/O and spiking query latency for online users.

The Code

-- Concept: checkpoint configuration for write-heavy workloads
-- Tuned for the logistics platform

-- Increase buffer pool to reduce backend writes
ALTER SYSTEM SET shared_buffers = '4GB';

-- Increase WAL size to avoid forced checkpoints during bulk loads
ALTER SYSTEM SET max_wal_size = '8GB';

-- Keep checkpoint interval reasonable
ALTER SYSTEM SET checkpoint_timeout = '10min';

-- Spread checkpoint writes over 90% of the interval
ALTER SYSTEM SET checkpoint_completion_target = 0.9;

-- Reload configuration
SELECT pg_reload_conf();

-- Verify: monitor the next checkpoint
SELECT pg_current_wal_lsn();
-- Wait for a checkpoint, then check pg_stat_bgwriter

The Decision Rule

If checkpoints_req in pg_stat_bgwriter is growing faster than checkpoints_timed, your WAL volume is exceeding max_wal_size between time-based checkpoints. Increase max_wal_size until time-based checkpoints dominate.

If buffers_backend is non-trivial (more than 1% of total buffer writes), either shared_buffers is too small or the background writer is not keeping pace. Increase shared_buffers and verify that bgwriter_lru_maxpages is sufficient.

If query latency spikes correlate with checkpoint timing (every checkpoint_timeout seconds), increase checkpoint_completion_target to 0.9 to spread the writes. If 0.9 is already set and latency still spikes, the disk I/O subsystem is saturated. The fix is faster storage, not configuration tuning.