Skip to main content
fast by design

WAL Pressure, Commit Frequency, and fsync Costs

10 min read Chapter 57 of 90

WAL Pressure, Commit Frequency, and fsync Costs

Section 1 showed that commit frequency dominates write throughput. This section explains why: the WAL (Write-Ahead Log) architecture, the fsync system call that guarantees durability, and the configuration knobs that trade durability for speed.

WAL Architecture

PostgreSQL guarantees durability through write-ahead logging: every change is written to WAL before the actual table data is modified. The flow:

  1. Backend writes WAL record to wal_buffers (shared memory)
  2. On COMMIT, the WAL writer flushes wal_buffers to disk (fsync)
  3. Only after fsync succeeds does COMMIT return success to client
  4. Background writer later writes dirty heap pages to data files

This guarantees that after a crash, PostgreSQL can replay WAL to reconstruct any committed transaction, even if heap pages were never written to disk.

WAL Buffer Mechanics

SHOW wal_buffers;        -- Default: -1 (auto, typically 64MB at 1GB shared_buffers)
SHOW wal_writer_delay;   -- Default: 200ms
SHOW wal_segment_size;   -- Default: 16MB

The WAL buffer is a circular buffer in shared memory. Backends write WAL records sequentially. When a backend commits:

  • If synchronous_commit = on (default): the backend calls fsync() on the WAL file, waits for disk confirmation
  • If synchronous_commit = off: the backend returns immediately, WAL is flushed by the WAL writer within wal_writer_delay (200ms)

Measuring fsync Cost

The raw cost of fsync varies by storage:

-- PostgreSQL's built-in fsync test
-- Run from command line, not SQL
-- pg_test_fsync (ships with PostgreSQL)

Typical results:

Storage Type     | fsync latency (avg) | fsync latency (P99) | IOPS (fsync)
NVMe SSD         |         70 us       |         180 us      |    14,285
SATA SSD         |        250 us       |         800 us      |     4,000
SAS 15K RPM      |       2.1 ms        |        4.8 ms       |       476
SATA 7200 RPM    |       4.2 ms        |        12 ms        |       238
EBS gp3 (AWS)    |        500 us       |        2.1 ms       |     3,000

These numbers directly cap your maximum commit rate:

Max commits/sec = 1 / fsync_latency_avg
NVMe:      14,285 commits/sec
SATA SSD:   4,000 commits/sec
Spinning:     238 commits/sec

If you INSERT with autocommit, your maximum throughput on spinning disk is 238 rows/sec. No application optimization can exceed this hardware limit.

Commit Frequency Benchmark

Measuring throughput as a function of commit frequency (10,000 rows, NVMe storage):

// Benchmark harness: vary commit interval
public void benchmarkCommitFrequency(Connection conn, List<Article> articles,
                                      int commitEveryN) throws SQLException {
    conn.setAutoCommit(false);
    String sql = """
        INSERT INTO articles (external_id, title, body, category_id, source_id,
                             published_at, word_count)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        """;
    int count = 0;
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        for (Article a : articles) {
            ps.setString(1, a.externalId());
            ps.setString(2, a.title());
            ps.setString(3, a.body());
            ps.setInt(4, a.categoryId());
            ps.setInt(5, a.sourceId());
            ps.setObject(6, a.publishedAt());
            ps.setInt(7, a.wordCount());
            ps.executeUpdate();
            count++;
            if (count % commitEveryN == 0) {
                conn.commit();
            }
        }
        conn.commit();
    }
}

Results:

Commit Every N | Commits | fsync Calls | Time (sec) | Rows/sec | WAL MB
1              | 10,000  |      10,000 |     14.2   |      704 |    312
10             |  1,000  |       1,000 |      2.8   |    3,571 |    305
50             |    200  |         200 |      1.6   |    6,250 |    299
100            |    100  |         100 |      1.4   |    7,143 |    298
500            |     20  |          20 |      1.1   |    9,091 |    296
1000           |     10  |          10 |      1.0   |   10,000 |    295
5000           |      2  |           2 |      0.93  |   10,753 |    294
10000          |      1  |           1 |      0.91  |   10,989 |    293

The relationship is clear: throughput scales inversely with fsync count until the bottleneck shifts to heap page writes (around 500-1000 rows per commit on this hardware).

synchronous_commit Levels

PostgreSQL offers five durability levels:

-- Most durable (default): wait for WAL fsync
SET synchronous_commit = on;

-- Wait for WAL write but not fsync (OS page cache)
SET synchronous_commit = off;

-- Remote replication levels (not covered here)
-- remote_write, remote_apply, on (with synchronous standby)

The critical choice for single-server performance:

-- Per-transaction override
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO view_counts (article_id, count_delta) VALUES (42, 1);
COMMIT; -- returns immediately, WAL flushed within 200ms

Benchmark with synchronous_commit = off (10,000 rows, autocommit):

synchronous_commit = on:   704 rows/sec (14.2s)
synchronous_commit = off: 8,333 rows/sec (1.2s)
Improvement:              11.8x

With synchronous_commit = off, COMMIT returns without waiting for fsync. The WAL writer flushes within wal_writer_delay (200ms default). In a crash, up to 200ms of committed transactions can be lost.

When to Use synchronous_commit = off

The content platform uses this setting for view count increments:

// FAST: view counts are reconstructable from application logs
public void incrementViewCount(Connection conn, long articleId) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
        stmt.execute("SET LOCAL synchronous_commit = off");
    }
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO view_events (article_id, viewed_at) VALUES (?, now())")) {
        ps.setLong(1, articleId);
        ps.executeUpdate();
    }
}

Appropriate for:

  • Counters and analytics events (reconstructable from logs)
  • Cache invalidation markers (re-derivable)
  • Session activity timestamps (approximate is acceptable)

Never appropriate for:

  • Financial transactions
  • User-facing data modifications
  • Operations that cannot be replayed

WAL Generation Rate Monitoring

Track WAL pressure in real time:

-- Current WAL position and rate
SELECT
    pg_current_wal_lsn() AS current_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / (1024*1024*1024.0) AS total_wal_gb;

-- WAL generated in last interval (run twice, compute delta)
CREATE OR REPLACE FUNCTION wal_rate_mbps() RETURNS TABLE(wal_mb_per_sec numeric) AS $$
DECLARE
    lsn1 pg_lsn;
    lsn2 pg_lsn;
BEGIN
    lsn1 := pg_current_wal_lsn();
    PERFORM pg_sleep(5);
    lsn2 := pg_current_wal_lsn();
    RETURN QUERY SELECT round(pg_wal_lsn_diff(lsn2, lsn1) / 5.0 / (1024*1024), 2);
END;
$$ LANGUAGE plpgsql;

SELECT * FROM wal_rate_mbps();

WAL rate during different workloads on the content platform:

Workload                          | WAL Rate (MB/sec)
Idle                              |        0.1
Normal read traffic               |        0.3
Article ingestion (10k/hr)        |        1.2
Bulk reindex (100k articles)      |       28.4
View count updates (peak)         |        4.7

Checkpoint Tuning for Bulk Loads

Checkpoints flush all dirty pages to disk. During bulk loads, frequent checkpoints cause I/O storms:

SHOW checkpoint_timeout;          -- Default: 5min
SHOW max_wal_size;                -- Default: 1GB
SHOW checkpoint_completion_target; -- Default: 0.9

When WAL accumulates to max_wal_size, a checkpoint is forced. During bulk ingestion generating 28 MB/sec of WAL, the default 1GB limit triggers a checkpoint every 36 seconds. Each checkpoint flushes hundreds of dirty pages, causing I/O spikes.

For bulk load scenarios:

-- Temporary settings during bulk ingestion
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();

Benchmark: 100,000 article bulk load:

max_wal_size = 1GB (default):
  Time: 42 seconds
  Checkpoints: 7
  I/O spikes: 7 (each causing 200ms latency increase)

max_wal_size = 4GB:
  Time: 31 seconds
  Checkpoints: 1 (at end)
  I/O spikes: 1
  Improvement: 26%

The reduced checkpoint frequency means less I/O contention between the WAL writer and the checkpointer. Read queries running concurrently experience lower latency variance.

wal_buffers Sizing

The default wal_buffers auto-sizing (1/32 of shared_buffers) is adequate for most workloads. Under heavy write load, undersized WAL buffers force premature flushes:

-- Check if WAL buffer contention exists
SELECT * FROM pg_stat_wal;
wal_records | wal_fpi | wal_bytes    | wal_buffers_full | wal_write | wal_sync
  2847291   |  18423  | 4271948288   |              0   |    28471  |   28471

If wal_buffers_full > 0, backends are waiting for WAL buffer space. Increase wal_buffers:

ALTER SYSTEM SET wal_buffers = '64MB';

On the content platform with 16GB shared_buffers, the auto-calculated value is 512MB (capped at a reasonable value). The wal_buffers_full counter stays at zero even during bulk loads.

Full-Page Writes and Their Cost

After a checkpoint, the first modification to any page writes the entire 8KB page to WAL (a “full-page image” or FPI). This protects against torn page writes. The cost:

-- Monitor FPI ratio
SELECT
    wal_records,
    wal_fpi,
    round(100.0 * wal_fpi / wal_records, 2) AS fpi_pct,
    pg_size_pretty(wal_bytes) AS total_wal,
    pg_size_pretty(wal_fpi_bytes) AS fpi_wal
FROM pg_stat_wal;

During article ingestion:

wal_records | wal_fpi | fpi_pct | total_wal | fpi_wal
   284729   |  18423  |   6.47  |   4.1 GB  |  287 MB

6.47% of WAL records are full-page images, but they account for a disproportionate share of WAL volume (287 MB out of 4.1 GB = 7%). This is because FPI records are 8KB each while regular records average 100-500 bytes.

Reducing FPI impact:

-- Increase checkpoint distance to reduce FPI frequency
ALTER SYSTEM SET max_wal_size = '8GB';

-- Or for non-critical data, disable (DANGEROUS: risk of corruption on power loss)
-- ALTER SYSTEM SET full_page_writes = off;  -- NEVER in production

The Content Platform WAL Configuration

Final tuned configuration for 10k articles/hour ingestion with concurrent read traffic:

-- postgresql.conf tuning for content platform
-- Write path
wal_buffers = '64MB'
wal_writer_delay = '200ms'
synchronous_commit = on               -- default for transactions
max_wal_size = '4GB'                  -- reduce checkpoint frequency
checkpoint_timeout = '10min'
checkpoint_completion_target = 0.9

-- Per-session override for analytics writes
-- SET LOCAL synchronous_commit = off; (in application code)

Monitoring query to validate WAL pressure stays within bounds:

-- Run periodically, alert if wal_rate exceeds threshold
WITH wal_stats AS (
    SELECT
        pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS current_bytes,
        extract(epoch FROM now()) AS current_time
)
SELECT
    pg_size_pretty(
        (SELECT wal_bytes FROM pg_stat_wal)
    ) AS total_wal_since_reset,
    (SELECT wal_buffers_full FROM pg_stat_wal) AS buffer_pressure,
    (SELECT wal_sync FROM pg_stat_wal) AS total_fsyncs,
    pg_size_pretty(pg_wal_lsn_diff(
        pg_current_wal_lsn(),
        (SELECT redo_lsn FROM pg_control_checkpoint())
    )) AS wal_since_last_checkpoint;

Expected output during normal operation:

 total_wal_since_reset | buffer_pressure | total_fsyncs | wal_since_last_checkpoint
-----------------------+-----------------+--------------+---------------------------
 12 GB                 |               0 |        42871 | 847 MB

Zero buffer pressure, checkpoint distance under max_wal_size, and fsync count correlating with commit count (not row count) confirms the pipeline is operating within design parameters.

Commit Group Optimization

PostgreSQL automatically groups commits from concurrent backends. When multiple backends commit simultaneously, the first to request fsync performs it for all pending commits:

-- Enable commit delay to increase grouping probability
ALTER SYSTEM SET commit_delay = 10;         -- microseconds to wait
ALTER SYSTEM SET commit_siblings = 5;       -- minimum concurrent transactions
SELECT pg_reload_conf();

When at least commit_siblings transactions are active, a committing backend waits commit_delay microseconds before fsyncing, hoping other backends will commit in that window and share the fsync.

Benchmark with 8 concurrent writers (each inserting 1000 rows with commit-per-row):

commit_delay = 0 (default):
  Total fsyncs: 8000
  Combined throughput: 5,128 rows/sec

commit_delay = 10us, commit_siblings = 5:
  Total fsyncs: 2,847
  Combined throughput: 7,692 rows/sec
  Improvement: 50%

The commit delay grouped approximately 3 commits per fsync on average. This optimization helps when many connections write small transactions concurrently, exactly the pattern for view count updates on the content platform.

Summary: The fsync Budget

Every write workload has an fsync budget determined by storage hardware:

Available fsyncs/sec (NVMe):  14,285
Minus: checkpoint fsyncs:        -10  (every 10 min)
Minus: WAL archiver fsyncs:      -50  (if archiving enabled)
Available for commits:        14,225

Content platform allocation:
  Article ingestion:     20 commits/hour  (negligible)
  View count writes:  3,600 commits/hour  (with sync_commit=off: 0)
  Analytics batches:    120 commits/hour  (negligible)
  Read transactions:      0 fsyncs       (read-only, no WAL)
  Total:              3,740 fsyncs/hour = 1.04 fsyncs/sec

Utilization: 1.04 / 14,225 = 0.007%

The content platform uses less than 0.01% of available fsync capacity. This extreme headroom results from three design decisions: batch commits for ingestion, synchronous_commit = off for view counts, and read-only transactions for content serving. The remaining capacity handles traffic spikes of 100x without degradation.