Skip to main content
data systems from the ground up

Single-Leader Replication and the Lag You Can Measure

4 min read Chapter 11 of 36

Single-Leader Replication and the Lag You Can Measure

The Black Box

The logistics platform uses a PostgreSQL primary for writes and two replicas for read queries. The operations team reports that “the replicas are in sync.” What does “in sync” mean? Within 1 byte? Within 1 second? Within 1 minute? Without measuring lag, “in sync” is a feeling, not a fact.

The Mechanism

PostgreSQL streaming replication sends WAL records from the primary to replicas as they are generated. The process involves four stages, each of which can fall behind:

  1. Sent: The primary has sent the WAL record over the network. Network latency determines this delay.
  2. Written: The replica has received the record and written it to its WAL file on disk. Disk I/O speed on the replica determines this delay.
  3. Flushed: The replica has called fsync on the WAL file. The record is durable on the replica.
  4. Replayed: The replica has applied the WAL record to its data pages. The record is visible to queries. CPU and I/O on the replica determine this delay.

Each stage introduces latency. The total lag is the difference between the primary’s current WAL position and the replica’s replayed WAL position.

-- Concept: detailed replication lag monitoring
-- Run on the primary to see each stage of lag per replica

SELECT
    client_addr,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
    pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
    pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS total_lag_bytes
FROM pg_stat_replication;

-- client_addr | send_lag | write_lag | flush_lag | replay_lag | total_lag
-- 10.0.1.12   |     0    |     0     |     0     |   524288   |   524288
-- 10.0.1.13   |     0    |     0     |     0     |  8388608   |  8388608

-- Replica 10.0.1.12: 512KB replay lag. Send, write, flush are caught up.
-- The bottleneck is applying WAL records to data pages (CPU or I/O bound).
-- Replica 10.0.1.13: 8MB replay lag. Same bottleneck, worse.

When the replay lag is the dominant component, the replica’s CPU or disk is the bottleneck. Common causes: heavy compution on the replica (running analytical queries), slower storage than the primary, or long-running transactions on the replica that block WAL replay due to recovery conflicts.

The Observable Consequence

The logistics platform’s warehouse dashboard reads from replica 10.0.1.13. The dashboard shows “last scan” timestamps for packages. With 8MB of replay lag at a WAL generation rate of 2MB/s, the dashboard is approximately 4 seconds behind the primary.

A warehouse worker scans package PKG-40291 at timestamp 14:22:03. The primary records it immediately. The dashboard, reading from the lagged replica, shows PKG-40291’s status as “AWAITING_SCAN” until approximately 14:22:07. During those 4 seconds, the worker sees stale data.

Read-Your-Own-Writes

The solution: after a write, track the LSN of that write, and ensure subsequent reads from that user go to a replica that has replayed past that LSN.

// Concept: read-your-own-writes using LSN tracking
// After writing to the primary, record the WAL position.
// Before reading from a replica, wait until the replica has caught up.

// After INSERT on primary:
String lsn;
try (var stmt = primaryConn.createStatement()) {
    var rs = stmt.executeQuery("SELECT pg_current_wal_lsn()::text");
    rs.next();
    lsn = rs.getString(1);  // e.g., "0/A4000160"
}
// Store lsn in the user's session or a cookie

// Before SELECT on replica:
try (var stmt = replicaConn.createStatement()) {
    // Wait up to 5 seconds for the replica to catch up to the write LSN
    var rs = stmt.executeQuery(
        "SELECT pg_last_wal_replay_lsn() >= '" + lsn + "'::pg_lsn");
    rs.next();
    if (!rs.getBoolean(1)) {
        // Replica has not caught up. Fall back to primary for this read.
        return readFromPrimary(packageId);
    }
}
// Replica is caught up. Read from replica.
return readFromReplica(packageId);

This pattern trades code complexity for consistency. The application must track write LSNs, check replica positions, and fall back to the primary when the replica is behind. For the logistics dashboard, this complexity is justified because stale reads cause operational errors.

The Decision Rule

Measure replication lag continuously using pg_stat_replication. Alert when total lag exceeds your staleness budget. For the logistics dashboard with a 1-second staleness budget, alert when replay_lag_bytes / wal_generation_rate > 1 second.

Implement read-your-own-writes only for workflows where a user writes and immediately reads. Do not implement it globally. The majority of read traffic has no relationship to a recent write and can tolerate seconds of lag. Implementing read-your-own-writes for all queries defeats the purpose of read replicas by routing too many reads to the primary.

If replay lag is consistently high on a replica, the replica’s resources are insufficient. Upgrading the replica’s CPU or storage is cheaper than building application-level workarounds for lag.