Replication: Keeping Copies in Sync Without Losing Your Mind

Chapters 1 through 3 built a storage system on a single machine: an append-only log, indexes for fast reads, and a WAL for durability. If that machine’s disk fails, the data is gone. If the machine goes down for maintenance, the service is unavailable.

Replication keeps copies of the data on multiple machines. It solves two problems: durability (survive hardware failure) and availability (serve reads while the primary is down). It creates one new problem: keeping the copies consistent.

The mechanism is the WAL from Chapter 3. PostgreSQL ships WAL records to replicas. Kafka replicates partition log segments to follower brokers. The append-only log is not just a crash recovery mechanism. It is the replication protocol.

Single-Leader Replication

The dominant replication topology is single-leader (also called primary-replica or master-slave). One node accepts writes. It ships the writes to one or more replicas. Replicas serve read queries.

PostgreSQL Streaming Replication

PostgreSQL’s streaming replication ships WAL records from the primary to replicas in near-real-time:

The primary writes a WAL record (as described in Chapter 3).
The WAL sender process streams the record to connected replicas over a TCP connection.
Each replica’s WAL receiver process writes the record to its own WAL.
The replica’s startup process applies the WAL record to its data pages.

-- Concept: monitoring replication lag on the primary
-- This query shows how far behind each replica is

SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS replay_lag_mb
FROM pg_stat_replication;

-- client_addr  | state     | replay_lag_bytes | replay_lag_mb
-- 10.0.1.12    | streaming | 1048576          | 1.0
-- 10.0.1.13    | streaming | 8388608          | 8.0

-- Replica 10.0.1.13 is 8MB behind the primary.
-- At a WAL generation rate of 2MB/s, that is 4 seconds of lag.

Kafka ISR (In-Sync Replicas)

Kafka replicates at the partition level. Each partition has a leader and zero or more followers. The ISR (In-Sync Replicas) set contains followers that are caught up to the leader within replica.lag.time.max.ms (default: 30 seconds).

# Concept: Kafka ISR and replication status
kafka-topics.sh --describe --topic package-events

# Topic: package-events  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# Topic: package-events  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3
# Topic: package-events  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2

# Partition 1 has Replicas: 2,3,1 but Isr: 2,3
# Broker 1 has fallen out of the ISR for partition 1.
# It is more than 30 seconds behind the leader.
# If acks=all, producers writing to partition 1 will only wait
# for brokers 2 and 3 to acknowledge.

Replication Lag

Replication lag is the delay between a write on the leader and that write being available on a replica. It is not a bug. It is a fundamental property of asynchronous replication.

The logistics platform’s warehouse dashboard reads from a PostgreSQL replica to avoid loading the primary with dashboard queries. When a package is scanned at the warehouse, the event is written to the primary. The dashboard reads from the replica.

With 2 seconds of replication lag, a warehouse worker scans a package and immediately checks the dashboard. The dashboard shows the old status. The worker scans again, thinking the first scan failed. Now there are two duplicate events in the system.

-- Concept: measuring replication lag on the replica
-- Run this on the replica, not the primary

SELECT
    now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- replication_lag
-- 00:00:02.341

-- 2.3 seconds. The replica is showing data that is 2.3 seconds old.
-- For the warehouse dashboard, this causes visible stale reads.

The solutions to replication lag depend on the consistency requirement:

Read-your-own-writes: After a user performs a write, route their subsequent reads to the primary (or a replica known to have replayed past the write’s LSN). This prevents the “scan then check dashboard” problem but requires tracking which user wrote and which LSN they wrote at.

Monotonic reads: Ensure a single user’s reads always go to the same replica. This prevents seeing a newer value and then an older value on the next request (which happens when load balancing routes successive reads to different replicas with different lag).

Leader Election and Failover

Raft leader election state machine showing follower, candidate, and leader states with transitions

The Raft leader election state machine has three states. Followers receive heartbeats from the leader. When heartbeats stop (leader crash or network partition), a follower becomes a candidate and requests votes. A candidate that receives votes from a majority of nodes becomes the new leader. The timeout before a follower becomes a candidate is randomized to avoid split votes. During the election, no writes are accepted. This is the 5-15 second window of unavailability that applications observe during failover.

When the leader fails, a new leader must be elected. Raft (used by etcd, CockroachDB, and Kafka’s KRaft controller) is the consensus algorithm that handles this.

Raft is not covered here to prepare for a distributed systems exam. It is covered because understanding leader election explains a specific production behavior: why your database goes read-only for 5-15 seconds during a failover.

The relevant mechanics:

Followers receive periodic heartbeats from the leader.
If a follower does not receive a heartbeat within the election timeout (typically 1-5 seconds), it assumes the leader has failed.
The follower becomes a candidate and requests votes from other nodes.
If the candidate receives votes from a majority, it becomes the new leader.
The new leader begins accepting writes.

The unavailability window is: detection time (election timeout) + election time (one or more round trips for vote requests) + recovery time (the new leader may need to replay uncommitted WAL entries).

# Concept: Kafka KRaft controller election timeout
# These settings control how quickly Kafka detects and recovers from a controller failure

# server.properties on each broker:
controller.quorum.election.timeout.ms=1000
controller.quorum.fetch.timeout.ms=2000
controller.quorum.election.backoff.max.ms=1000

# Election timeout: 1 second before a follower starts an election
# Fetch timeout: 2 seconds before a follower considers the leader dead
# Backoff: randomized delay to avoid simultaneous elections (split votes)
# Expected failover time: 2-5 seconds

The Unacknowledged Write Problem

During failover, writes that were acknowledged by the old leader but not yet replicated may be lost. In PostgreSQL with asynchronous replication, the primary can acknowledge a commit and crash before the WAL record reaches the replica. The new primary does not have that transaction.

-- Concept: synchronous replication to prevent data loss during failover
-- Trade availability for consistency

-- On the primary:
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';

-- synchronous_commit = remote_apply means:
-- The primary waits until the replica has APPLIED the WAL record
-- before returning success to the client.
-- Failover to the replica loses zero transactions.
-- Cost: every write is as slow as the replica's apply latency.
-- If the replica is 1ms away (same rack), the cost is +1ms per commit.
-- If the replica is 50ms away (different region), every commit takes +50ms.

The Decision Rule

Use asynchronous replication (PostgreSQL default, Kafka with acks=1) when availability matters more than zero-data-loss guarantees. Accept that failover may lose the most recent writes (bounded by the replication lag). This is appropriate for the logistics platform’s package tracking events, where the scanner device retries on failure.

Use synchronous replication (PostgreSQL with synchronous_commit = remote_apply, Kafka with acks=all) when losing a committed write is unacceptable. Inventory decrements, financial transactions, order confirmations. Accept the latency cost of waiting for the replica.

The append-only log from Chapter 1 is the replication protocol. PostgreSQL ships WAL records. Kafka replicates partition log segments. The log is what travels over the network. The index, the B-Tree, the SSTable, these are local data structures rebuilt from the log on each replica. The log is the source of truth. Replication is log shipping. Everything covered in Chapters 1 through 3 culminates here.