Skip to main content
data systems from the ground up

Consensus, Leader Election, and Why Your Database Goes Read-Only

5 min read Chapter 12 of 36

Consensus, Leader Election, and Why Your Database Goes Read-Only

The Black Box

The primary database crashes at 02:14:07 AM. The application begins returning 500 errors. At 02:14:19 AM, twelve seconds later, writes resume. The postmortem says “automatic failover to the replica.” The twelve seconds are not explained. They are the leader election.

The Mechanism

Raft consensus guarantees that a group of nodes agrees on a single leader. The relevant mechanics for understanding failover duration:

Terms. Time is divided into terms, each identified by a monotonically increasing integer. Each term has at most one leader. When a leader fails, a new term begins with an election.

Heartbeats. The leader sends heartbeats to all followers at a regular interval (typically 100-500ms). Each heartbeat resets the follower’s election timer.

Election timeout. If a follower’s election timer expires (no heartbeat received within the timeout), the follower transitions to the candidate state, increments the term, votes for itself, and sends vote requests to all other nodes.

Vote rules. A node votes for a candidate only if (1) the candidate’s term is higher than the node’s current term, and (2) the candidate’s log is at least as up-to-date as the voter’s log. A candidate that receives votes from a majority of nodes becomes the leader.

The Failover Timeline

The twelve seconds from the example break down:

PhaseDurationCause
Detection1-5sElection timeout waiting for missed heartbeats
Election0.5-2sVote request round trips, possible split vote retry
Log recovery0-5sNew leader replaying uncommitted entries
Connection drain1-3sApplication connections to old primary timing out

The election timeout is the dominant factor. A short timeout (500ms) detects failures fast but risks false elections during network blips. A long timeout (5s) avoids false elections but delays failover.

The Observable Consequence

PostgreSQL with Patroni

Patroni is a PostgreSQL high-availability controller that uses etcd (which uses Raft) for leader election.

# Concept: Patroni configuration controlling failover timing
# /etc/patroni/patroni.yml

bootstrap:
  dcs:
    ttl: 30                    # Leader lease duration (seconds)
    loop_wait: 10              # Health check interval (seconds)
    retry_timeout: 10          # Timeout for DCS operations
    maximum_lag_on_failover: 1048576  # Max replication lag (bytes) for promotion

# Failover timeline with these settings:
# 1. Primary crashes at T=0
# 2. Patroni on primary misses health check. etcd lease expires after TTL (30s worst case)
# 3. Patroni on replica detects leader key is gone (next loop_wait, up to 10s)
# 4. Patroni promotes replica to primary (1-3s for pg_ctl promote)
# 5. Total: 30 + 10 + 3 = 43 seconds worst case
# 6. With TTL=10, loop_wait=3: 10 + 3 + 3 = 16 seconds worst case

The maximum_lag_on_failover parameter prevents promoting a replica that is too far behind. If the only available replica has 100MB of replication lag, Patroni will not promote it, and the cluster remains read-only until the replica catches up or an operator intervenes. This is the scenario that produces a postmortem reading “failover took 8 minutes.”

Kafka KRaft

Kafka’s KRaft mode replaces ZooKeeper with a built-in Raft implementation for controller election:

# Concept: KRaft election timing
# server.properties

controller.quorum.election.timeout.ms=1000
controller.quorum.fetch.timeout.ms=2000

# A follower controller waits 2 seconds for a fetch response.
# If no response, it starts an election after 1 second (randomized).
# Total failover: 2 + 1 + election round trip = 3-5 seconds.
#
# During this window:
# - Existing partition leaders continue serving reads and writes
# - New partition leader elections are blocked
# - Topic creation/deletion is blocked
# - Consumer group rebalancing is blocked

The critical difference: Kafka’s data path (partition leaders) is independent of the controller. A controller election blocks metadata operations but not data read/write on existing partitions. PostgreSQL’s failover blocks all writes until the new primary is promoted.

The Code

Detecting failover from the application side:

// Concept: detecting and handling PostgreSQL failover in application code
// The connection to the old primary fails. The application must reconnect to the new primary.

// JDBC connection string with multiple hosts and target_session_attrs
String url = "jdbc:postgresql://primary:5432,replica1:5432,replica2:5432/logistics"
    + "?targetServerType=primary"
    + "&connectTimeout=5"
    + "&socketTimeout=10";

// targetServerType=primary: JDBC driver connects to the first host
// that reports itself as a primary (not in recovery mode).
// During failover:
// 1. Connection to old primary fails (socketTimeout = 10s)
// 2. Driver tries replica1. If Patroni promoted replica1, it is now primary. Connect.
// 3. If replica1 is still a replica, try replica2.
// 4. If no primary found, throw exception. Application retries.

The Decision Rule

Set the election timeout based on your availability budget. If 30 seconds of downtime per failover is acceptable, use Patroni’s defaults. If you need sub-10-second failover, reduce ttl to 10 and loop_wait to 3, and accept that network jitter may trigger unnecessary failovers.

Use synchronous replication (synchronous_commit = remote_apply) if losing committed transactions during failover is unacceptable. This guarantees the promoted replica has every committed transaction. Without synchronous replication, the promoted replica may be missing the most recent transactions that were committed on the old primary but not yet replicated.

The Raft election described in Chapter 4 is the reason the failover window exists at all. Understanding the election mechanics, the timeout, the vote round trip, the log recovery, allows you to estimate failover duration from configuration parameters rather than waiting for a production incident to measure it.