Skip to main content
data systems from the ground up

Split Brains, Network Partitions, and Clock Skew

7 min read Chapter 35 of 36

Split Brains, Network Partitions, and Clock Skew

The Black Box

The logistics platform runs PostgreSQL with streaming replication and automatic failover via Patroni/etcd. At 2:14 AM, a network switch between racks fails. The primary is in Rack A. The three etcd nodes are split: two in Rack B, one in Rack A. etcd’s Raft quorum (2 of 3) is in Rack B. Patroni in Rack B promotes a replica to primary. The old primary in Rack A is still running. For 47 seconds, both nodes accept writes. 23 orders are written to the old primary that never appear on the new primary.

The Mechanism

Split Brain Detection

Split brain detection after the fact relies on identifying divergent write histories. Every PostgreSQL promotion increments the timeline ID. If two nodes have different timeline IDs and both processed writes during the same time window, a split brain occurred.

-- Concept: post-incident split brain detection
-- Compare WAL positions and timeline IDs from both nodes

-- On Node A (old primary, Rack A):
SELECT
    pg_current_wal_lsn() AS current_lsn,
    timeline_id
FROM pg_control_system();
-- current_lsn: 0/5A000120, timeline_id: 3

-- On Node B (new primary, Rack B):
SELECT
    pg_current_wal_lsn() AS current_lsn,
    timeline_id
FROM pg_control_system();
-- current_lsn: 0/5A000060, timeline_id: 4

-- Node A has a higher LSN on timeline 3 (it kept writing after the partition).
-- Node B is on timeline 4 (it was promoted).
-- The WAL between 0/5A000060 and 0/5A000120 on timeline 3 contains the
-- 23 orders that exist on Node A but not on Node B.

Fencing the Old Primary

Fencing ensures the old primary stops accepting writes before (or immediately after) the new primary starts. The fencing mechanism must not depend on the same network path that partitioned:

# Concept: multi-layer fencing in Patroni

# Layer 1: PostgreSQL pg_terminate_backend via a different network path
# Patroni sends terminate commands through the management network, not the
# data network that is partitioned.

# Layer 2: Revoke the old primary's VIP (Virtual IP)
# The load balancer or keepalived instance that routes traffic to the primary
# is updated to point to the new primary. Clients connecting through the VIP
# automatically route to the new primary.

# Layer 3: iptables fencing (last resort)
# Block the old primary's PostgreSQL port from all client subnets.
iptables -A INPUT -p tcp --dport 5432 -s 10.0.0.0/16 -j DROP

# Layer 4: IPMI power-off (absolute last resort)
# ipmitool -H old-primary-bmc -U admin -P pass chassis power off

Network Partition Behavior by System

Each system in the logistics platform responds to network partitions differently:

PostgreSQL (with Patroni/etcd):

  • Partitioned primary loses its etcd lease.
  • If fencing works: old primary demotes itself or is killed. No split brain.
  • If fencing fails: split brain. Manual recovery required.
  • Recovery: pg_rewind to re-join the old primary to the new timeline, discarding the divergent writes.

Kafka:

  • Partitioned broker loses ZooKeeper/KRaft session.
  • Controller re-assigns partition leadership to in-sync replicas on the reachable side.
  • With min.insync.replicas=2: writes to partitions with insufficient replicas fail (consistency).
  • Recovery: When the partitioned broker re-joins, it truncates its log to the high watermark and fetches missing data from the leader. No split brain because Kafka’s replication protocol prevents it: a broker must be in the ISR to become leader.

Redis (Sentinel):

  • Sentinel promotes a replica if the primary is unreachable.
  • The old primary continues accepting writes during the partition.
  • Redis has no fencing mechanism. Split brain is expected.
  • Recovery: When the old primary re-joins, it becomes a replica and discards all writes from the split period.
  • This is why the logistics platform uses Redis only for cache data, never for data that cannot be regenerated.

Clock Skew Measurement

# Concept: measuring clock skew between nodes

# Check NTP synchronization status
chronyc tracking
# Reference ID    : A29FC801 (ntp1.example.com)
# Stratum         : 3
# System time     : 0.000023423 seconds fast of NTP time
# Last offset     : +0.000014332 seconds
# RMS offset      : 0.000018534 seconds

# 23 microseconds fast. Acceptable for most applications.

# Check maximum observed offset
chronyc sourcestats
# Name/IP Address   NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
# ntp1.example.com  14   8   213m   -0.002     0.012   -14us    8us
# ntp2.example.com  12   7   198m    0.005     0.018   +42us   15us

# Maximum offset: 42 microseconds. 
# If this exceeds your ordering tolerance, use logical clocks.

Hybrid Logical Clocks

A Hybrid Logical Clock (HLC) combines a physical timestamp with a logical counter. It provides timestamps that:

  • Are monotonically increasing (unlike wall clocks during NTP corrections).
  • Respect causality (if event A caused event B, A’s HLC timestamp is lower).
  • Are close to wall clock time (useful for human-readable display).
// Concept: simplified HLC implementation

class HybridLogicalClock {
    private long physicalTime;   // Wall clock component (milliseconds)
    private int logicalCounter;  // Logical component (breaks ties)

    // Called when a local event occurs
    synchronized long[] localEvent() {
        long now = System.currentTimeMillis();
        if (now > physicalTime) {
            physicalTime = now;
            logicalCounter = 0;
        } else {
            logicalCounter++;
        }
        return new long[]{physicalTime, logicalCounter};
    }

    // Called when receiving a message with a remote HLC timestamp
    synchronized long[] receiveEvent(long remotePhysical, int remoteLogical) {
        long now = System.currentTimeMillis();
        if (now > physicalTime && now > remotePhysical) {
            physicalTime = now;
            logicalCounter = 0;
        } else if (remotePhysical > physicalTime) {
            physicalTime = remotePhysical;
            logicalCounter = remoteLogical + 1;
        } else if (physicalTime > remotePhysical) {
            logicalCounter++;
        } else {
            // physicalTime == remotePhysical
            logicalCounter = Math.max(logicalCounter, remoteLogical) + 1;
        }
        return new long[]{physicalTime, logicalCounter};
    }
}

// Usage in the logistics platform:
// Each service maintains an HLC instance.
// Every event includes the HLC timestamp.
// Ordering by (physicalTime, logicalCounter) respects causality
// even when wall clocks disagree.

The Observable Consequence

The logistics platform’s audit trail with wall clock ordering vs HLC ordering during a 200ms clock skew between Service A and Service B:

Physical orderWall clock timestampHLC timestampEvent
1st14:22:00.200 (Service A, +200ms skew)(14:22:00.000, 0)Package scanned at WH-042
2nd14:22:00.050 (Service B, accurate)(14:22:00.050, 0)Package loaded on truck
3rd14:22:00.100 (Service B, accurate)(14:22:00.100, 0)Truck departed

Wall clock order: events 2, 3, 1. Wrong. The scan happened first.

HLC order: events 1, 2, 3. Correct. The HLC on Service A is updated by the local event. Service B receives the scan event’s HLC timestamp and advances its own clock past it, preserving the causal relationship.

The Decision Rule

For PostgreSQL with automatic failover: configure multi-layer fencing (VIP revocation + process termination + network fencing). Test the fencing by simulating a network partition. A failover system that has never been tested under partition is a split-brain system that has not split yet.

For Kafka: set min.insync.replicas=2 for any topic where data loss is unacceptable. Accept that writes to under-replicated partitions will fail during partitions. The producer will buffer and retry.

For Redis: never store data in Redis that cannot be regenerated from another source. Redis split brain is not a bug. It is a design property.

For event ordering across services: use Kafka partition offsets as the canonical ordering. Kafka offsets are strictly ordered within a partition and do not depend on wall clocks. If you need cross-partition ordering, use HLC timestamps. If you need only human-readable timestamps for display, use wall clock time but do not sort by it.