Skip to main content
data systems from the ground up

Cascading Failures and the Recovery Playbook

8 min read Chapter 36 of 36

Cascading Failures and the Recovery Playbook

The Black Box

Tuesday, 14:32. The logistics platform’s PostgreSQL database runs a checkpoint during peak afternoon traffic. For 8 seconds, query latency increases from 2ms to 180ms. This is normal behavior. The checkpoint completes. PostgreSQL returns to 2ms latency. But at 14:33, the on-call engineer receives 47 alerts: route optimizer timeout, dispatch service failure, notification service backlog. By 14:35, the entire logistics pipeline is down. The root cause is a routine 8-second checkpoint. The actual failure is the absence of cascading failure protection.

The Mechanism

The Cascade Anatomy

The failure propagates through four stages:

Stage 1: The slow resource (T+0s to T+8s) PostgreSQL checkpoint causes 180ms query latency. The package service, which normally responds in 5ms, now responds in 185ms.

Stage 2: Connection pool exhaustion (T+3s to T+10s) The package service uses a HikariCP connection pool with 10 connections. At normal 5ms query latency, 10 connections handle 2,000 queries/sec. At 180ms latency, 10 connections handle 55 queries/sec. Incoming requests queue for a connection.

// Concept: connection pool saturation during a checkpoint

// HikariCP configuration
HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(10);
config.setConnectionTimeout(5000);    // Wait 5 seconds for a connection

// Normal operation:
// 10 connections × (1000ms / 5ms per query) = 2,000 queries/sec capacity

// During checkpoint:
// 10 connections × (1000ms / 180ms per query) = 55 queries/sec capacity
// Incoming rate: 200 queries/sec
// Deficit: 145 queries/sec queuing for connections
// After 3 seconds: 435 requests queued
// After 5 seconds: connection timeout exceptions begin

Stage 3: Thread starvation (T+5s to T+15s) The route optimizer calls the package service via HTTP. Each call blocks a thread while waiting for a response. The route optimizer has a thread pool of 50 threads. At 200ms response time (package service), all 50 threads are consumed within 10 seconds. New requests to the route optimizer queue indefinitely.

Stage 4: Upstream propagation (T+10s to T+30s) The dispatch service calls the route optimizer. The route optimizer is not responding (all threads consumed). The dispatch service’s HTTP client times out after 10 seconds. But it retries 3 times. Each retry waits 10 seconds. One dispatch request now consumes 30 seconds of thread time. The dispatch service’s thread pool exhausts. Alerts fire.

Circuit Breaker

The circuit breaker stops the cascade at Stage 3 by failing fast instead of waiting.

// Concept: circuit breaker with resilience4j

// Configuration
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10)               // Track last 10 calls
    .failureRateThreshold(50)            // Open circuit if 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30))  // Stay open for 30 seconds
    .permittedNumberOfCallsInHalfOpenState(3)         // Test 3 calls when half-open
    .slowCallDurationThreshold(Duration.ofSeconds(2)) // Calls > 2s count as slow
    .slowCallRateThreshold(80)           // Open if 80% of calls are slow
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("package-service", config);

// Usage in the route optimizer
Supplier<RouteData> decoratedCall = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> packageServiceClient.getPackages(routeId));

try {
    RouteData data = decoratedCall.get();
    // Normal processing
} catch (CallNotPermittedException e) {
    // Circuit is OPEN. Return cached data or a degraded response.
    // This call took < 1ms instead of 10 seconds.
    return cachedRouteData.get(routeId);
}

With the circuit breaker, the cascade stops:

  • T+0s: Checkpoint begins. Package service slows down.
  • T+5s: 5 of 10 calls to package service exceed 2s threshold. Circuit opens.
  • T+5s onward: Route optimizer rejects calls to package service immediately. Returns cached data. Thread pool is not exhausted. Dispatch service continues working.
  • T+35s: Circuit transitions to half-open. Tests 3 calls. Package service has recovered. Circuit closes.

Total impact: 5 seconds of degraded route optimization. No cascading failure.

Bulkhead Pattern

A bulkhead isolates failure domains so that a failure in one area does not consume resources needed by another.

// Concept: separate thread pools for separate downstream services
// If the package service fails, the notification service is unaffected.

// Without bulkheads: one shared thread pool
ExecutorService sharedPool = Executors.newFixedThreadPool(50);
// Package service calls and notification service calls compete for the same 50 threads.
// If package service is slow, it consumes all 50 threads.
// Notification calls cannot execute.

// With bulkheads: isolated thread pools
ExecutorService packageServicePool = Executors.newFixedThreadPool(20);
ExecutorService notificationServicePool = Executors.newFixedThreadPool(15);
ExecutorService routeServicePool = Executors.newFixedThreadPool(15);

// If package service is slow, it consumes at most 20 threads.
// Notification service still has its dedicated 15 threads.
// The failure is contained.

The Observable Consequence

The logistics platform with and without cascading failure protection during a 10-second PostgreSQL checkpoint:

Without protection:

TimePackage ServiceRoute OptimizerDispatch ServiceNotifications
T+0sSlow (180ms)NormalNormalNormal
T+5sTimeoutsThreads fillingNormalNormal
T+10sRecoveringThread pool fullTimeouts startingQueue growing
T+15sNormalStill saturatedThread pool fullBacklogged
T+30sNormalRecoveringRecoveringProcessing backlog
T+60sNormalNormalNormalBacklog cleared

Recovery time: 60 seconds. User-visible impact: 50 seconds.

With circuit breakers and bulkheads:

TimePackage ServiceRoute OptimizerDispatch ServiceNotifications
T+0sSlow (180ms)NormalNormalNormal
T+5sTimeoutsCircuit OPEN, cached dataNormalNormal
T+10sRecoveringCircuit OPEN, cached dataNormalNormal
T+35sNormalCircuit CLOSEDNormalNormal

Recovery time: 35 seconds. User-visible impact: route data was 30 seconds stale. No service went down.

The Code

Recovery Sequence After a Major Failure

# Concept: structured recovery after a multi-component failure
# Order matters. Starting application services before storage is recovered
# causes secondary failures.

# Step 1: Storage layer
# Wait for PostgreSQL crash recovery to complete
pg_isready -h localhost -p 5432
# Returns 0 when PostgreSQL has finished WAL replay and accepts connections.

# Step 2: Message layer
# Check Kafka cluster health
kafka-metadata.sh --snapshot /var/kafka-logs/__cluster_metadata-0/00000000000000000000.log \
    --cluster-id $(kafka-storage.sh info -c /etc/kafka/server.properties | grep 'Cluster ID' | awk '{print $3}')

# Check for under-replicated partitions
kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092
# If any: wait for ISR to re-form before starting consumers.

# Step 3: Application services (in dependency order)
# Start services that have no downstream dependencies first.
systemctl start package-service        # Depends on: PostgreSQL, Kafka
systemctl start notification-service   # Depends on: Kafka
systemctl start route-optimizer        # Depends on: package-service
systemctl start dispatch-service       # Depends on: route-optimizer

# Step 4: Derived stores
# Redis cache: flush and warm
redis-cli FLUSHDB
# Let application requests populate the cache organically,
# or run a cache-warming script:
# java -jar cache-warmer.jar --source postgresql --target redis

# Step 5: Verify data integrity

Post-Incident Data Integrity Checks

-- Concept: checking for data inconsistencies after recovery

-- 1. Check for gaps in event sequence
WITH event_gaps AS (
    SELECT
        package_id,
        event_sequence,
        LAG(event_sequence) OVER (PARTITION BY package_id ORDER BY event_sequence) AS prev_seq,
        event_sequence - LAG(event_sequence) OVER (PARTITION BY package_id ORDER BY event_sequence) AS gap
    FROM package_events
    WHERE created_at > now() - interval '24 hours'
)
SELECT package_id, prev_seq, event_sequence, gap
FROM event_gaps
WHERE gap > 1
ORDER BY gap DESC;

-- 2. Check for duplicate events (idempotency key violations)
SELECT idempotency_key, COUNT(*) as occurrences
FROM package_events
WHERE created_at > now() - interval '24 hours'
GROUP BY idempotency_key
HAVING COUNT(*) > 1;

-- 3. Compare Kafka consumer offsets with expected message counts
-- If committed offset < topic high watermark, events may not have been processed.
# Check Kafka consumer lag (events not yet processed)
kafka-consumer-groups.sh --describe --group package-processor \
    --bootstrap-server localhost:9092

# TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# package-events  0          14523           14523           0
# package-events  1          12891           13402           511
# package-events  2          15102           15102           0

# Partition 1 has 511 unprocessed events.
# These are events that arrived during the outage and need to be consumed.
# The consumer will process them automatically when it resumes.

The Decision Rule

Every inter-service call needs a circuit breaker. The default configuration: 50% failure rate threshold, 30-second open duration, 2-second slow call threshold. Tune from there based on observed latency distributions.

Every service needs bulkheads around downstream dependencies. Use separate thread pools, connection pools, or rate limiters for each downstream service. The pool size for each downstream should be calculated from the downstream’s capacity: pool_size = target_throughput x downstream_latency_p99.

Set retry budgets, not retry counts. Instead of “retry 3 times,” use “retry for up to 5 seconds total.” This prevents retries from multiplying the load during a slow downstream failure.

Recovery sequence is fixed: storage, messaging, application (in dependency order), derived stores. Write the sequence in a runbook. Test it quarterly. A recovery plan that has never been executed is not a plan. It is a hypothesis.

After every major incident, run the data integrity checks. Compare Kafka consumer offsets with topic high watermarks. Check for event gaps and duplicates. Cross-reference the PostgreSQL WAL LSN with the expected state. The recovery is not complete until the data is verified.