surviving the spike

Cascading Failures, Circuit Breakers, and Bulkheads

7 min read Chapter 52 of 66

The Symptom

Saturday night. 9:14 PM. The surge pricing service starts returning responses in 4 seconds instead of 50ms. The dashboard shows rider API CPU at 12%, memory at 40%, garbage collection normal. Nothing looks wrong except the one metric that matters: ride booking success rate dropped from 99.97% to 23%.

The surge pricing service is slow. The rider API is dead. These two facts are connected by a thread pool and 200 connections that refused to let go.

The Cause

The rider API calls the surge pricing service on every ride request. Under normal conditions, each call takes 50ms. The rider API has 200 Netty event loop threads handling requests. When the surge pricing service starts responding in 4 seconds, each thread blocks for 80x longer than normal.

At 500 requests per second, the math destroys you:

Normal:    500 RPS × 50ms  = 25 concurrent connections (12.5% of pool)
Degraded:  500 RPS × 4000ms = 2000 concurrent connections (1000% of pool)

The connection pool has 200 slots. After 0.4 seconds, every slot holds a connection waiting for surge pricing. New ride requests arrive, find no available connections, and queue. The queue fills. Timeouts fire. But timeouts are set to 30 seconds because someone once saw a legitimate 15-second response during a deployment.
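The arithmetic above is an instance of Little's Law: average in-flight calls equal arrival rate times latency. The sketch below reproduces the numbers from this section. The function names are illustrative and not taken from the rider API.

```python
def concurrent_calls(rps: float, latency_s: float) -> float:
    """Little's Law: average in-flight calls = arrival rate x latency."""
    return rps * latency_s


def seconds_to_saturate(pool_size: int, rps: float) -> float:
    """Time until the pool is full, assuming no connection is released before then."""
    return pool_size / rps


POOL_SIZE = 200
RPS = 500

normal = concurrent_calls(RPS, 0.050)   # 50ms per call
degraded = concurrent_calls(RPS, 4.0)   # 4s per call

print(f"normal:   {normal:.0f} in flight ({normal / POOL_SIZE:.1%} of pool)")
print(f"degraded: {degraded:.0f} in flight ({degraded / POOL_SIZE:.0%} of pool)")
print(f"pool full after {seconds_to_saturate(POOL_SIZE, RPS):.1f}s")
```

The saturation estimate holds here because 0.4 seconds is far shorter than the 4-second response time, so no connection is returned to the pool before it fills.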

The surge pricing service did not crash. It slowed down. And that slowdown propagated upstream through a shared resource pool, killing every feature that shares the rider API: ride booking, fare estimation, driver ETA, trip history. All dead because of one slow dependency.

This is a cascading failure. Service A depends on Service B. Service B degrades. Service A holds resources waiting for B. Service A runs out of resources. Everything behind A dies.

Timeline of a cascading failure:

T+0s   Surge pricing response time: 50ms → 4000ms
T+0.4s Rider API connection pool: 200/200 occupied
T+0.5s Ride booking requests start queueing
T+2s   Queue depth: 1000 requests
T+5s   Rider API health check times out
T+8s   Load balancer marks rider API unhealthy
T+10s  All rider API pods marked unhealthy
T+12s  0% of ride requests succeed

Three patterns prevent this. Circuit breakers stop calling a failing dependency. Bulkheads isolate failure domains so one slow dependency cannot consume all resources. Retries with backoff and jitter let callers recover without stampeding the dependency.

The Baseline

The rider API before resilience patterns:

// BOTTLENECK: No circuit breaker, no bulkhead, shared thread pool
@Service
public class RideBookingService {

    private final SurgePricingClient surgePricingClient;
    private final DriverMatchingClient driverMatchingClient;
    private final FareService fareService;

    public Mono<RideBooking> bookRide(RideRequest request) {
        return surgePricingClient.getMultiplier(request.getZoneId())
            .flatMap(multiplier ->
                fareService.calculate(request, multiplier))
            .flatMap(fare ->
                driverMatchingClient.findDriver(request, fare))
            .map(driver -> createBooking(request, driver));
    }
}

// BOTTLENECK: WebClient with no timeout isolation
@Component
public class SurgePricingClient {

    private final WebClient webClient;

    public Mono<BigDecimal> getMultiplier(String zoneId) {
        return webClient.get()
            .uri("/api/surge/{zoneId}", zoneId)
            .retrieve()
            .bodyToMono(SurgeResponse.class)
            .map(SurgeResponse::getMultiplier)
            .timeout(Duration.ofSeconds(30)); // 30s timeout, might as well be forever
    }
}

Every surge pricing call, driver matching call, and fare calculation shares the same WebClient connection pool. When surge pricing hangs, the pool fills, and driver matching calls that would succeed in 20ms cannot even start.

Load test baseline with all services healthy:

Locust: 500 users, 10 RPS per user

Metric          Value
p50 latency     120ms
p95 latency     280ms
p99 latency     410ms
Error rate      0.03%
Throughput      4,980 RPS

Load test with surge pricing at 4-second latency:

Locust: 500 users, 10 RPS per user, surge pricing at 4s

Metric          Value
p50 latency     28,400ms
p95 latency     timeout
p99 latency     timeout
Error rate      77%
Throughput      310 RPS

77% error rate. The surge pricing service is not down. It is slow. And that slowness killed the entire platform.

The Fix

Three layers of defense.

Layer 1: Circuit Breaker. When the surge pricing service fails repeatedly, stop calling it. Return a fallback value. Stop wasting connections on a service that is not responding.

Layer 2: Bulkhead. Cap the surge pricing client at 20 concurrent calls. When those 20 slots fill up, the remaining 180 connections are still available for driver matching, fare estimates, and trip history.

Layer 3: Retry with Backoff. When a call fails, retry with exponential delay and random jitter. Without backoff, 5,000 simultaneous retries kill the recovering service. Without jitter, 5,000 retries with the same delay hit at the same millisecond.

// SCALED: Resilience4j dependencies
// build.gradle.kts
dependencies {
    implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")
    implementation("io.github.resilience4j:resilience4j-reactor:2.2.0")
    implementation("io.github.resilience4j:resilience4j-micrometer:2.2.0")
}

# SCALED: application.yml - Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      surgePricing:
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 5
        slidingWindowType: COUNT_BASED
        minimumNumberOfCalls: 10
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.reactive.function.client.WebClientResponseException.ServiceUnavailable
      driverMatching:
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 15s
        permittedNumberOfCallsInHalfOpenState: 3

  bulkhead:
    instances:
      surgePricing:
        maxConcurrentCalls: 20
        maxWaitDuration: 500ms
      driverMatching:
        maxConcurrentCalls: 50
        maxWaitDuration: 1s

  retry:
    instances:
      surgePricing:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true
        randomizedWaitFactor: 0.5
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException

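The circuit breaker tracks the outcome of the last 20 calls. Once at least 10 calls have been recorded and 50% or more of them have failed, it opens. For 10 seconds, all calls return the fallback immediately. Then it moves to half-open, allowing 5 test calls. If fewer than half of those fail, it closes. Otherwise it opens again for another 10 seconds. This only protects against a slow dependency if slow calls count as failures: either the client timeout is cut well below 4 seconds so the call raises TimeoutException, or the breaker is also given slowCallDurationThreshold and slowCallRateThreshold settings.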
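The state machine described above can be sketched in a few dozen lines. This is a simplified illustration, not Resilience4j's implementation: the class, its method names, and the injectable clock are assumptions made for the example, with defaults mirroring the surgePricing settings.

```python
import time
from collections import deque


class CircuitBreaker:
    """Count-based breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, window_size=20, min_calls=10, failure_rate=0.5,
                 open_seconds=10.0, half_open_calls=5, clock=time.monotonic):
        self.min_calls = min_calls
        self.failure_rate = failure_rate
        self.open_seconds = open_seconds
        self.half_open_calls = half_open_calls
        self.clock = clock
        self.state = "CLOSED"
        self.results = deque(maxlen=window_size)  # True means the call failed
        self.trials_left = half_open_calls
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may be sent to the dependency right now."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                return False  # caller should return its fallback
            self._enter("HALF_OPEN")
        if self.state == "HALF_OPEN":
            if self.trials_left == 0:
                return False  # trial budget already handed out
            self.trials_left -= 1
        return True

    def record(self, failed: bool) -> None:
        """Record the outcome of a call that was allowed through."""
        if self.state == "OPEN":
            return  # ignore late results from calls started before opening
        self.results.append(failed)
        needed = self.half_open_calls if self.state == "HALF_OPEN" else self.min_calls
        if len(self.results) < needed:
            return  # not enough data to judge yet
        rate = sum(self.results) / len(self.results)
        if rate >= self.failure_rate:
            self._enter("OPEN")
        elif self.state == "HALF_OPEN":
            self._enter("CLOSED")

    def _enter(self, state: str) -> None:
        self.state = state
        self.results.clear()
        self.trials_left = self.half_open_calls
        if state == "OPEN":
            self.opened_at = self.clock()


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(10):            # ten failed calls in a row
    if breaker.allow():
        breaker.record(failed=True)
print(breaker.state)           # OPEN
print(breaker.allow())         # False: use the fallback
now[0] = 10.0                  # the open-state wait has elapsed
for _ in range(5):             # five trial calls succeed
    if breaker.allow():
        breaker.record(failed=False)
print(breaker.state)           # CLOSED
```

The sketch is single-threaded for brevity. A production breaker also has to make these state transitions safe under concurrent calls.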

The bulkhead limits surge pricing to 20 concurrent calls. A call that cannot get a slot within 500ms is rejected instead of queueing indefinitely. The remaining capacity serves ride bookings.
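A concurrency-limiting bulkhead is essentially a semaphore with a bounded wait. The following is a minimal sketch using Python's standard library. The `Bulkhead` and `BulkheadFullError` names and the `fetch` callable are hypothetical stand-ins rather than part of the rider API.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when no permit becomes free within the wait limit."""


class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""

    def __init__(self, max_concurrent_calls: int, max_wait_seconds: float):
        self._permits = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait = max_wait_seconds

    @contextmanager
    def permit(self):
        # Wait at most max_wait_seconds for a free slot, then give up.
        if not self._permits.acquire(timeout=self._max_wait):
            raise BulkheadFullError("no permit available")
        try:
            yield
        finally:
            self._permits.release()  # always hand the slot back


# Mirrors maxConcurrentCalls: 20 and maxWaitDuration: 500ms.
surge_bulkhead = Bulkhead(max_concurrent_calls=20, max_wait_seconds=0.5)


def get_multiplier(zone_id, fetch):
    """fetch is whatever actually calls the surge pricing service."""
    with surge_bulkhead.permit():
        return fetch(zone_id)
```

Because a rejected caller fails after at most 500ms, a slow dependency can hold at most 20 slots, and everything outside the bulkhead keeps its share of the pool.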

The retry waits about 100ms after the first failure and about 200ms after the second, with each wait randomized by up to 50% of the delay in either direction. Three attempts total. If all three fail and the circuit breaker is still closed, the circuit breaker records the failure.
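The delay schedule is easy to compute by hand, and a sketch makes the jitter range explicit. The helper names below are hypothetical, the defaults mirror the surgePricing retry settings, and the jitter is assumed to be symmetric around the nominal delay.

```python
import random
import time


def retry_delays(max_attempts=3, base_delay=0.1, multiplier=2.0,
                 jitter=0.5, rng=random.random):
    """Waits in seconds between attempts; one fewer wait than attempts."""
    delays = []
    for attempt in range(max_attempts - 1):
        nominal = base_delay * multiplier ** attempt   # 0.1s, 0.2s, ...
        # Scale by a random factor in [1 - jitter, 1 + jitter).
        factor = 1 + jitter * (2 * rng() - 1)
        delays.append(nominal * factor)
    return delays


def call_with_retry(operation, retryable=(IOError, TimeoutError),
                    sleep=time.sleep, **schedule):
    """Run operation, retrying on the given exceptions with jittered backoff."""
    delays = retry_delays(**schedule)
    for delay in delays + [None]:      # None marks the final attempt
        try:
            return operation()
        except retryable:
            if delay is None:
                raise                  # attempts exhausted, surface the error
            sleep(delay)


print(retry_delays(jitter=0.0))  # nominal schedule without jitter
print(retry_delays())            # first wait in 0.05-0.15s, second in 0.10-0.30s
```

With jitter, thousands of clients that failed at the same instant spread their retries across a window instead of landing on the recovering service in the same millisecond.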

The Proof

Load test with surge pricing at 4-second latency, circuit breaker + bulkhead + retry enabled:

# SCALED: Locust test for cascading failure with resilience patterns
from locust import HttpUser, task, between

class RideBookingUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(10)
    def book_ride(self):
        payload = {
            # user_count is the current number of simulated users, not a per-user ID
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128,
            "pickupLng": -74.0060,
            "dropoffLat": 40.7580,
            "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }
        with self.client.post("/api/rides/book", json=payload,
                              catch_response=True) as response:
            if response.status_code == 200:
                # A booking made with the fallback multiplier ("degraded": true)
                # still counts as a success: degraded but functional
                response.success()
            elif response.status_code == 503:
                response.failure("Service unavailable")

    @task(3)
    def get_fare_estimate(self):
        self.client.get("/api/fares/estimate?zoneId=manhattan-midtown")

    @task(1)
    def get_trip_history(self):
        self.client.get("/api/trips/history?riderId=rider-1")

Results with resilience patterns:

Locust: 500 users, 10 RPS per user, surge pricing at 4s latency

                    Without Resilience   With Resilience
p50 latency         28,400ms            140ms
p95 latency         timeout             310ms
p99 latency         timeout             890ms
Error rate          77%                 0.4%
Throughput          310 RPS             4,850 RPS
Booking success     23%                 99.6%
Circuit state       N/A                 OPEN after 8s
Surge fallback      N/A                 Cached multiplier

The circuit breaker opened 8 seconds after the surge pricing degradation started. During those 8 seconds, 20 connections (the bulkhead limit) were occupied by slow surge pricing calls. The remaining 180 connections served ride bookings and fare estimates at near-normal latency.

After the circuit opened, surge pricing calls returned the cached multiplier in under 1ms. The rider got a ride at the last-known surge price instead of no ride at all.
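A last-known-value fallback can be sketched as a small cache wrapped around the remote call. The class below is illustrative only. Its name, the `fetch` callable, and the default of 1.0 for a zone that has never been seen are assumptions, not details from the rider API.

```python
from decimal import Decimal


class SurgeMultiplierCache:
    """Remembers the last multiplier seen per zone for use as a fallback."""

    def __init__(self, default=Decimal("1.0")):
        self._last_known = {}
        self._default = default  # assumed value for zones with no history

    def get_multiplier(self, zone_id, fetch):
        """Return (multiplier, degraded); degraded is True on the fallback path."""
        try:
            multiplier = fetch(zone_id)
        except Exception:
            # In real code, catch only circuit-open, bulkhead-full and timeout errors.
            return self._last_known.get(zone_id, self._default), True
        self._last_known[zone_id] = multiplier  # refresh on every success
        return multiplier, False


def healthy(zone_id):
    return Decimal("1.8")


def unavailable(zone_id):
    raise TimeoutError("surge pricing did not answer in time")


cache = SurgeMultiplierCache()
print(cache.get_multiplier("manhattan-midtown", healthy))      # fresh value, degraded=False
print(cache.get_multiplier("manhattan-midtown", unavailable))  # cached value, degraded=True
```

Returning the degraded flag alongside the value lets the booking response tell clients and load tests that a fallback price was used.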

Trip history and fare estimates continued unaffected throughout the incident because the bulkhead prevented surge pricing from consuming their connection capacity.

The 0.4% error rate came from the 8-second window before the circuit opened. Requests that were already queued behind the bulkhead’s 20 connections timed out at the 500ms maxWaitDuration. Those are the users who retried and got through on the second attempt.