Bulkhead Isolation and Retry Strategies
Bulkhead Isolation and Retry Strategies
The Symptom
The circuit breaker from CH18-S1 is deployed. It opens when surge pricing fails. Rides continue. The team celebrates. Then a different failure mode appears.
Tuesday. 2:30 PM. The surge pricing service does not fail. It degrades. Response times climb from 50ms to 5 seconds, slowly, over 20 minutes. The failure rate stays at 0% because every call eventually succeeds. The circuit breaker never opens because no calls fail. But 200 WebClient connections are occupied waiting for 5-second responses, and driver matching calls that need 20ms cannot acquire a connection.
The circuit breaker protects against failures. It does not protect against slowness that stays below the failure threshold.
The Cause
The rider API uses a single WebClient connection pool for all outbound calls. Surge pricing, driver matching, fare calculation, and trip history share the same 200 connections. When any dependency becomes slow, it consumes connections proportional to its response time. At 500 RPS with surge pricing taking 5 seconds:
Connections needed for surge pricing: 500 × 5 = 2,500
Connections available: 200
Connections left for everything else: 0
The circuit breaker watches failure rate, not connection consumption. Surge pricing returns 200 OK after 5 seconds. Success. The circuit stays closed. The connection pool drains.
A bulkhead isolates each dependency into its own resource pool. Surge pricing gets 20 connections. Driver matching gets 50. Trip history gets 30. When surge pricing consumes all 20 of its connections, the other 180 connections are untouched.
The Baseline
Without bulkhead isolation, a single slow dependency blocks all outbound calls:
// BOTTLENECK: Shared connection pool for all dependencies
@Configuration
public class WebClientConfig {
@Bean
public WebClient webClient() {
return WebClient.builder()
.baseUrl("http://internal-gateway")
.build();
// Default Netty pool: 200 connections, shared by all clients
}
}
// BOTTLENECK: All clients use the same WebClient
@Component
public class SurgePricingClient {
private final WebClient webClient; // shares pool with DriverMatchingClient
// ...
}
@Component
public class DriverMatchingClient {
private final WebClient webClient; // shares pool with SurgePricingClient
// ...
}
Load test showing the problem. Surge pricing at 5-second response time, 0% error rate:
Locust: 500 users, surge pricing at 5s latency, 0% surge errors
Metric Value
p50 booking 12,400ms
p95 booking timeout
Error rate 64%
Booking throughput 380 RPS
Driver matching blocked (no connections available)
Circuit breaker CLOSED (0% failure rate)
64% error rate. The circuit breaker is closed because surge pricing is not failing. It is succeeding slowly.
The Fix
SemaphoreBulkhead for Surge Pricing
# SCALED: Bulkhead configuration
resilience4j:
bulkhead:
instances:
surgePricing:
maxConcurrentCalls: 20
maxWaitDuration: 500ms
driverMatching:
maxConcurrentCalls: 50
maxWaitDuration: 1s
fareCalculation:
maxConcurrentCalls: 30
maxWaitDuration: 500ms
tripHistory:
maxConcurrentCalls: 15
maxWaitDuration: 2s
The semaphore bulkhead limits the number of concurrent calls. When 20 surge pricing calls are in flight, the 21st call waits up to 500ms for a slot. If no slot opens, it receives a BulkheadFullException and the fallback activates.
// SCALED: Surge pricing with bulkhead + circuit breaker
@Component
public class SurgePricingClient {
private final WebClient webClient;
private final ReactiveRedisTemplate<String, String> redis;
private final CircuitBreakerRegistry cbRegistry;
private final BulkheadRegistry bulkheadRegistry;
public Mono<BigDecimal> getMultiplier(String zoneId) {
CircuitBreaker cb = cbRegistry.circuitBreaker("surgePricing");
Bulkhead bulkhead = bulkheadRegistry.bulkhead("surgePricing");
return webClient.get()
.uri("/api/surge/{zoneId}", zoneId)
.retrieve()
.bodyToMono(SurgeResponse.class)
.map(SurgeResponse::getMultiplier)
.timeout(Duration.ofSeconds(2))
.transformDeferred(CircuitBreakerOperator.of(cb))
.transformDeferred(BulkheadOperator.of(bulkhead))
.onErrorResume(BulkheadFullException.class,
ex -> getFallbackMultiplier(zoneId))
.onErrorResume(CallNotPermittedException.class,
ex -> getFallbackMultiplier(zoneId))
.onErrorResume(ex -> getFallbackMultiplier(zoneId));
}
private Mono<BigDecimal> getFallbackMultiplier(String zoneId) {
return redis.opsForValue()
.get("surge:last_known:" + zoneId)
.map(BigDecimal::new)
.defaultIfEmpty(BigDecimal.ONE);
}
}
The order of decorators matters. Bulkhead wraps the circuit breaker wraps the actual call. When the bulkhead is full, the circuit breaker never sees the request. When the circuit is open, the bulkhead does not count the call against its limit.
ThreadPoolBulkhead for Driver Matching
// SCALED: ThreadPoolBulkhead for CPU-bound matching
@Component
public class DriverMatchingClient {
private final WebClient webClient;
private final KafkaTemplate<String, MatchRequest> kafkaTemplate;
private final ThreadPoolBulkheadRegistry tpbRegistry;
private final CircuitBreakerRegistry cbRegistry;
public Mono<MatchResult> findDriver(RideRequest request, FareEstimate fare) {
ThreadPoolBulkhead tpb = tpbRegistry.bulkhead("driverMatching");
CircuitBreaker cb = cbRegistry.circuitBreaker("driverMatching");
Supplier<CompletionStage<MatchResult>> supplier = () ->
webClient.post()
.uri("/api/matching/find")
.bodyValue(new MatchRequest(request, fare))
.retrieve()
.bodyToMono(MatchResult.class)
.timeout(Duration.ofSeconds(3))
.toFuture();
return Mono.fromCompletionStage(
Decorators.ofSupplier(supplier)
.withThreadPoolBulkhead(tpb)
.withCircuitBreaker(cb)
.withFallback(List.of(
BulkheadFullException.class,
CallNotPermittedException.class),
ex -> queueForRetry(request, fare).toFuture())
.decorate()
.get()
);
}
private Mono<MatchResult> queueForRetry(RideRequest req, FareEstimate fare) {
return Mono.fromFuture(
kafkaTemplate.send("driver-matching-retry",
req.getRideId(), new MatchRequest(req, fare))
).map(r -> MatchResult.pending(req.getRideId()));
}
}
# SCALED: ThreadPoolBulkhead config
resilience4j:
thread-pool-bulkhead:
instances:
driverMatching:
maxThreadPoolSize: 10
coreThreadPoolSize: 5
queueCapacity: 25
keepAliveDuration: 100ms
The ThreadPoolBulkhead runs calls in a dedicated thread pool. If the pool and queue are full, new calls go to the Kafka fallback immediately. This isolates driver matching from any issues on the event loop.
Retry with Exponential Backoff and Jitter
# SCALED: Retry configuration with backoff and jitter
resilience4j:
retry:
instances:
surgePricing:
maxAttempts: 3
waitDuration: 100ms
enableExponentialBackoff: true
exponentialBackoffMultiplier: 2
exponentialBackoffMaxWaitDuration: 5s
enableRandomizedWait: true
randomizedWaitFactor: 0.5
retryExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
ignoreExceptions:
- io.github.resilience4j.circuitbreaker.CallNotPermittedException
- io.github.resilience4j.bulkhead.BulkheadFullException
Retry timing for 3 attempts with 0.5 jitter factor:
Attempt Base Delay Jitter Range Actual Delay
1 100ms 50ms - 150ms ~100ms (random)
2 200ms 100ms - 300ms ~200ms (random)
3 400ms 200ms - 600ms ~400ms (random)
Total ~700ms worst case
Without backoff, 5,000 clients retry simultaneously at the same interval. The recovering service faces 5,000 requests at T+100ms, 5,000 at T+200ms, 5,000 at T+300ms. Synchronized waves. The service recovers, gets hit, goes down, recovers, gets hit.
Without jitter, even with backoff, 5,000 clients retry at T+100ms, T+200ms, T+400ms. The waves are less frequent but still synchronized. The service faces 5,000 requests every wave.
With backoff and jitter, retries spread across the entire delay window. At T+100ms, maybe 300 clients retry. At T+120ms, another 280. The load spreads evenly, giving the recovering service time to catch up.
Retry with Circuit Breaker: The Combination
// SCALED: Retry wrapping circuit breaker
@Component
public class ResilientSurgePricingClient {
private final WebClient webClient;
private final ReactiveRedisTemplate<String, String> redis;
private final CircuitBreakerRegistry cbRegistry;
private final BulkheadRegistry bulkheadRegistry;
private final RetryRegistry retryRegistry;
public Mono<BigDecimal> getMultiplier(String zoneId) {
CircuitBreaker cb = cbRegistry.circuitBreaker("surgePricing");
Bulkhead bulkhead = bulkheadRegistry.bulkhead("surgePricing");
Retry retry = retryRegistry.retry("surgePricing");
return webClient.get()
.uri("/api/surge/{zoneId}", zoneId)
.retrieve()
.bodyToMono(SurgeResponse.class)
.map(SurgeResponse::getMultiplier)
.timeout(Duration.ofSeconds(2))
.transformDeferred(CircuitBreakerOperator.of(cb))
.transformDeferred(BulkheadOperator.of(bulkhead))
.transformDeferred(RetryOperator.of(retry))
.onErrorResume(ex -> getFallbackMultiplier(zoneId));
}
private Mono<BigDecimal> getFallbackMultiplier(String zoneId) {
return redis.opsForValue()
.get("surge:last_known:" + zoneId)
.map(BigDecimal::new)
.defaultIfEmpty(BigDecimal.ONE);
}
}
The decorator order from outermost to innermost: Retry → Bulkhead → CircuitBreaker → actual call.
When the circuit is open, retries stop. The CallNotPermittedException is in ignoreExceptions for the retry, so the retry does not waste attempts on a dependency the circuit breaker has already marked as failing. The fallback fires immediately.
When the circuit is closed, a transient failure triggers retry with backoff. If 3 retries fail, the circuit breaker records 3 failures. After enough failed calls reach the sliding window threshold, the circuit opens.
The Proof
Load test: surge pricing at 5-second latency (not failing, just slow).
# SCALED: Locust test for bulkhead isolation
from locust import HttpUser, task, between
class BulkheadTest(HttpUser):
wait_time = between(0.05, 0.2)
@task(10)
def book_ride(self):
self.client.post("/api/rides/book", json={
"riderId": f"rider-{self.environment.runner.user_count}",
"pickupLat": 40.7128, "pickupLng": -74.0060,
"dropoffLat": 40.7580, "dropoffLng": -73.9855,
"zoneId": "manhattan-midtown"
})
@task(5)
def fare_estimate(self):
self.client.get("/api/fares/estimate?zoneId=manhattan-midtown")
@task(2)
def trip_history(self):
self.client.get("/api/trips/history?riderId=rider-1")
Results with surge pricing at 5-second latency:
Locust: 500 users, surge pricing at 5s latency (0% errors)
No Bulkhead With Bulkhead
p50 booking 12,400ms 160ms
p95 booking timeout 340ms
p99 booking timeout 920ms
Booking error rate 64% 0.8%
Booking throughput 380 RPS 4,780 RPS
Fare estimate p50 timeout 130ms
Trip history p50 timeout 90ms
Surge calls in-flight 200 (all) 20 (capped)
Circuit breaker state CLOSED CLOSED
Bulkhead rejections N/A ~480/min
Without the bulkhead, surge pricing consumed all 200 connections. Bookings, fare estimates, and trip history all starved. The circuit breaker stayed closed because surge pricing was not failing.
With the bulkhead, surge pricing consumed exactly 20 connections. The remaining 180 served bookings, fare estimates, and trip history at near-normal latency. The 480 bulkhead rejections per minute went to the fallback, which returned cached multipliers in under 1ms.
The 0.8% error rate came from requests that entered the bulkhead queue, waited the full 500ms maxWaitDuration, got a bulkhead slot, then waited 5 seconds for surge pricing, and timed out at the 2-second WebClient timeout. Those requests hit the fallback on the timeout path.
The timeline with bulkhead and circuit breaker combined:
T+0s Surge pricing latency jumps to 5s
T+0s Bulkhead limits surge calls to 20 concurrent
T+0s Remaining 180 connections serve bookings normally
T+4s Surge pricing calls start timing out (2s timeout)
T+4s Circuit breaker starts recording failures
T+8s Circuit breaker opens (50% failure rate over 20 calls)
T+8s All surge pricing calls return cached fallback instantly
T+8s Bulkhead releases all 20 slots
T+8s p99 drops to 95ms (no surge pricing network calls)
The bulkhead bought 8 seconds of containment while the circuit breaker accumulated enough data to open. Without the bulkhead, those 8 seconds would have killed the entire platform.