Backpressure as a Native Resilience Mechanism
Backpressure as a Native Resilience Mechanism
In imperative code, a thread pool limits concurrency. When the pool is full, new tasks are rejected or queued. In reactive code, backpressure is the equivalent mechanism: a subscriber tells the publisher how many items it can handle, and the publisher respects that limit.
Backpressure in the Payment Pipeline
The notification service receives payment completion events from the payment service via a reactive stream. Under normal operation, the notification service processes 200 events per second. During a flash sale, the payment service produces 2,000 events per second. Without backpressure, the notification service buffers all 2,000 events in memory, eventually running out of heap space.
// PROBLEM - Unbounded buffering
paymentEvents.flatMap(event ->
notificationClient.sendNotification(event))
// flatMap subscribes to all inner Monos concurrently
// 2,000 concurrent HTTP calls to the email provider
flatMap with no concurrency limit subscribes to every inner Mono concurrently. With 2,000 events per second, 2,000 HTTP calls fire simultaneously. The email provider rejects most of them (rate limit), creating 2,000 error responses that trigger retries, amplifying the problem.
// PRODUCTION - Bounded concurrency with flatMap
paymentEvents.flatMap(event ->
notificationClient.sendNotification(event),
50) // Max 50 concurrent notifications
// When 50 are in flight, backpressure propagates upstream.
// The payment events publisher pauses until a slot opens.
The second argument to flatMap limits concurrency to 50. When 50 notifications are in flight, flatMap stops requesting items from the upstream publisher. The upstream publisher (the payment events source) buffers or pauses. This is backpressure: a demand signal flowing from subscriber to publisher.
Overflow Strategies
When the upstream cannot be paused (external event source, Kafka consumer, HTTP request stream), an explicit overflow strategy is needed:
// PRODUCTION - Overflow strategies for uncontrollable sources
paymentEvents
// Buffer up to 1000 events, drop oldest when full
.onBackpressureBuffer(1000,
dropped -> meterRegistry.counter("events.dropped",
"reason", "backpressure").increment(),
BufferOverflowStrategy.DROP_OLDEST)
.flatMap(event ->
notificationClient.sendNotification(event), 50);
// Alternative: drop newest (keep older events, discard new ones)
paymentEvents
.onBackpressureDrop(dropped ->
meterRegistry.counter("events.dropped",
"reason", "backpressure").increment())
.flatMap(event ->
notificationClient.sendNotification(event), 50);
The choice between DROP_OLDEST and DROP_LATEST depends on the business requirement. For notifications, DROP_OLDEST makes sense: if we are falling behind, notify the most recent customers first. For audit logging, neither dropping strategy is acceptable; the events must be persisted to a durable store (Kafka dead letter topic) for later processing.
Backpressure + Circuit Breaker
When the notification service’s email provider has an outage, the circuit breaker opens. Calls to the email provider are rejected immediately by the circuit breaker. With flatMap concurrency of 50, all 50 slots complete instantly (circuit breaker rejection takes microseconds). The flatMap immediately requests 50 more items, which are also rejected instantly. The pipeline processes items as fast as the upstream can produce them, but every item is rejected.
This is correct behavior, but the metrics look alarming: thousands of circuit breaker rejections per second. The fix is to add a delay when the circuit breaker is open:
// PRODUCTION - Rate-limited processing during circuit breaker open state
paymentEvents
.flatMap(event ->
notificationClient.sendNotification(event)
.transformDeferred(
CircuitBreakerOperator.of(circuitBreaker))
.onErrorResume(CallNotPermittedException.class,
e -> Mono.delay(Duration.ofSeconds(1))
.then(Mono.empty())),
50);
When the circuit breaker is open, each rejection is followed by a 1-second delay before the slot is freed. This reduces the rejection rate from thousands per second to 50 per second (50 slots, each delayed by 1 second). The events accumulate in the buffer, which is bounded, and excess events are dropped with a metric.
This pattern, slowing down processing during circuit breaker open state, prevents the pipeline from churning through events that cannot be processed, while still allowing the circuit breaker’s half-open probes to detect recovery.