Resilience as Architecture

The previous chapters addressed resilience at the component level: circuit breakers protect individual dependencies, bulkheads isolate failure domains, retries handle transient errors. These are local patterns applied to specific call sites. This chapter addresses resilience at the architectural level: how the entire system responds when demand exceeds capacity, when multiple failures compound, and when the correct response is to refuse work.

Load Shedding

A bulkhead limits concurrency for a specific dependency. Load shedding limits the total work the service accepts. When the payment service is processing 500 requests per second and the fraud detection service goes down, the circuit breaker opens and fallbacks activate. The service continues processing at 500 requests per second. But what if 500 requests per second with fallback processing is more load than the service can handle? The fallback might trigger cache lookups, database queries for manual review thresholds, and audit logging, consuming more resources than the original fraud check.

Load shedding rejects requests at the front door, before any business processing begins. A rejected request costs almost nothing: it occupies a request thread only long enough to write the 503, and it needs no database connection and makes no downstream calls.

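// PRODUCTION - Load shedding based on in-flight request count
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

import io.micrometer.core.instrument.MeterRegistry;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component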
public class LoadSheddingFilter implements Filter {

    private final AtomicInteger inFlight = new AtomicInteger(0);
    private final int maxInFlight;
    private final MeterRegistry meterRegistry;

    public LoadSheddingFilter(
            @Value("${load-shedding.max-in-flight:300}") int maxInFlight,
            MeterRegistry meterRegistry) {
        this.maxInFlight = maxInFlight;
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
            throws IOException, ServletException {

        int current = inFlight.incrementAndGet();

        try {
            if (current > maxInFlight) {
                meterRegistry.counter("http.requests.shed").increment();

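                HttpServletResponse httpResponse =
                        (HttpServletResponse) response;
                httpResponse.setStatus(
                        HttpServletResponse.SC_SERVICE_UNAVAILABLE);
                httpResponse.setHeader("Retry-After", "5");
                // Declare the body type so clients parse the error as JSON
                httpResponse.setContentType("application/json");
                httpResponse.getWriter().write(
                        "{\"error\":\"Service overloaded\"}");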
                return;
            }

            chain.doFilter(request, response);

        } finally {
            inFlight.decrementAndGet();
        }
    }
}

The starting point for maxInFlight is Little’s Law: max_in_flight = target_throughput * average_latency. If the target is 500 rps and average latency is 100 ms, then max_in_flight = 500 * 0.1 = 50. That figure is the average concurrency at steady state, so in practice the limit is set higher (300 in the example) to absorb latency variance and short bursts. Confirm the final value by load testing: increase traffic until latency degrades, and set maxInFlight to the concurrency level at which latency is still acceptable.
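The sizing rule can be sketched as a small helper. This is an illustration only: the class name InFlightSizing and the headroom multiplier are assumptions introduced here, not part of the platform code, and the headroom value still has to come from load testing.

```java
import java.time.Duration;

/** Sizing sketch for the in-flight limit, based on Little's Law (L = lambda * W). */
final class InFlightSizing {

    /**
     * @param targetRps  target throughput in requests per second
     * @param avgLatency average time a request spends inside the service
     * @param headroom   assumed multiplier (>= 1.0) that absorbs latency variance
     */
    static int maxInFlight(double targetRps, Duration avgLatency, double headroom) {
        // Average concurrency at steady state: throughput * latency in seconds
        double steadyState = targetRps * avgLatency.toMillis() / 1000.0;
        return (int) Math.ceil(steadyState * headroom);
    }

    public static void main(String[] args) {
        // 500 rps at 100 ms with no headroom: the raw Little's Law figure
        System.out.println(maxInFlight(500, Duration.ofMillis(100), 1.0));
        // The same load with a headroom factor of 6
        System.out.println(maxInFlight(500, Duration.ofMillis(100), 6.0));
    }
}
```

With no headroom the helper returns the 50 computed above; a headroom factor of 6 reproduces the 300 used as the filter's default.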

Priority-Based Admission Control

Not all requests have equal value. A $10,000 payment is more important than a balance inquiry. A recurring payment from a premium customer is more important than a new customer’s first $5 payment. Under load, the system should shed low-priority work first.

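// PRODUCTION - Priority-based request admission
import java.io.IOException;
import java.math.BigDecimal;
import java.util.concurrent.atomic.AtomicInteger;

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;

@Component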
public class PriorityAdmissionFilter implements Filter {

    private final AtomicInteger inFlight = new AtomicInteger(0);

    // Three tiers; lower-priority tiers reach their limit sooner
    private static final int TIER_1_LIMIT = 300; // Critical: shed last
    private static final int TIER_2_LIMIT = 200; // Standard: shed second
    private static final int TIER_3_LIMIT = 100; // Background: shed first

    @Override
    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest httpRequest = (HttpServletRequest) request;
        int priority = determinePriority(httpRequest);
        int current = inFlight.incrementAndGet();

        try {
            int limit = switch (priority) {
                case 1 -> TIER_1_LIMIT;  // High-value payments, recurring
                case 2 -> TIER_2_LIMIT;  // Standard payments
                default -> TIER_3_LIMIT; // Balance inquiries, status checks
            };

            if (current > limit) {
                ((HttpServletResponse) response).setStatus(503);
                ((HttpServletResponse) response)
                        .setHeader("Retry-After", "5");
                return;
            }

            chain.doFilter(request, response);
        } finally {
            inFlight.decrementAndGet();
        }
    }

    private int determinePriority(HttpServletRequest request) {
        String path = request.getRequestURI();

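        // Payment endpoints are higher priority than read endpoints.
        // Only the amount check is shown; recurring payments (also tier 1)
        // would need their own signal.
        if (path.startsWith("/payments")) {
            String amountHeader = request.getHeader("X-Payment-Amount");
            if (amountHeader != null) {
                try {
                    BigDecimal amount = new BigDecimal(amountHeader.trim());
                    if (amount.compareTo(new BigDecimal("1000")) > 0) {
                        return 1; // High-value payment
                    }
                } catch (NumberFormatException e) {
                    // Malformed header: fall through to standard priority
                }
            }
            return 2; // Standard payment
        }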

        return 3; // Everything else
    }
}

Under increasing load:

Up to 100 concurrent requests: all tiers admitted.
Above 100 (for example, at 150): tier 3 (balance inquiries) is shed. Payments continue normally.
Above 200 (for example, at 250): tier 2 (standard payments) is also shed. High-value and recurring payments continue.
Above 300: even tier 1 is shed. The system is at absolute capacity.
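The thresholds can be checked in isolation with a pure function. The sketch below copies the three limits from the filter; the class name TierAdmission and the method names are assumptions made for illustration. As in the filter, the in-flight count includes the request being evaluated.

```java
/** Pure-function sketch of the tiered admission decision. */
final class TierAdmission {

    static final int TIER_1_LIMIT = 300; // Critical: shed last
    static final int TIER_2_LIMIT = 200; // Standard: shed second
    static final int TIER_3_LIMIT = 100; // Background: shed first

    static int limitFor(int priority) {
        return switch (priority) {
            case 1 -> TIER_1_LIMIT;
            case 2 -> TIER_2_LIMIT;
            default -> TIER_3_LIMIT;
        };
    }

    /** A request is admitted while the in-flight count is within its tier's limit. */
    static boolean admit(int priority, int inFlight) {
        return inFlight <= limitFor(priority);
    }

    public static void main(String[] args) {
        int[] loads = {100, 150, 250, 301};
        for (int load : loads) {
            System.out.printf("inFlight=%d tier1=%b tier2=%b tier3=%b%n",
                    load, admit(1, load), admit(2, load), admit(3, load));
        }
    }
}
```

Keeping the decision separate from the servlet plumbing also makes the boundaries (100, 200, 300) straightforward to unit test.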

Backpressure Propagation Across Service Boundaries

In a reactive pipeline (Chapter 12), backpressure flows naturally through the reactive stream. In a synchronous service mesh, backpressure must be propagated explicitly using HTTP semantics.

When the payment service is shedding load (returning 503), the API gateway should propagate this signal upstream:

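// PRODUCTION - API gateway backpressure propagation
import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.client.HttpServerErrorException;
import org.springframework.web.client.RestClient;

@Component
public class BackpressureAwareGateway {

    private static final int REJECT_THRESHOLD = 10;

    private final RestClient paymentClient;
    private final AtomicInteger consecutiveRejects = new AtomicInteger(0);

    public BackpressureAwareGateway(RestClient paymentClient) {
        this.paymentClient = paymentClient;
    }

    public PaymentResponse forwardPayment(PaymentRequest request) {
        try {
            ResponseEntity<PaymentResponse> response = paymentClient
                    .post()
                    .uri("/payments")
                    .body(request)
                    .retrieve()
                    .toEntity(PaymentResponse.class);

            consecutiveRejects.set(0);
            return response.getBody();

        } catch (HttpServerErrorException e) {
            // retrieve() throws on 5xx by default, so a 503 arrives here
            // as an exception rather than as a response entity
            if (e.getStatusCode().value() != 503) {
                throw e;
            }

            int rejects = consecutiveRejects.incrementAndGet();

            // parseRetryAfter (not shown) reads the Retry-After header
            if (rejects > REJECT_THRESHOLD) {
                // Sustained shedding: the caller should stop forwarding
                // until Retry-After has elapsed
                throw new ServiceOverloadedException(
                        "Payment service is shedding load",
                        parseRetryAfter(e.getResponseHeaders()));
            }

            throw new ServiceOverloadedException(
                    "Payment service returned 503",
                    parseRetryAfter(e.getResponseHeaders()));
        }
    }
}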

The Retry-After header propagates the backpressure signal. The downstream service (payment service) tells the upstream (API gateway) how long to wait before sending more requests. The gateway can enforce this by rejecting requests from clients for that duration, or by queueing them.
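One way the gateway might enforce the Retry-After window is a small gate that records the deadline and rejects locally until it passes. This is a sketch under stated assumptions: the class RetryAfterGate, its method names, and the DEFAULT_WAIT and MAX_WAIT bounds are hypothetical and do not come from the platform code. It accepts both forms of the header value, delta-seconds and an HTTP-date, and uses only the JDK.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical gateway-side gate: rejects locally until Retry-After has elapsed. */
final class RetryAfterGate {

    // Assumed bounds; tune for the deployment
    private static final Duration DEFAULT_WAIT = Duration.ofSeconds(5);
    private static final Duration MAX_WAIT = Duration.ofMinutes(5);

    private final Clock clock;
    private final AtomicReference<Instant> closedUntil =
            new AtomicReference<>(Instant.EPOCH);

    RetryAfterGate(Clock clock) {
        this.clock = clock;
    }

    /** Parses Retry-After as delta-seconds or an HTTP-date, clamped to [0, MAX_WAIT]. */
    Duration parseRetryAfter(String headerValue) {
        if (headerValue == null || headerValue.isBlank()) {
            return DEFAULT_WAIT;
        }
        String value = headerValue.trim();
        Duration wait;
        try {
            wait = Duration.ofSeconds(Long.parseLong(value));
        } catch (NumberFormatException notSeconds) {
            try {
                Instant when = ZonedDateTime
                        .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                wait = Duration.between(clock.instant(), when);
            } catch (DateTimeParseException notDate) {
                return DEFAULT_WAIT; // Unparseable value: use the default
            }
        }
        if (wait.isNegative()) {
            return Duration.ZERO;
        }
        return wait.compareTo(MAX_WAIT) > 0 ? MAX_WAIT : wait;
    }

    /** Call when the downstream service answers 503. */
    void onRejected(String retryAfterHeader) {
        Instant until = clock.instant().plus(parseRetryAfter(retryAfterHeader));
        // Keep the later deadline if several rejections race
        closedUntil.accumulateAndGet(until, (a, b) -> a.isAfter(b) ? a : b);
    }

    /** True when the gateway may forward to the downstream service again. */
    boolean allowsForwarding() {
        return !clock.instant().isBefore(closedUntil.get());
    }
}
```

The gateway would check allowsForwarding() before each call and answer 503 itself while the gate is closed. Injecting the Clock keeps the time-dependent behavior testable, and the upper bound guards against a misbehaving downstream service closing the gate indefinitely.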

Knowing When to Say No

The hardest resilience decision: refusing to serve a request that you could technically process. The payment service has capacity for 500 requests per second. A flash sale drives traffic to 2,000 requests per second. The service can process all 2,000 by using all thread pool capacity, draining connection pools, and increasing latency to 2 seconds per request. Every request eventually gets a response. No request is rejected. The SLO (99% under 500ms) is violated for 15 minutes. The error budget for the month is exhausted.

The alternative: admit at most 600 requests per second. The remaining 1,400 requests per second receive an immediate 503. The 600 admitted requests complete in 100ms. The SLO is maintained. The error budget is preserved. The shed requests can retry (with the Retry-After header) or be queued at the gateway.

The arithmetic is unambiguous: processing 600 requests well is better than processing 2,000 requests poorly. But the instinct to “handle everything” is strong. Load shedding feels like failure. In resilience engineering, load shedding is success: the system protected itself and maintained quality for the requests it accepted.
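The comparison can be made concrete with a back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions: the spike lasts the full 15 minutes at a constant 2,000 requests per second, every admitted request sees the same latency, and the latency SLO is measured over admitted requests. The class FlashSaleArithmetic and its names are illustrative only.

```java
/** Back-of-the-envelope comparison of the two flash-sale strategies. */
final class FlashSaleArithmetic {

    /** Request counts over the whole spike. */
    record Outcome(long withinSlo, long slow, long shed) {}

    static Outcome run(long offeredRps, long admitLimitRps,
                       long spikeSeconds, long latencyMs, long sloMs) {
        long admittedRps = Math.min(offeredRps, admitLimitRps);
        long admitted = admittedRps * spikeSeconds;
        long shed = (offeredRps - admittedRps) * spikeSeconds;
        // Simplification: all admitted requests share one latency value
        boolean meetsSlo = latencyMs <= sloMs;
        return new Outcome(meetsSlo ? admitted : 0,
                           meetsSlo ? 0 : admitted,
                           shed);
    }

    public static void main(String[] args) {
        long spikeSeconds = 15 * 60;
        // Accept everything: 2,000 rps admitted at 2 s latency
        System.out.println(run(2_000, 2_000, spikeSeconds, 2_000, 500));
        // Shed above 600 rps: admitted requests complete in 100 ms
        System.out.println(run(2_000, 600, spikeSeconds, 100, 500));
    }
}
```

Under these assumptions, accepting everything yields 1,800,000 requests and none of them within the 500 ms target. Shedding yields 540,000 requests within the target and 1,260,000 fast rejections that clients can retry.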

The Complete Resilience Architecture

The transaction platform’s resilience architecture, assembled from all chapters:

Client Request
  │
  ├─ Load Shedding Filter (CH19)
  │   └─ 503 if over capacity
  │
  ├─ Priority Admission Control (CH19)
  │   └─ Tier-based shedding under load
  │
  ├─ Degraded Mode Controller (CH17)
  │   └─ Routes to mode-specific processing
  │
  ├─ Per-Dependency Resilience Stack (CH9)
  │   ├─ Retry (CH5)
  │   ├─ Circuit Breaker (CH4)
  │   ├─ Rate Limiter (CH7)
  │   ├─ Bulkhead (CH6)
  │   ├─ Time Limiter (CH8)
  │   └─ HTTP Client with Timeouts (CH2)
  │
  ├─ Fallback Strategies (CH3)
  │   ├─ Cached Data (CH11)
  │   ├─ Default Values
  │   └─ Queue for Later (CH13)
  │
  ├─ Observability (CH16)
  │   ├─ Metrics per pattern per dependency
  │   ├─ SLO tracking and error budgets
  │   └─ Distributed tracing with resilience annotations
  │
  ├─ Testing (CH14, CH15)
  │   ├─ Integration tests with Testcontainers
  │   ├─ Contract tests for resilience boundaries
  │   └─ Chaos experiments with hypotheses
  │
  └─ Lifecycle (CH18)
      ├─ Graceful shutdown with connection draining
      ├─ Zero-downtime rolling updates
      └─ Cache warm-up before accepting traffic

Each layer is independent. Each layer is testable. Each layer has metrics. Each layer has alerts. The layers compose: load shedding protects the system from traffic spikes, admission control prioritizes valuable work, degraded modes route around broken dependencies, per-dependency patterns isolate failures, fallbacks provide acceptable alternatives, observability makes all of this visible, testing validates it works, and lifecycle management ensures the service’s own operational events do not become failure modes.

No single pattern makes a system resilient. The architecture of resilience emerges from the disciplined composition of individual patterns, each applied where it provides value, each configured based on measured behavior, and each observable to the team that operates the system.