resilience patterns in production

Circuit Breaker

Chapter 9 of 40

A timeout protects a single call. A circuit breaker protects a service from a failing dependency by stopping calls entirely when the failure rate crosses a threshold. The timeout says “this call took too long, give up.” The circuit breaker says “half of the last 100 calls to this dependency failed, stop trying.”

Without a circuit breaker, every request to the payment service still attempts the fraud detection call, waits for the timeout, and fails. The timeout keeps the thread from waiting forever, but the thread is still occupied for the full timeout duration (500ms in our configuration). At 100 requests per second, that is 100 × 0.5 = 50 threads tied up at any moment by calls that will time out. With a circuit breaker, once the failure rate exceeds the threshold, subsequent calls are rejected immediately without making the HTTP request. Thread occupation drops to near zero, the failing dependency gets relief from incoming requests, and the payment service recovers capacity for other work.
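The thread figure above is an application of Little's law: the average number of busy threads equals the arrival rate multiplied by how long each request holds a thread. A minimal sketch of that arithmetic follows. The class and method names are illustrative, and the 0.05ms rejection cost used in the usage note is an assumed figure, not a measurement.

```java
final class ThreadOccupancy {

    /**
     * Average number of threads busy at any moment:
     * arrival rate (requests/second) x hold time (seconds).
     */
    static double busyThreads(double requestsPerSecond, double holdMillis) {
        return requestsPerSecond * holdMillis / 1000.0;
    }
}
```

With the chapter's numbers, `ThreadOccupancy.busyThreads(100, 500)` gives 50 threads. If an open breaker rejects a call in roughly 0.05ms (assumed), the same traffic occupies about 0.005 threads on average.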

The State Machine

Figure: Circuit Breaker State Machine

The circuit breaker has three states. CLOSED is the normal state: requests pass through, and the circuit breaker counts successes and failures in a sliding window. When the failure rate in the sliding window exceeds a threshold (e.g., 50% of the last 100 calls), the breaker transitions to OPEN. In the OPEN state, all requests are rejected immediately (in Resilience4J, with a CallNotPermittedException). No HTTP call is made. After a configurable wait duration (e.g., 60 seconds), the breaker transitions to HALF_OPEN. In HALF_OPEN, a limited number of probe requests are allowed through. If the probes succeed, the breaker transitions back to CLOSED. If the probes fail, the breaker transitions back to OPEN and the wait timer restarts.
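The transitions described above fit in a small pure function. The sketch below is illustrative only: the event names are invented for this example and are not part of any library API.

```java
final class BreakerStateMachine {

    enum State { CLOSED, OPEN, HALF_OPEN }

    // Hypothetical event names, chosen to mirror the prose above.
    enum Event { FAILURE_RATE_EXCEEDED, WAIT_ELAPSED, PROBES_SUCCEEDED, PROBES_FAILED }

    /** Returns the next state, or the current state if the event does not apply to it. */
    static State next(State current, Event event) {
        switch (current) {
            case CLOSED:
                return event == Event.FAILURE_RATE_EXCEEDED ? State.OPEN : current;
            case OPEN:
                return event == Event.WAIT_ELAPSED ? State.HALF_OPEN : current;
            case HALF_OPEN:
                if (event == Event.PROBES_SUCCEEDED) return State.CLOSED;
                if (event == Event.PROBES_FAILED) return State.OPEN;
                return current;
            default:
                throw new IllegalStateException("unknown state " + current);
        }
    }
}
```

Keeping the transition table separate from timing and counting makes it easy to see that HALF_OPEN is the only state with two exits, and that CLOSED can never jump directly to HALF_OPEN.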

The Failure Mode

Without a circuit breaker on fraud detection:

  1. External scoring API becomes slow (response time: 5 seconds)
  2. Fraud detection becomes slow (response time: 5 seconds)
  3. Payment service sends 100 requests/second to fraud detection
  4. Each request holds a thread for 500ms (timeout fires)
  5. 50 threads continuously occupied by timing-out fraud calls (100 requests/second × 0.5 seconds)
  6. 150 of the pool's 200 threads remain for all other work
  7. For as long as fraud detection stays degraded, 25% of thread pool capacity is lost

With a circuit breaker:

  1. External scoring API becomes slow
  2. First 100 calls fail (sliding window fills)
  3. Failure rate exceeds 50% threshold
  4. Circuit breaker opens
  5. Subsequent calls rejected in microseconds, not milliseconds
  6. Thread occupation from fraud calls: near zero
  7. Fallback returns auto-approved score for low-value transactions
  8. After 60 seconds, half-open probe tests if fraud detection recovered
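Step 7 depends on a fallback that decides what to return when the fraud call is skipped. The sketch below shows one possible shape. The record types are simplified stand-ins for the chapter's domain classes, the approval limit is an assumed business rule, and declining everything above the limit is an assumption as well (a real system might queue those payments for manual review instead).

```java
import java.math.BigDecimal;

// Hypothetical stand-ins for the domain types, reduced to what this sketch needs.
record PaymentRequest(String userId, BigDecimal amount, String currency) {}
record FraudScore(double score, boolean approved) {}

final class LowValueFallback {

    // Assumed business rule: payments at or below this amount may skip the fraud check.
    private final BigDecimal autoApproveLimit;

    LowValueFallback(BigDecimal autoApproveLimit) {
        this.autoApproveLimit = autoApproveLimit;
    }

    /** Called when the breaker is open or the fraud call failed. */
    FraudScore fallbackScore(PaymentRequest request, Throwable cause) {
        boolean lowValue = request.amount().compareTo(autoApproveLimit) <= 0;
        // Low-value payments are auto-approved. Anything larger is declined
        // rather than approved without a fraud check.
        return lowValue ? new FraudScore(0.0, true) : new FraudScore(1.0, false);
    }
}
```

With a limit of 50.00, a 25.00 payment is approved by the fallback and a 5000.00 payment is not. The limit bounds how much unchecked risk the business accepts while the breaker is open.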

The Internals: From Scratch

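// FROM SCRATCH - Circuit breaker with sliding window
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Supplier;

public class CircuitBreaker<T> {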

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final int slidingWindowSize;
    private final double failureRateThreshold;
    private final Duration waitDurationInOpenState;
    private final int permittedCallsInHalfOpen;

    // Sliding window: circular buffer of call outcomes
    // true = success, false = failure
    private final AtomicReferenceArray<Boolean> slidingWindow;
    private final AtomicInteger windowIndex = new AtomicInteger(0);
    private final AtomicInteger failureCount = new AtomicInteger(0);
    private final AtomicInteger totalRecorded = new AtomicInteger(0);

    // Open state timing
    private volatile long openedAt = 0;

    // Half-open state: count of permitted probe calls
    private final AtomicInteger halfOpenCallCount = new AtomicInteger(0);

    public CircuitBreaker(int slidingWindowSize, double failureRateThreshold,
                          Duration waitDurationInOpenState, int permittedCallsInHalfOpen) {
        this.slidingWindowSize = slidingWindowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitDurationInOpenState = waitDurationInOpenState;
        this.permittedCallsInHalfOpen = permittedCallsInHalfOpen;
        this.slidingWindow = new AtomicReferenceArray<>(slidingWindowSize);
    }

    public T execute(Supplier<T> supplier, Supplier<T> fallback) {
        State currentState = state.get();

        if (currentState == State.OPEN) {
            // Check if wait duration has elapsed
            if (System.currentTimeMillis() - openedAt >= waitDurationInOpenState.toMillis()) {
                // Attempt transition to HALF_OPEN
                // Only one thread should succeed in this CAS operation
                if (state.compareAndSet(State.OPEN, State.HALF_OPEN)) {
                    halfOpenCallCount.set(0);
                }
            } else {
                // Still in wait period, reject immediately
                return fallback.get();
            }
        }

        if (state.get() == State.HALF_OPEN) {
            // Only permit a limited number of probe calls
            int probeCount = halfOpenCallCount.incrementAndGet();
            if (probeCount > permittedCallsInHalfOpen) {
                // Too many concurrent probes, reject
                return fallback.get();
            }
        }

        // Execute the call
        try {
            T result = supplier.get();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure();
            return fallback.get();
        }
    }

    private void recordSuccess() {
        if (state.get() == State.HALF_OPEN) {
            // In half-open, the first successful probe transitions back to closed
            // (a simplification: it does not wait for all permitted probes to finish).
            // Reset the sliding window
            if (state.compareAndSet(State.HALF_OPEN, State.CLOSED)) {
                resetSlidingWindow();
            }
            return;
        }
        recordOutcome(true);
    }

    private void recordFailure() {
        if (state.get() == State.HALF_OPEN) {
            // In half-open, a failure transitions back to open
            if (state.compareAndSet(State.HALF_OPEN, State.OPEN)) {
                openedAt = System.currentTimeMillis();
            }
            return;
        }
        recordOutcome(false);
        checkThreshold();
    }

    private void recordOutcome(boolean success) {
        int index = windowIndex.getAndUpdate(i -> (i + 1) % slidingWindowSize);
        Boolean previous = slidingWindow.getAndSet(index, success);

        if (previous != null && !previous) {
            // Replacing a previous failure: decrement failure count
            failureCount.decrementAndGet();
        }
        if (!success) {
            failureCount.incrementAndGet();
        }

        int recorded = totalRecorded.get();
        if (recorded < slidingWindowSize) {
            totalRecorded.incrementAndGet();
        }
    }

    private void checkThreshold() {
        int recorded = totalRecorded.get();
        if (recorded < slidingWindowSize) {
            // Not enough data yet
            return;
        }
        double failureRate = (double) failureCount.get() / slidingWindowSize * 100;
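        if (failureRate >= failureRateThreshold) {
            if (state.compareAndSet(State.CLOSED, State.OPEN)) {
                // Not atomic with the CAS above: a thread that observes OPEN before
                // this write reads a stale openedAt and may move to HALF_OPEN at once.
                openedAt = System.currentTimeMillis();
            }
        }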
    }

    private void resetSlidingWindow() {
        for (int i = 0; i < slidingWindowSize; i++) {
            slidingWindow.set(i, null);
        }
        failureCount.set(0);
        totalRecorded.set(0);
        windowIndex.set(0);
    }

    public State getState() {
        return state.get();
    }
}

What the Scratch Implementation Reveals

Building the circuit breaker from scratch exposes three critical details that Resilience4J hides behind configuration:

The sliding window is a concurrency challenge. The circular buffer built on AtomicReferenceArray, with an AtomicInteger for the index, is correct for basic cases, but each step is atomic only on its own. Claiming a slot with getAndUpdate, writing it with getAndSet, and adjusting failureCount are three separate operations. Under very high concurrency, another thread can run checkThreshold between them and compute a failure rate from a count that does not match the buffer, and a thread that claimed a slot but was delayed can overwrite a newer outcome after the index wraps around. Resilience4J solves this with a more sophisticated ring buffer implementation. The point is not that the from-scratch implementation is production-ready. The point is that you now understand why the sliding window size matters: a window of 10 calls means the circuit breaker makes decisions on very little data, and two concurrent failures can swing the failure rate by 20 percentage points.
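One simple way to keep the buffer and the failure count consistent is to guard both with a single lock. The sketch below swaps in that technique and makes no claim about how Resilience4J implements its window. The class name is illustrative.

```java
final class LockedSlidingWindow {

    private final boolean[] failed; // true = the slot holds a failure
    private int next = 0;           // slot that the next outcome overwrites
    private int recorded = 0;       // outcomes recorded so far, capped at the window size
    private int failures = 0;       // failures currently inside the window

    LockedSlidingWindow(int size) {
        this.failed = new boolean[size];
    }

    /**
     * Records one outcome and returns the failure rate (percent) over the
     * recorded calls. Eviction, write, and count update happen under one lock,
     * so a reader never sees a count that disagrees with the buffer.
     */
    synchronized double record(boolean success) {
        if (recorded == failed.length) {
            if (failed[next]) {
                failures--;          // the oldest outcome leaves the window
            }
        } else {
            recorded++;
        }
        failed[next] = !success;
        if (!success) {
            failures++;
        }
        next = (next + 1) % failed.length;
        return 100.0 * failures / recorded;
    }
}
```

In a window of 10, eight successes followed by two failures yield a rate of 20%, which is the swing described above. The lock costs some throughput compared with lock-free counters, a trade that is usually acceptable next to the network call being guarded.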

The HALF_OPEN state has a thundering herd problem. When the wait duration expires, every thread that arrives simultaneously tries the compareAndSet(OPEN, HALF_OPEN) transition. One succeeds. The others see HALF_OPEN and check halfOpenCallCount. If permittedCallsInHalfOpen is 1, only one probe gets through. If 50 threads arrive in the same millisecond, 49 threads execute the fallback. This is correct behavior, but you must understand it to configure the probe count. Setting permittedCallsInHalfOpen to 1 means a single failed probe sends the breaker back to OPEN. Setting it to 10 means 10 requests test the recovering dependency simultaneously, which could overwhelm it if it has not fully recovered.
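The probe limit can be isolated into a small gate to see the behavior directly. This is a sketch with illustrative names. It uses a parallel stream to stand in for threads arriving at the same time.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

final class HalfOpenGate {

    private final int permittedCalls;
    private final AtomicInteger arrivals = new AtomicInteger(0);

    HalfOpenGate(int permittedCalls) {
        this.permittedCalls = permittedCalls;
    }

    /** True if the caller may probe the dependency; false means use the fallback. */
    boolean tryAcquire() {
        // incrementAndGet hands every caller a unique ticket number,
        // so exactly permittedCalls callers can pass.
        return arrivals.incrementAndGet() <= permittedCalls;
    }

    /** Simulates concurrent callers and counts how many were let through. */
    static long probesAdmitted(int permittedCalls, int callers) {
        HalfOpenGate gate = new HalfOpenGate(permittedCalls);
        return IntStream.range(0, callers).parallel()
                .filter(i -> gate.tryAcquire())
                .count();
    }
}
```

`HalfOpenGate.probesAdmitted(1, 50)` admits one caller and sends the other 49 to the fallback, regardless of how the threads interleave. Raising the limit to 10 admits exactly 10.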

The failure rate threshold is only meaningful with enough data. On a handful of calls, the failure rate calculation is unreliable. Two failures in a window of 3 recorded calls is a 67% failure rate, enough to trip a 50% threshold. But two failures out of the first three calls after startup might be a DNS resolution delay, not a real failure. The scratch implementation sidesteps the problem by refusing to evaluate until the window is completely full, which also delays opening until 100 calls have been recorded. Resilience4J addresses this with minimumNumberOfCalls: the breaker does not calculate the failure rate until at least N calls have been recorded.
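The effect of a minimum call count is easy to show as a pure function. The sketch below uses illustrative names and returns -1 when there is too little data, mirroring the sentinel that Resilience4J reports for its failure-rate metric.

```java
final class FailureRate {

    /** Failure rate in percent, or -1 when fewer than minimumCalls outcomes exist. */
    static double of(int failures, int recorded, int minimumCalls) {
        if (recorded < minimumCalls) {
            return -1.0; // not enough data to judge
        }
        return 100.0 * failures / recorded;
    }

    /** True only when there is enough data and the rate reaches the threshold. */
    static boolean shouldOpen(int failures, int recorded,
                              int minimumCalls, double thresholdPercent) {
        double rate = of(failures, recorded, minimumCalls);
        return rate >= 0 && rate >= thresholdPercent;
    }
}
```

With a minimum of 20 calls, two failures out of three do not open the breaker: `FailureRate.shouldOpen(2, 3, 20, 50)` is false. Ten failures out of twenty do: `FailureRate.shouldOpen(10, 20, 20, 50)` is true.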

The Production Implementation

# PRODUCTION - application.yml
resilience4j:
  circuitbreaker:
    instances:
      fraudDetection:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 100
        # Evaluate failure rate over the last 100 calls.
        # COUNT_BASED is appropriate for steady-traffic services.
        # Use TIME_BASED for bursty traffic.

        failure-rate-threshold: 50
        # Open the circuit when 50% of calls in the window fail.
        # 50% means the dependency is failing as often as succeeding.
        # For the fraud service, this is clearly broken.

        minimum-number-of-calls: 20
        # Do not evaluate failure rate until at least 20 calls recorded.
        # Prevents false positives during startup or low-traffic periods.

        wait-duration-in-open-state: 60s
        # Wait 60 seconds before probing.
        # Long enough for transient issues (GC pauses, deployment rollouts)
        # to resolve. Short enough to resume service within a reasonable time.

        permitted-number-of-calls-in-half-open-state: 5
        # Send 5 probe requests in half-open.
        # More than 1 to avoid a single unlucky request keeping the breaker open.
        # Fewer than 20 to avoid overwhelming a recovering service.

        automatic-transition-from-open-to-half-open-enabled: true
        # Do not wait for a request to trigger the OPEN->HALF_OPEN transition.
        # Use a timer. This matters during low-traffic periods where the
        # breaker could stay open for hours waiting for a request.

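        record-exceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.ResourceAccessException
          - org.springframework.web.client.HttpServerErrorException
        # Only count these as failures for the circuit breaker.
        # With record-exceptions set, any exception not listed is recorded as a
        # success, so 5xx responses must be listed explicitly. The
        # HttpServerErrorException entry assumes a RestTemplate-based client.
        # Do not count 4xx responses (client errors, not service failures).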

        ignore-exceptions:
          - com.txn.payment.exception.FraudCheckDeclinedException
        # A decline is a valid business response, not a failure.

  timelimiter:
    instances:
      fraudDetection:
        timeout-duration: 2s
        # Separate from the HTTP client timeout.
        # This is the Resilience4J-managed timeout that wraps the entire call.

// PRODUCTION - Circuit breaker with Spring Boot and Resilience4J
@Service
public class FraudDetectionService {

    private final FraudDetectionClient fraudClient;
    private final FraudFallback fallback;
    private final CircuitBreakerRegistry circuitBreakerRegistry;

    public FraudDetectionService(FraudDetectionClient fraudClient,
                                  FraudFallback fallback,
                                  CircuitBreakerRegistry circuitBreakerRegistry) {
        this.fraudClient = fraudClient;
        this.fallback = fallback;
        this.circuitBreakerRegistry = circuitBreakerRegistry;
    }

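    // @TimeLimiter is deliberately not applied: Resilience4J's Spring annotation
    // supports only asynchronous return types (CompletionStage or reactive types)
    // and rejects a plain FraudScore. The HTTP client timeout bounds this call;
    // using the timelimiter config requires a CompletableFuture-returning variant.
    @io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker(
            name = "fraudDetection", fallbackMethod = "fraudFallback")
    public FraudScore checkFraud(PaymentRequest request) {
        return fraudClient.score(request);
    }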

    private FraudScore fraudFallback(PaymentRequest request, Throwable cause) {
        return fallback.fallbackScore(request, cause);
    }
}

The Test

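// PRODUCTION - Integration test proving circuit breaker behavior
import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.math.BigDecimal;
import java.time.Duration;
import java.util.Map;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.springframework.web.client.RestTemplate;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)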
@Testcontainers
class CircuitBreakerIntegrationTest {

    @Container
    static GenericContainer<?> wireMock = new GenericContainer<>(
            DockerImageName.parse("wiremock/wiremock:latest"))
            .withExposedPorts(8080);

    @DynamicPropertySource
    static void configureProperties(DynamicPropertyRegistry registry) {
        registry.add("fraud.service.url", () ->
                "http://localhost:" + wireMock.getMappedPort(8080));
        // Smaller window for testing
        registry.add("resilience4j.circuitbreaker.instances.fraudDetection.sliding-window-size",
                () -> "10");
        registry.add("resilience4j.circuitbreaker.instances.fraudDetection.minimum-number-of-calls",
                () -> "5");
        registry.add("resilience4j.circuitbreaker.instances.fraudDetection.wait-duration-in-open-state",
                () -> "5s");
    }

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @Autowired
    private FraudDetectionService fraudDetectionService;

    @Test
    void circuitBreaker_opensOnFailures_closesOnRecovery() {
        io.github.resilience4j.circuitbreaker.CircuitBreaker breaker =
                circuitBreakerRegistry.circuitBreaker("fraudDetection");

        // Phase 1: Fraud service returns errors
        stubFraudServiceError();
        for (int i = 0; i < 10; i++) {
            try {
                fraudDetectionService.checkFraud(samplePayment());
            } catch (Exception ignored) {}
        }

        // Circuit breaker should be OPEN
        assertThat(breaker.getState())
                .isEqualTo(io.github.resilience4j.circuitbreaker.CircuitBreaker.State.OPEN);

        // Phase 2: Calls rejected without hitting fraud service
        long start = System.nanoTime();
        try {
            fraudDetectionService.checkFraud(samplePayment());
        } catch (Exception ignored) {}
        long elapsed = Duration.ofNanos(System.nanoTime() - start).toMillis();

        // Rejection should be near-instant, far below the 500ms timeout.
        // The 50ms bound leaves headroom for JIT warm-up and GC pauses.
        assertThat(elapsed).isLessThan(50);

        // Phase 3: Wait for half-open, restore fraud service, verify recovery
        stubFraudServiceHealthy();
        await().atMost(Duration.ofSeconds(10))
                .until(() -> breaker.getState() !=
                        io.github.resilience4j.circuitbreaker.CircuitBreaker.State.OPEN);

        // Send probe requests
        for (int i = 0; i < 5; i++) {
            FraudScore score = fraudDetectionService.checkFraud(samplePayment());
            assertThat(score.approved()).isTrue();
        }

        // Circuit breaker should be CLOSED again
        assertThat(breaker.getState())
                .isEqualTo(io.github.resilience4j.circuitbreaker.CircuitBreaker.State.CLOSED);
    }

    private void stubFraudServiceError() {
        String url = "http://localhost:" + wireMock.getMappedPort(8080);
        new RestTemplate().postForEntity(url + "/__admin/mappings",
                Map.of("request", Map.of("method", "POST", "url", "/api/fraud/score"),
                       "response", Map.of("status", 503)),
                String.class);
    }

    private void stubFraudServiceHealthy() {
        String url = "http://localhost:" + wireMock.getMappedPort(8080);
        // Remove the 503 stub first. RestTemplate exposes delete(), not deleteForObject().
        new RestTemplate().delete(url + "/__admin/mappings");
        new RestTemplate().postForEntity(url + "/__admin/mappings",
                Map.of("request", Map.of("method", "POST", "url", "/api/fraud/score"),
                       "response", Map.of("status", 200,
                               "jsonBody", Map.of("score", 0.1, "approved", true))),
                String.class);
    }

    private PaymentRequest samplePayment() {
        return new PaymentRequest("user-1", BigDecimal.valueOf(25.00), "USD");
    }
}

The Observable Signal

The Prometheus metrics exposed by Resilience4J:

# Metric names as exported through Micrometer's Prometheus registry.
# Circuit breaker state: one gauge per state, 1 for the current state, 0 for the others
resilience4j_circuitbreaker_state{name="fraudDetection", state="closed"} 1
resilience4j_circuitbreaker_state{name="fraudDetection", state="open"} 0
resilience4j_circuitbreaker_state{name="fraudDetection", state="half_open"} 0

# Calls by outcome
resilience4j_circuitbreaker_calls_seconds_count{name="fraudDetection", kind="successful"}
resilience4j_circuitbreaker_calls_seconds_count{name="fraudDetection", kind="failed"}

# Calls rejected while the breaker is open
resilience4j_circuitbreaker_not_permitted_calls_total{name="fraudDetection", kind="not_permitted"}

# Failure rate
resilience4j_circuitbreaker_failure_rate{name="fraudDetection"} -1.0
# -1.0 means not enough calls recorded yet (below minimumNumberOfCalls)

The Grafana panel that matters: a single stat panel showing the circuit breaker state, color-coded green/red/amber for CLOSED/OPEN/HALF_OPEN. When this panel turns red, the breaker has judged the fraud detection service unhealthy and is rejecting calls. The not_permitted counter shows how many calls the circuit breaker short-circuited, each one a call that did not hold a thread for the timeout duration.

Alert when resilience4j_circuitbreaker_state{name="fraudDetection", state="open"} == 1 for more than 2 minutes. A breaker that opens and closes within a minute or two is handling a transient issue. A breaker that stays open for 5 minutes or more indicates a sustained outage that requires human intervention.