Sliding Window Strategies and Slow Call Detection

The circuit breaker in Chapter 4 uses a count-based sliding window: it tracks the last N calls regardless of when they occurred. That works well for services with consistent traffic. For services with variable traffic, a time-based window is often the better fit.

Count-Based vs. Time-Based

Count-based window (sliding-window-type: COUNT_BASED): evaluates the last N calls. If the fraud detection service handles 100 calls per second, a window of 100 covers the last 1 second. If it handles 1 call per second, the same window covers the last 100 seconds. The responsiveness of the circuit breaker changes with traffic volume.

Time-based window (sliding-window-type: TIME_BASED): evaluates all calls in the last N seconds. A window of 60 seconds always covers exactly 60 seconds regardless of traffic. At 100 calls/second, that is 6,000 calls. At 1 call/second, that is 60 calls. The statistical significance changes with traffic volume.
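The arithmetic behind both window types can be sketched in a few lines of Java. The class and method names here are illustrative and not part of Resilience4j, and the sketch assumes a perfectly steady call rate.

```java
public class WindowCoverage {

    /** Seconds of history spanned by a count-based window at a steady call rate. */
    public static double secondsCovered(int windowSize, double callsPerSecond) {
        return windowSize / callsPerSecond;
    }

    /** Number of calls held by a time-based window at a steady call rate. */
    public static long callsCovered(int windowSeconds, double callsPerSecond) {
        return Math.round(windowSeconds * callsPerSecond);
    }

    public static void main(String[] args) {
        // Count-based window of 100 calls: the time span shrinks as traffic grows.
        System.out.println(secondsCovered(100, 100.0)); // 1.0
        System.out.println(secondsCovered(100, 1.0));   // 100.0

        // Time-based window of 60 seconds: the sample size grows with traffic.
        System.out.println(callsCovered(60, 100.0));    // 6000
        System.out.println(callsCovered(60, 1.0));      // 60
    }
}
```

The count-based window fixes the sample size and lets the time span float; the time-based window does the opposite.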

For the payment service calling fraud detection, count-based is the correct choice. The traffic is consistent during business hours (the payment platform processes a steady stream of transactions), and you want the circuit breaker to respond based on a fixed sample size rather than a fixed time window.

Use time-based when traffic is bursty. A batch job that sends 10,000 requests in 10 seconds and then nothing for 50 seconds would pollute a count-based window: the breaker would still be evaluating results from the burst long after it ended.
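The burst scenario can be simulated with two deliberately simplified windows. This is a sketch for illustration only: the classes below are hypothetical and do not mirror how Resilience4j stores its measurements, and the burst is assumed to consist entirely of failing calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BurstWindowDemo {

    /** Count-based: holds the last `size` outcomes; entries leave only when new calls arrive. */
    public static final class CountWindow {
        private final int size;
        private final Deque<Boolean> failures = new ArrayDeque<>();

        public CountWindow(int size) {
            this.size = size;
        }

        public void record(boolean failed) {
            if (failures.size() == size) {
                failures.removeFirst(); // evict the oldest outcome
            }
            failures.addLast(failed);
        }

        public int calls() {
            return failures.size();
        }

        public double failureRate() {
            if (failures.isEmpty()) {
                return 0.0;
            }
            long failed = failures.stream().filter(f -> f).count();
            return 100.0 * failed / failures.size();
        }
    }

    /** Time-based: holds outcomes from the last `seconds`; entries expire as the clock advances. */
    public static final class TimeWindow {
        private final long seconds;
        // Each entry is {recordedAtSecond, failed ? 1 : 0}.
        private final Deque<long[]> entries = new ArrayDeque<>();

        public TimeWindow(long seconds) {
            this.seconds = seconds;
        }

        public void record(long nowSecond, boolean failed) {
            entries.addLast(new long[] {nowSecond, failed ? 1 : 0});
        }

        public int calls(long nowSecond) {
            // Drop everything recorded at or before (now - seconds).
            while (!entries.isEmpty() && entries.peekFirst()[0] <= nowSecond - seconds) {
                entries.removeFirst();
            }
            return entries.size();
        }
    }

    public static void main(String[] args) {
        CountWindow countWindow = new CountWindow(100);
        TimeWindow timeWindow = new TimeWindow(60);

        // Burst: 1,000 failing calls per second for seconds 0..9, then silence.
        for (long second = 0; second < 10; second++) {
            for (int i = 0; i < 1000; i++) {
                countWindow.record(true);
                timeWindow.record(second, true);
            }
        }

        long later = 120; // two minutes after the burst started, no traffic since
        System.out.println(countWindow.calls() + " calls, " + countWindow.failureRate() + "% failed");
        System.out.println(timeWindow.calls(later) + " calls in the time window");
    }
}
```

Two minutes after the burst, the count-based window still holds the final 100 burst results, while the time-based window has expired all of them.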

Slow Call Rate Threshold

The failure rate threshold catches hard failures: exceptions, timeouts, HTTP 5xx. It does not catch soft degradation where the dependency responds successfully but slowly.

If fraud detection normally responds in 50ms and starts responding in 4,500ms (just under your 5-second timeout), every call succeeds. The failure rate is 0%. The circuit breaker never opens. But each call now holds a thread 90x longer than normal, so the same request rate needs 90x as many threads.
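The thread cost can be estimated with Little's law: average calls in flight equal the arrival rate multiplied by the time each call holds a thread. The sketch below assumes a steady 100 calls per second (the figure used earlier for illustration) and one blocked thread per in-flight call. The class name is hypothetical.

```java
public class ThreadDemand {

    /** Little's law: average calls in flight = arrival rate x time each call holds a thread. */
    public static double threadsInUse(double callsPerSecond, long latencyMillis) {
        return callsPerSecond * latencyMillis / 1000.0;
    }

    public static void main(String[] args) {
        double rate = 100.0; // assumed steady traffic, in calls per second

        System.out.println(threadsInUse(rate, 50));   // 5.0   (normal latency)
        System.out.println(threadsInUse(rate, 4500)); // 450.0 (degraded latency)
    }
}
```

Under these assumptions the degraded dependency needs about 450 threads where 5 were enough before, which would exhaust most reasonably sized pools while the failure rate still reads 0%.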

The slow call rate threshold catches this:

# PRODUCTION - application.yml
resilience4j:
  circuitbreaker:
    instances:
      fraudDetection:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 100
        failure-rate-threshold: 50
        slow-call-rate-threshold: 80
        # Open the circuit when at least 80% of calls are "slow."
        # A slow call is defined by slow-call-duration-threshold below.

        slow-call-duration-threshold: 500ms
        # Any call that takes longer than 500ms is "slow."
        # Fraud detection normal p99 is 120ms. A call taking 500ms
        # is roughly 4x the expected latency. 80% of calls being 4x slower
        # than expected indicates systemic degradation, not tail latency.

        minimum-number-of-calls: 20
        wait-duration-in-open-state: 60s
        permitted-number-of-calls-in-half-open-state: 5

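With this configuration, once at least 20 calls have been recorded (minimum-number-of-calls), the circuit breaker opens under either of two conditions:

1. At least 50% of calls in the window fail (exceptions/timeouts).
2. At least 80% of calls in the window take longer than 500ms.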

Condition 2 catches the scenario where the dependency is technically responding but holding each thread at least 10x longer than the normal 50ms. The 80% threshold is intentionally high: you do not want to open the circuit because of tail latency. You want to open it because the dependency is consistently slow.
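The decision logic described above can be sketched in plain Java. This is an illustration of the two thresholds under the configured values, not Resilience4j's implementation. The class, enum, and helper names are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;

public class ThresholdCheck {

    public enum Verdict { NOT_ENOUGH_CALLS, STAY_CLOSED, OPEN_FAILURE_RATE, OPEN_SLOW_CALL_RATE }

    /** One recorded call: whether it failed and how long it took. */
    public static final class Call {
        final boolean failed;
        final Duration duration;

        public Call(boolean failed, Duration duration) {
            this.failed = failed;
            this.duration = duration;
        }
    }

    // Values copied from the application.yml above.
    static final double FAILURE_RATE_THRESHOLD = 50.0;
    static final double SLOW_CALL_RATE_THRESHOLD = 80.0;
    static final Duration SLOW_CALL_DURATION_THRESHOLD = Duration.ofMillis(500);
    static final int MINIMUM_NUMBER_OF_CALLS = 20;

    public static Verdict evaluate(List<Call> window) {
        if (window.size() < MINIMUM_NUMBER_OF_CALLS) {
            return Verdict.NOT_ENOUGH_CALLS; // rates are not evaluated yet
        }
        long failed = window.stream().filter(c -> c.failed).count();
        long slow = window.stream()
                .filter(c -> c.duration.compareTo(SLOW_CALL_DURATION_THRESHOLD) > 0)
                .count();
        double failureRate = 100.0 * failed / window.size();
        double slowCallRate = 100.0 * slow / window.size();

        if (failureRate >= FAILURE_RATE_THRESHOLD) {
            return Verdict.OPEN_FAILURE_RATE;
        }
        if (slowCallRate >= SLOW_CALL_RATE_THRESHOLD) {
            return Verdict.OPEN_SLOW_CALL_RATE;
        }
        return Verdict.STAY_CLOSED;
    }

    /** Helper: a window of `n` identical calls. */
    public static List<Call> uniform(int n, boolean failed, long millis) {
        return Collections.nCopies(n, new Call(failed, Duration.ofMillis(millis)));
    }

    public static void main(String[] args) {
        System.out.println(evaluate(uniform(100, false, 50)));   // STAY_CLOSED
        System.out.println(evaluate(uniform(100, false, 4500))); // OPEN_SLOW_CALL_RATE
        System.out.println(evaluate(uniform(100, true, 50)));    // OPEN_FAILURE_RATE
        System.out.println(evaluate(uniform(10, false, 4500)));  // NOT_ENOUGH_CALLS
    }
}
```

The second case is the 4,500ms scenario: no call fails, yet the slow call rate alone is enough to open the circuit.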

The Combined Prometheus Query

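# Alert when the circuit breaker is open and the slow call rate is at or above its threshold.
# Assumes Micrometer tagged metrics, where the state gauge carries a state label;
# on(name) is needed because the two series have different label sets.
resilience4j_circuitbreaker_slow_call_rate{name="fraudDetection"} >= 80
and on(name)
resilience4j_circuitbreaker_state{name="fraudDetection", state="open"} == 1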

This distinguishes between a breaker that opened due to hard failures (dependency crashed) and a breaker that opened due to slow calls (dependency degraded). The response to each is different: a crashed dependency may need a restart, while a degraded dependency may need its own upstream issues resolved. The slow call rate metric gives the on-call engineer the diagnostic information to act.