Sliding Window Strategies and Slow Call Detection
The circuit breaker in Chapter 4 uses a count-based sliding window: it tracks the last N calls regardless of when they occurred. This is appropriate for services with consistent traffic. For services with variable traffic, a time-based window is more appropriate.
Count-Based vs. Time-Based
Count-based window (sliding-window-type: COUNT_BASED): evaluates the last N calls. If the fraud detection service handles 100 calls per second, a window of 100 covers the last 1 second. If it handles 1 call per second, the same window covers the last 100 seconds. The responsiveness of the circuit breaker changes with traffic volume.
Time-based window (sliding-window-type: TIME_BASED): evaluates all calls in the last N seconds. A window of 60 seconds always covers exactly 60 seconds regardless of traffic. At 100 calls/second, that is 6,000 calls. At 1 call/second, that is 60 calls. The statistical significance changes with traffic volume.
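The trade-off between the two window types is plain arithmetic. The following is a minimal sketch of it, not part of Resilience4j; the function names are hypothetical and the traffic rates are the ones used in the two paragraphs above.

```python
def count_window_span_seconds(window_size: int, calls_per_second: float) -> float:
    """Seconds of history covered by a count-based window of `window_size` calls."""
    return window_size / calls_per_second


def time_window_sample_size(window_seconds: float, calls_per_second: float) -> float:
    """Number of calls evaluated by a time-based window of `window_seconds` seconds."""
    return window_seconds * calls_per_second


for rate in (100, 1):
    span = count_window_span_seconds(100, rate)
    samples = time_window_sample_size(60, rate)
    print(f"{rate:>3} calls/s: a count window of 100 spans {span:g}s, "
          f"a 60s time window holds {samples:g} calls")
```

A count-based window fixes the sample size and lets the time span float; a time-based window fixes the time span and lets the sample size float.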
For the payment service calling fraud detection, count-based is the correct choice. The traffic is consistent during business hours (the payment platform processes a steady stream of transactions), and you want the circuit breaker to respond based on a fixed sample size rather than a fixed time window.
Use time-based when traffic is bursty. A batch job that sends 10,000 requests in 10 seconds and then nothing for 50 seconds would pollute a count-based window: the breaker would still be evaluating results from the burst long after it ended.
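The bursty case can be sketched with two toy windows. This is an illustration only, not how Resilience4j implements its windows: the class names, the 10-second time window, and the failure pattern (every call in the last second of the burst fails) are assumptions made for the example, and an empty window reports 0.0 here purely for simplicity.

```python
from collections import deque


class CountWindow:
    """Keeps the outcomes of the last `size` calls, however old they are."""

    def __init__(self, size: int) -> None:
        self.outcomes = deque(maxlen=size)

    def record(self, now: float, failed: bool) -> None:
        self.outcomes.append(failed)  # the timestamp is ignored on purpose

    def failure_rate(self, now: float) -> float:
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)


class TimeWindow:
    """Keeps only the outcomes recorded in the last `seconds` seconds."""

    def __init__(self, seconds: float) -> None:
        self.seconds = seconds
        self.outcomes = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, now: float, failed: bool) -> None:
        self.outcomes.append((now, failed))

    def failure_rate(self, now: float) -> float:
        # Drop entries that have aged out of the window.
        while self.outcomes and self.outcomes[0][0] <= now - self.seconds:
            self.outcomes.popleft()
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(failed for _, failed in self.outcomes) / len(self.outcomes)


count_window = CountWindow(size=100)
time_window = TimeWindow(seconds=10)

# Burst: 10,000 calls spread over t = 0..10s; calls in the final second fail.
for i in range(10_000):
    t = i / 1000.0
    failed = t >= 9.0
    count_window.record(t, failed)
    time_window.record(t, failed)

# 30 seconds after the burst ended, with no traffic in between.
now = 40.0
print(count_window.failure_rate(now))  # 100.0: still judging the burst
print(time_window.failure_rate(now))   # 0.0: the burst has aged out
```

The count-based window has no notion of age, so the last 100 results of the burst stay in it until new calls push them out.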
Slow Call Rate Threshold
The failure rate threshold catches hard failures: exceptions, timeouts, HTTP 5xx. It does not catch soft degradation where the dependency responds successfully but slowly.
If fraud detection normally responds in 50ms and starts responding in 4,500ms (just under your 5-second timeout), every call still succeeds. The failure rate is 0%, so the circuit breaker never opens. But each call now holds a thread 90 times longer than normal, and the thread pool is exhausted correspondingly faster.
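The 90x figure can be turned into thread counts with Little's Law (average concurrency = arrival rate × time in system). The sketch below is illustrative; the function name is hypothetical, and the rate of 100 calls per second is an assumption borrowed from the earlier window example, not a measured value.

```python
def threads_in_use(calls_per_second: float, latency_ms: float) -> float:
    """Average number of threads busy at once (Little's Law: L = lambda * W)."""
    return calls_per_second * latency_ms / 1000.0


rate = 100  # assumed steady request rate, calls per second

normal = threads_in_use(rate, 50)        # healthy dependency, 50ms per call
degraded = threads_in_use(rate, 4500)    # degraded dependency, 4,500ms per call

print(f"normal:   {normal:g} threads busy")
print(f"degraded: {degraded:g} threads busy")
print(f"ratio:    {degraded / normal:g}x")
```

At the assumed rate, a pool that is comfortable with 5 busy threads would need 450 to keep up, even though not one call has failed.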
The slow call rate threshold catches this:
# PRODUCTION - application.yml
resilience4j:
  circuitbreaker:
    instances:
      fraudDetection:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 100
        failure-rate-threshold: 50
        slow-call-rate-threshold: 80
        # Open the circuit when 80% of calls are "slow."
        # A slow call is defined by slow-call-duration-threshold below.
        slow-call-duration-threshold: 500ms
        # Any call that takes longer than 500ms is "slow."
        # Fraud detection normal p99 is 120ms. A call taking 500ms
        # is 4x the expected latency. 80% of calls being 4x slower
        # than expected indicates systemic degradation, not tail latency.
        minimum-number-of-calls: 20
        wait-duration-in-open-state: 60s
        permitted-number-of-calls-in-half-open-state: 5
With this configuration, once at least 20 calls have been recorded, the circuit breaker opens under either of two conditions:
- At least 50% of calls in the window fail (exceptions/timeouts)
- At least 80% of calls in the window take longer than 500ms
The second condition catches the scenario where the dependency is technically responding but holding each thread at least ten times longer than its typical 50ms response. The 80% threshold is intentionally high: you do not want to open the circuit because of tail latency. You want to open it because the dependency is systematically slow.
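The two conditions can be sketched as a single decision function. This is a simplified model for illustration, not Resilience4j's implementation: the `Call` type and `should_open` function are hypothetical, and the default thresholds mirror the configuration above.

```python
from dataclasses import dataclass


@dataclass
class Call:
    duration_ms: float
    failed: bool


def should_open(calls, failure_rate_threshold=50.0, slow_call_rate_threshold=80.0,
                slow_call_duration_ms=500.0, minimum_number_of_calls=20):
    """Return (open?, failure_rate, slow_call_rate) for the calls in the window."""
    total = len(calls)
    if total < minimum_number_of_calls:
        return False, None, None  # not enough data to compute rates
    failure_rate = 100.0 * sum(c.failed for c in calls) / total
    # A call is slow if it exceeds the duration threshold, whether or not it failed.
    slow_rate = 100.0 * sum(c.duration_ms > slow_call_duration_ms for c in calls) / total
    is_open = (failure_rate >= failure_rate_threshold
               or slow_rate >= slow_call_rate_threshold)
    return is_open, failure_rate, slow_rate


# Soft degradation: every call succeeds, but each takes 4,500ms.
degraded = [Call(4500, False)] * 100
print(should_open(degraded))  # (True, 0.0, 100.0)

# Tail latency only: 5% of calls are slow, the rest are healthy.
tail = [Call(50, False)] * 95 + [Call(900, False)] * 5
print(should_open(tail))      # (False, 0.0, 5.0)
```

The degraded case opens the circuit with a 0% failure rate, which is exactly the case the failure rate threshold alone would miss.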
The Combined Prometheus Query
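# Alert when the circuit breaker is open and the slow call rate is at or above its threshold.
# Assumes the Micrometer metrics binding, which exposes the state as a "state" label
# (value 1 for the current state, 0 for the others).
resilience4j_circuitbreaker_slow_call_rate{name="fraudDetection"} >= 80
and ignoring(state)
resilience4j_circuitbreaker_state{name="fraudDetection", state="open"} == 1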
This distinguishes between a breaker that opened due to hard failures (dependency crashed) and a breaker that opened due to slow calls (dependency degraded). The response to each is different: a crashed dependency may need a restart, while a degraded dependency may need its own upstream issues resolved. The slow call rate metric gives the on-call engineer the diagnostic information to act.