Retry
Retry is the most dangerous resilience pattern. A retry policy that is correct at the level of a single request becomes destructive at the level of a system. One instance making up to 3 attempts per call can triple its outgoing request rate to a failing dependency. Stack that policy across two layers of a call chain and the worst case is a 9x amplification, because every attempt at the outer layer fans out into 3 attempts at the inner one. If the dependency is slow because of overload, retries make the overload worse. The dependency that was slow becomes unreachable. The retry “strategy” accelerated the failure.
This is a retry storm. It is the second most common cause of cascading failures after thread pool exhaustion, and it is entirely self-inflicted.
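The compounding effect can be put into numbers. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the worst case where every attempt at every layer fails, so each layer multiplies the traffic by its attempt count.

```java
/** Worst-case request amplification when several layers of a call chain each retry. */
final class RetryAmplification {

    /**
     * @param attemptsPerLayer total attempts each layer makes (original call + retries)
     * @param retryingLayers   number of stacked layers that apply that retry policy
     */
    static long worstCase(int attemptsPerLayer, int retryingLayers) {
        long factor = 1;
        for (int layer = 0; layer < retryingLayers; layer++) {
            // Every attempt at this layer triggers a full set of attempts one layer down.
            factor = Math.multiplyExact(factor, (long) attemptsPerLayer);
        }
        return factor;
    }

    public static void main(String[] args) {
        System.out.println(worstCase(3, 1)); // one retrying layer
        System.out.println(worstCase(3, 2)); // two stacked retrying layers
        System.out.println(worstCase(3, 3)); // three stacked retrying layers
    }
}
```

Adding more instances at the same layer scales the absolute traffic but not this factor. Depth is what multiplies.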
The Failure Mode
The payment gateway returns HTTP 503 intermittently. 10% of requests fail. Without retry, 10% of payments fail. With a naive retry of 3 attempts:
- First attempt: 100 requests sent. 10 fail (10%).
- Second attempt: 10 retries sent. 1 fails (10%).
- Third attempt: 1 retry sent. Succeeds (probably).
Total requests to payment gateway: 100 + 10 + 1 = 111. An 11% increase. This looks manageable.
But if the payment gateway’s error rate is caused by overload, the additional 11 requests increase the load, which increases the error rate. At a 50% error rate:
- First attempt: 100 requests. 50 fail.
- Second attempt: 50 retries. 25 fail.
- Third attempt: 25 retries. 12 fail.
Total requests: 100 + 50 + 25 = 175. A 75% increase. The gateway was already overloaded. You just sent 75% more traffic.
With no backoff delay between retries, all retries arrive within milliseconds of the failures. The load spike is concentrated.
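The arithmetic above generalizes to any error rate and attempt count. The following is a minimal sketch (the class and method names are made up for illustration) that assumes a constant error rate across attempts, which is optimistic: under real overload the error rate climbs as the retries arrive.

```java
/** Expected load on a dependency when every failed call is retried. */
final class RetryLoad {

    /**
     * Sums the requests sent on each attempt: n + n*p + n*p^2 + ...
     *
     * @param initialRequests number of original calls
     * @param errorRate       fraction of calls that fail, between 0.0 and 1.0
     * @param maxAttempts     total attempts per call, including the first
     */
    static double expectedRequests(int initialRequests, double errorRate, int maxAttempts) {
        double total = 0;
        double sentThisAttempt = initialRequests;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            total += sentThisAttempt;
            sentThisAttempt *= errorRate; // only the failures are sent again
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(expectedRequests(100, 0.10, 3)); // the 10% scenario
        System.out.println(expectedRequests(100, 0.50, 3)); // the 50% scenario
    }
}
```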
The Internals: From Scratch
// FROM SCRATCH - Retry with exponential backoff and full jitter
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff<T> {
    private final int maxAttempts;
    private final Duration initialInterval;
    private final double multiplier;
    private final Duration maxInterval;
    private final Set<Class<? extends Exception>> retryableExceptions;

    public RetryWithBackoff(int maxAttempts, Duration initialInterval,
                            double multiplier, Duration maxInterval,
                            Set<Class<? extends Exception>> retryableExceptions) {
        this.maxAttempts = maxAttempts;
        this.initialInterval = initialInterval;
        this.multiplier = multiplier;
        this.maxInterval = maxInterval;
        this.retryableExceptions = retryableExceptions;
    }

    // Callable rather than Supplier, so the call may throw checked
    // exceptions such as IOException.
    public T execute(Callable<T> call) throws Exception {
        Exception lastException = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                lastException = e;
                if (!isRetryable(e)) {
                    throw e; // Non-retryable exception, fail immediately
                }
                if (attempt == maxAttempts) {
                    break; // Last attempt, do not sleep
                }
                Duration backoff = calculateBackoff(attempt);
                // Full jitter: random value between 0 and the calculated backoff
                // This decorrelates retry timing across multiple callers
                // ThreadLocalRandom.current() must be called on the thread that
                // uses it; caching it in a field shares one generator across threads.
                long jitteredDelay = ThreadLocalRandom.current()
                        .nextLong(0, backoff.toMillis() + 1);
                try {
                    Thread.sleep(jitteredDelay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Retry interrupted", ie);
                }
            }
        }
        throw lastException;
    }

    private Duration calculateBackoff(int attempt) {
        // Exponential: initialInterval * multiplier^(attempt-1)
        // Capped at maxInterval
        double delay = initialInterval.toMillis() * Math.pow(multiplier, attempt - 1);
        long cappedDelay = Math.min((long) delay, maxInterval.toMillis());
        return Duration.ofMillis(cappedDelay);
    }

    private boolean isRetryable(Exception e) {
        return retryableExceptions.stream()
                .anyMatch(clazz -> clazz.isInstance(e));
    }
}
Usage against the payment gateway:
// FROM SCRATCH - Usage
RetryWithBackoff<PaymentConfirmation> retry = new RetryWithBackoff<>(
        3,                                  // 3 attempts total (1 original + 2 retries)
        Duration.ofMillis(200),             // initial delay: 200ms
        2.0,                                // multiplier: backoff ceilings of 200ms, then 400ms
                                            // (3 attempts means only 2 waits)
        Duration.ofSeconds(2),              // max delay cap
        Set.of(IOException.class,           // network errors
               HttpServerErrorException.class) // 5xx responses
);
PaymentConfirmation confirmation = retry.execute(() ->
        paymentGateway.charge(request)      // each call has its own 5s timeout
);
What the Scratch Implementation Reveals
The diagram shows three retry strategies. Fixed delay (red, dashed) retries at the same interval every time. When 1,000 clients all fail at the same moment and retry after exactly 1 second, 1,000 retries arrive simultaneously. This is the thundering herd. Exponential backoff without jitter (orange) spaces successive retries further apart, but all clients that failed at the same moment still wait identical delays (1s, 2s, 4s, 8s) and therefore retry at the same instants. They are synchronized. Exponential backoff with full jitter (green dots) randomizes each retry delay between 0 and the exponential value. The retries spread out. Clients are no longer synchronized, and the load on the recovering dependency is distributed over time.
Full jitter is mandatory, not optional. Without jitter, exponential backoff concentrates retries at fixed instants. With 1,000 concurrent failures and a 1-second base, up to 1,000 retries arrive together after the 1s delay, those that fail again arrive together 2s later, and the next round 4s after that. The bursts are further apart, but they are still bursts. Full jitter smears each burst across its entire backoff window.
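A small simulation makes the difference visible. This is a sketch under stated assumptions rather than a benchmark: the class name, the 100 ms bucket width, the 30s cap, and the fixed seed are all arbitrary choices for illustration. It places 1,000 clients on their second retry (2s backoff from a 1s base) and counts the most retries that land in any single 100 ms window.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Compares retry clustering with and without full jitter. */
final class JitterComparison {

    /** Exponential backoff: base * multiplier^(attempt-1), capped. */
    static long exponentialBackoffMillis(long baseMillis, double multiplier,
                                         int attempt, long capMillis) {
        double raw = baseMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(raw, (double) capMillis);
    }

    /** Full jitter: a uniformly random delay in [0, backoffMillis]. */
    static long fullJitterMillis(long backoffMillis, Random rng) {
        return (long) (rng.nextDouble() * (backoffMillis + 1));
    }

    /** Largest number of retries that fall into one bucketMillis-wide window. */
    static int peakPerBucket(long[] delaysMillis, long bucketMillis) {
        Map<Long, Integer> counts = new HashMap<>();
        int peak = 0;
        for (long delay : delaysMillis) {
            int count = counts.merge(delay / bucketMillis, 1, Integer::sum);
            peak = Math.max(peak, count);
        }
        return peak;
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are repeatable
        int clients = 1_000;
        long backoff = exponentialBackoffMillis(1_000, 2.0, 2, 30_000);
        long[] synced = new long[clients];
        long[] jittered = new long[clients];
        for (int i = 0; i < clients; i++) {
            synced[i] = backoff;                         // everyone waits the same 2s
            jittered[i] = fullJitterMillis(backoff, rng); // each waits somewhere in 0..2s
        }
        System.out.println("no jitter, peak per 100 ms:   " + peakPerBucket(synced, 100));
        System.out.println("full jitter, peak per 100 ms: " + peakPerBucket(jittered, 100));
    }
}
```

Without jitter the peak equals the full client count. With full jitter the same retries are spread across roughly twenty windows, so the peak the dependency sees is a small fraction of that.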
The retryable exception set is a safety boundary. Retrying a 400 Bad Request is pointless: the request is malformed and will fail again. Retrying a 401 Unauthorized is pointless and potentially dangerous: it hammers the auth service. Retrying a 409 Conflict is dangerous: the operation may have partially succeeded. Only retry exceptions that indicate a transient failure that is likely to resolve on the next attempt.
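When the decision is made on the raw status code rather than an exception type, the same rule can be written as a single predicate. This is a deliberately coarse sketch with a hypothetical class name. It follows the policy used in this section (5xx is treated as transient, 4xx never is) and makes no attempt at finer distinctions within those ranges.

```java
/** Coarse retry classification by HTTP status, following this section's policy. */
final class RetryableStatus {

    /** Returns true only for 5xx responses; 4xx and everything else is not retried. */
    static boolean isRetryable(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }

    public static void main(String[] args) {
        for (int status : new int[] {400, 401, 409, 500, 503}) {
            System.out.println(status + " retryable: " + isRetryable(status));
        }
    }
}
```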
The Production Implementation
# PRODUCTION - application.yml
resilience4j:
  retry:
    instances:
      paymentGateway:
        max-attempts: 3
        # Total attempts including the first call.
        # 3 means: 1 original + 2 retries.
        wait-duration: 200ms
        # Base wait duration before first retry.
        # Short enough that the total retry time stays within the
        # caller's timeout budget.
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2.0
        # Wait durations: ~200ms, ~400ms (with jitter applied on top)
        enable-randomized-wait: true
        randomized-wait-factor: 0.5
        # Adds random jitter of +/- 50% to each wait duration.
        # 200ms base becomes 100-300ms. 400ms becomes 200-600ms.
        # Note: this is a narrower spread than the full jitter
        # (0 to backoff) used in the from-scratch version.
        retry-exceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        # Retry on network errors, timeouts, and 5xx responses.
        ignore-exceptions:
          - org.springframework.web.client.HttpClientErrorException
        # Never retry 4xx responses. They are not transient.
        result-predicate: com.txn.payment.predicate.RetryOnServerError
        # Custom predicate class to retry on specific response conditions
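Whether this configuration "stays within the caller's timeout budget" can be checked with simple arithmetic. The sketch below uses a hypothetical helper, not a Resilience4j API. It assumes the 5s per-call timeout mentioned in the from-scratch usage and the +/- 50% randomization described in the comments above, and it adds up the slowest possible path: every attempt times out and every wait lands at the top of its range.

```java
import java.time.Duration;

/** Worst-case wall-clock time for a retried call. */
final class RetryBudget {

    static Duration worstCase(int maxAttempts, Duration callTimeout, Duration baseWait,
                              double multiplier, double randomizationFactor) {
        // Every attempt runs until its timeout fires.
        long totalMillis = callTimeout.toMillis() * maxAttempts;
        double wait = baseWait.toMillis();
        // There is one wait between consecutive attempts, so maxAttempts - 1 waits.
        for (int retry = 1; retry < maxAttempts; retry++) {
            totalMillis += Math.round(wait * (1 + randomizationFactor)); // top of the jitter range
            wait *= multiplier;
        }
        return Duration.ofMillis(totalMillis);
    }

    public static void main(String[] args) {
        Duration worst = worstCase(3, Duration.ofSeconds(5), Duration.ofMillis(200), 2.0, 0.5);
        System.out.println("Worst case: " + worst.toMillis() + " ms");
    }
}
```

With these inputs the waits contribute under a second. The per-call timeout dominates, so it is the timeout, not the backoff, that has to be sized against the caller's budget.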
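// PRODUCTION - Retry with annotation
@Service
public class PaymentGatewayService {
    private final PaymentGatewayClient gatewayClient;

    public PaymentGatewayService(PaymentGatewayClient gatewayClient) {
        this.gatewayClient = gatewayClient;
    }

    @io.github.resilience4j.retry.annotation.Retry(
            name = "paymentGateway", fallbackMethod = "paymentFallback")
    public PaymentConfirmation charge(PaymentRequest request) {
        return gatewayClient.charge(request);
    }

    // No silent fallback for payments. If the gateway is down after retries,
    // the payment fails and the user is notified.
    private PaymentConfirmation paymentFallback(PaymentRequest request, Throwable cause) {
        throw new PaymentProcessingException(
                "Payment failed after 3 attempts: " + cause.getMessage(), cause);
    }

    // More specific overload: 4xx responses are ignored by the retry and were
    // attempted only once, so rethrow them unchanged instead of reporting
    // "after 3 attempts".
    private PaymentConfirmation paymentFallback(PaymentRequest request,
                                                HttpClientErrorException cause) {
        throw cause;
    }
}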
The Test
// PRODUCTION - Test proving retry counts and non-retryable handling
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@Testcontainers
class RetryIntegrationTest {
    @Container
    static GenericContainer<?> wireMock = new GenericContainer<>(
            DockerImageName.parse("wiremock/wiremock:latest"))
            .withExposedPorts(8080);

    @DynamicPropertySource
    static void configureProperties(DynamicPropertyRegistry registry) {
        // getHost() instead of a hard-coded "localhost": the Docker host
        // is not always the local machine.
        registry.add("payment.gateway.url", () ->
                "http://" + wireMock.getHost() + ":" + wireMock.getMappedPort(8080));
    }

    @Autowired
    private PaymentGatewayService gatewayService;

    @Test
    void retries_thenSucceeds() {
        // First two calls return 503, third succeeds
        stubSequentialResponses(503, 503, 200);
        PaymentConfirmation result = gatewayService.charge(samplePayment());
        assertThat(result).isNotNull();
        // Verify WireMock received exactly 3 requests
        assertThat(getRequestCount("/api/charge")).isEqualTo(3);
    }

    @Test
    void allRetriesFail_fallbackInvoked() {
        stubAllFailures(503);
        assertThatThrownBy(() -> gatewayService.charge(samplePayment()))
                .isInstanceOf(PaymentProcessingException.class)
                .hasMessageContaining("after 3 attempts");
        assertThat(getRequestCount("/api/charge")).isEqualTo(3);
    }

    @Test
    void clientError_notRetried() {
        stubAllFailures(400); // Bad Request - not retryable
        assertThatThrownBy(() -> gatewayService.charge(samplePayment()))
                .isInstanceOf(HttpClientErrorException.class);
        // Should have only 1 request - no retries for 4xx
        assertThat(getRequestCount("/api/charge")).isEqualTo(1);
    }
}
The Observable Signal
# Retry metrics
resilience4j_retry_calls_total{name="paymentGateway", kind="successful_without_retry"}
resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}
resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}
resilience4j_retry_calls_total{name="paymentGateway", kind="failed_without_retry"}
The metric that deserves a Grafana panel: successful_with_retry / (successful_without_retry + successful_with_retry). This is the “retry-assisted success rate.” When it is 0%, retries are not needed. When it is 5%, retries are earning their keep: 5% of successful requests needed at least one retry. When it is 40%, the dependency is unreliable and retries are masking a problem that needs investigation.
Alert when the failed_with_retry rate exceeds 1% sustained over 5 minutes. This means calls are exhausting their retries and still failing, and the dependency requires attention.
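Both ratios are straightforward to compute from the four counters. The sketch below uses hypothetical names and takes the counter increases over the evaluation window as plain numbers. Measuring the exhausted-retry rate as a share of all calls is an assumption, since the alert rule above does not name its denominator.

```java
/** Ratios derived from the four retry call counters over one time window. */
final class RetryRatios {

    /** Share of successful calls that needed at least one retry. */
    static double retryAssistedSuccessRate(double successfulWithoutRetry,
                                           double successfulWithRetry) {
        double successes = successfulWithoutRetry + successfulWithRetry;
        return successes == 0 ? 0.0 : successfulWithRetry / successes;
    }

    /** Share of all calls that failed after using up their retries (assumed denominator). */
    static double exhaustedRetryRate(double successfulWithoutRetry, double successfulWithRetry,
                                     double failedWithRetry, double failedWithoutRetry) {
        double total = successfulWithoutRetry + successfulWithRetry
                + failedWithRetry + failedWithoutRetry;
        return total == 0 ? 0.0 : failedWithRetry / total;
    }

    public static void main(String[] args) {
        // Example window: 950 clean successes, 50 retry-assisted, 20 exhausted, 0 non-retryable.
        System.out.println(retryAssistedSuccessRate(950, 50));
        System.out.println(exhaustedRetryRate(950, 50, 20, 0) > 0.01 ? "ALERT" : "ok");
    }
}
```

The zero checks matter in practice: a quiet window with no calls should read as 0%, not as a division error that blanks the panel.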