Retry Budgets and System-Level Retry Control
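Per-client retry configuration is necessary but not sufficient, because each instance decides to retry without knowing what the others are doing. If three instances each retry a failed request 3 times, a single downstream failure produces 9 retries on top of the 3 original requests. A retry budget caps the total retry rate across the system at a fixed percentage of original traffic.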
The Concept
A retry budget says: “the total number of retries across all instances must not exceed 20% of original traffic.” If the service processes 1,000 requests per second, the maximum retry rate is 200 retries per second. If the first instance has already used 150 of those retries, only 50 remain for the second instance. Once the budget is exhausted, further failures are not retried; they fail immediately.
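The percentage rule can be sketched directly as a pair of counters. This is a minimal illustration, not an established API: `RatioRetryBudget`, `recordRequest`, and `tryAcquireRetry` are hypothetical names, and the sketch assumes counters that live inside one process and are never reset.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: allows a retry only while retries stay within
// retryPercent of the original requests seen so far.
final class RatioRetryBudget {
    private final long retryPercent;                      // e.g. 20 for a 20% budget
    private final AtomicLong originals = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    RatioRetryBudget(long retryPercent) {
        this.retryPercent = retryPercent;
    }

    /** Call once for every original (non-retry) request. */
    void recordRequest() {
        originals.incrementAndGet();
    }

    /** Returns true, and counts the retry, only if the budget still has room. */
    boolean tryAcquireRetry() {
        while (true) {
            long used = retries.get();
            long allowed = originals.get() * retryPercent / 100;
            if (used >= allowed) {
                return false;                             // budget exhausted
            }
            if (retries.compareAndSet(used, used + 1)) {
                return true;                              // this thread claimed the slot
            }
            // Another thread changed the counter; re-check against the budget.
        }
    }
}
```

With 1,000 recorded requests and a 20% budget, the first 200 calls to `tryAcquireRetry()` succeed and later calls return false. A production version would count over a sliding time window, and sharing the counters across instances would need an external store; both are left out here.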
// FROM SCRATCH - Local retry budget using a token bucket
import java.util.concurrent.atomic.AtomicLong;

public class RetryBudget {
    private final AtomicLong tokens;
    private final long maxTokens;
    private final double refillRate;
    // AtomicLong (not volatile long) so only one thread can claim a refill interval
    private final AtomicLong lastRefillTimestamp;

    /**
     * @param maxRetries Maximum retry tokens available (burst capacity)
     * @param refillPerSecond Tokens added per second (based on expected traffic rate)
     */
    public RetryBudget(long maxRetries, double refillPerSecond) {
        this.tokens = new AtomicLong(maxRetries);
        this.maxTokens = maxRetries;
        this.refillRate = refillPerSecond;
        this.lastRefillTimestamp = new AtomicLong(System.nanoTime());
    }

    /** Takes one retry token; returns false if the budget is exhausted. */
    public boolean tryAcquire() {
        refill();
        // getAndUpdate returns the previous value, so > 0 means a token was taken
        return tokens.getAndUpdate(t -> t > 0 ? t - 1 : 0) > 0;
    }

    private void refill() {
        long now = System.nanoTime();
        long last = lastRefillTimestamp.get();
        long newTokens = (long) ((now - last) / 1_000_000_000.0 * refillRate);
        // The compare-and-set stops concurrent callers from crediting the same interval twice
        if (newTokens > 0 && lastRefillTimestamp.compareAndSet(last, now)) {
            tokens.updateAndGet(t -> Math.min(t + newTokens, maxTokens));
        }
    }
}
Usage in the retry logic:
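// FROM SCRATCH - Retry with budget check
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class BudgetedRetryService {
    private final RetryBudget retryBudget;

    public BudgetedRetryService() {
        // Allow 20 retries per second, burst capacity of 50
        // For a service handling 100 rps, this is a 20% retry budget
        this.retryBudget = new RetryBudget(50, 20.0);
    }

    public <T> T executeWithRetry(Supplier<T> action, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            // Otherwise the loop never runs and lastException would still be null
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastException = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (Exception e) {
                lastException = e;
                if (attempt < maxAttempts) {
                    if (!retryBudget.tryAcquire()) {
                        // Budget exhausted: do not retry, fail immediately
                        throw new RetryBudgetExhaustedException(
                            "Retry budget exhausted after attempt " + attempt, e);
                    }
                    // Budget allows retry, proceed with backoff
                    Thread.sleep(calculateBackoff(attempt));
                }
            }
        }
        throw lastException;
    }

    // Exponential backoff with full jitter: random delay in [0, 200 ms * 2^(attempt - 1)]
    private long calculateBackoff(int attempt) {
        long delay = (long) (200 * Math.pow(2, attempt - 1));
        return ThreadLocalRandom.current().nextLong(0, delay + 1);
    }

    // Minimal application-defined exception so the example compiles
    public static class RetryBudgetExhaustedException extends RuntimeException {
        public RetryBudgetExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}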
The retry budget prevents the degenerate feedback loop in which high error rates trigger high retry rates, which increase load, which increases error rates, which triggers still more retries. The budget caps the amplification regardless of the error rate. At a 100% error rate, a 20% budget means you send at most 120 requests per second in steady state instead of 300 (100 original + 200 retries from 2 retry attempts each). The burst capacity can briefly push the rate above that figure until the stored tokens are spent.
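The figures above can be reproduced with a small calculation. This is a sketch under simplifying assumptions: `RetryLoadEstimate` and its methods are hypothetical names, every attempt is assumed to fail independently with the same probability, and burst capacity is ignored.

```java
// Hypothetical helper that reproduces the steady-state arithmetic; not part of any library.
final class RetryLoadEstimate {
    /** Requests per second sent when every failed attempt is retried up to maxAttempts. */
    static double unbudgetedLoad(double originalRps, double errorRate, int maxAttempts) {
        double load = 0;
        double attemptRps = originalRps;          // traffic entering the current attempt
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            load += attemptRps;
            attemptRps *= errorRate;              // only failures move on to the next attempt
        }
        return load;
    }

    /** Same estimate, with total retries capped at budgetRatio * originalRps. */
    static double budgetedLoad(double originalRps, double errorRate,
                               int maxAttempts, double budgetRatio) {
        double retries = unbudgetedLoad(originalRps, errorRate, maxAttempts) - originalRps;
        return originalRps + Math.min(retries, budgetRatio * originalRps);
    }
}
```

For 100 rps, a 100% error rate, and 3 attempts, `unbudgetedLoad` gives 300 and `budgetedLoad` with a ratio of 0.2 gives 120. At low error rates the retries stay under the cap, so the budget only changes behavior when failures become widespread.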
Resilience4j does not provide a built-in retry budget, which is a gap in the library. The budget must be implemented at the application level, as shown above, or enforced by an infrastructure-level rate limiter on the retry path.