Timeout Discipline
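Every network call that does not have an explicit timeout is a thread leak waiting to happen. With no timeout configured, the only limits are the operating system’s: on most Linux distributions a connection attempt to an unresponsive host is retried for about two minutes before it fails, and a read on an established connection that never receives data has no default limit at all. That means an HTTP call with no timeout configured will hold a thread for two minutes or more if the remote service accepts the connection but never responds. Two minutes per thread. At 100 requests per second, a pool of 200 threads is gone in two seconds.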
Timeouts are not a resilience pattern. They are the prerequisite for every resilience pattern. A circuit breaker cannot protect you if the calls it wraps take two minutes to fail. A bulkhead’s thread pool fills up with slow calls instead of fast rejections. A retry policy triggers additional slow calls that each hold threads for minutes.
No timeout, no resilience. Start here.
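The arithmetic behind that claim is worth making explicit. The following is a minimal sketch, assuming a pool of 200 threads (the pool size is an assumption for illustration; the class and method names are hypothetical). It applies Little's law: threads busy in steady state equal the arrival rate multiplied by the time each call holds a thread.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of any library.
final class PoolExhaustion {

    /** Seconds until every thread in the pool is occupied when each incoming call hangs. */
    static double secondsToExhaustion(int poolSize, double requestsPerSecond) {
        return poolSize / requestsPerSecond;
    }

    /** Little's law: busy threads = arrival rate x time each call holds a thread. */
    static long threadsNeeded(double requestsPerSecond, Duration holdTime) {
        return (long) Math.ceil(requestsPerSecond * holdTime.toMillis() / 1000.0);
    }

    public static void main(String[] args) {
        // Assumed pool of 200 threads at 100 requests per second.
        System.out.println(secondsToExhaustion(200, 100));               // prints 2.0
        // Threads needed to absorb calls that hang for two minutes.
        System.out.println(threadsNeeded(100, Duration.ofSeconds(120))); // prints 12000
        // Threads needed when a 2-second read timeout bounds every call.
        System.out.println(threadsNeeded(100, Duration.ofSeconds(2)));   // prints 200
    }
}
```

With a two-minute hold time no realistic pool survives; with a 2-second timeout the same load fits in the assumed pool, with no headroom to spare.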
The Three Timeouts
Every HTTP call involves three distinct timeout surfaces:
Connection timeout. How long to wait for the TCP handshake to complete. If the remote host is unreachable (a firewall silently drops packets, or DNS resolves to an address where nothing answers), this timeout determines how long you wait to discover that fact. Set it short: 1-3 seconds. A healthy service accepts TCP connections in single-digit milliseconds. If you are waiting more than a second for a TCP handshake, the service is either down or so overloaded that your request will not be processed quickly anyway.
Read timeout (socket timeout). How long to wait for data after the connection is established. This covers the scenario where the remote service accepts the connection, receives your request, and then takes a long time to produce a response. This is the timeout that prevents thread pool exhaustion from slow dependencies.
Overall timeout (request timeout). The total time budget for the entire operation, including connection, TLS handshake, request sending, waiting for the response, and reading the response body. This is the timeout you actually care about from the caller’s perspective.
// Spring Boot RestClient configuration with explicit timeouts
// PRODUCTION
import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.ClientHttpRequestFactory;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

@Configuration
public class RestClientConfig {

    @Bean
    public RestClient fraudDetectionRestClient() {
        return RestClient.builder()
            .baseUrl("http://fraud-detection:8080")
            .requestFactory(clientHttpRequestFactory(
                Duration.ofSeconds(1), // connection timeout
                Duration.ofSeconds(2)  // read timeout
            ))
            .build();
    }

    @Bean
    public RestClient balanceRestClient() {
        return RestClient.builder()
            .baseUrl("http://balance-service:8080")
            .requestFactory(clientHttpRequestFactory(
                Duration.ofSeconds(1), // connection timeout
                Duration.ofSeconds(1)  // read timeout: balance is fast or broken
            ))
            .build();
    }

    @Bean
    public RestClient paymentGatewayRestClient() {
        return RestClient.builder()
            .baseUrl("https://api.payment-processor.com")
            .requestFactory(clientHttpRequestFactory(
                Duration.ofSeconds(2), // connection timeout: external, allow more
                Duration.ofSeconds(5)  // read timeout: payment processing takes time
            ))
            .build();
    }

    // Builds a request factory with both socket-level timeouts set explicitly.
    private ClientHttpRequestFactory clientHttpRequestFactory(
            Duration connectTimeout, Duration readTimeout) {
        var factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(connectTimeout);
        factory.setReadTimeout(readTimeout);
        return factory;
    }
}
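Every RestClient has explicit connection and read timeouts. No defaults are relied upon. Where a value is a deliberate choice for that specific dependency, the comment says why. Note that SimpleClientHttpRequestFactory exposes only these two settings; it has no overall request timeout, so the total budget has to be enforced separately.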
Timeout Layering
The diagram shows the critical rule for timeout configuration in a service call chain: outer timeouts must always be greater than inner timeouts plus processing overhead. When the API gateway has a 30-second timeout but the innermost service has no timeout at all, a single hanging call in the deepest layer holds a thread in every layer above it until the gateway gives up, and the inner threads stay blocked even after that. The correct configuration uses decreasing timeouts as you move inward: gateway at 8 seconds, payment at 5 seconds, fraud at 2 seconds. This ensures the innermost call fails first, giving each layer time to handle the failure and respond.
The rule: timeout(caller) > timeout(callee) + processing_overhead. Always.
If the payment service sets a 5-second timeout on fraud detection, and fraud detection sets a 3-second timeout on the external scoring API, the payment service’s timeout must account for the time fraud detection spends on its own logic beyond the scoring call. If fraud detection needs 200ms for request parsing, rule evaluation, and response construction, the payment service’s timeout for the fraud detection call must be greater than 3.2 seconds, which 5 seconds satisfies. Setting it to exactly 3 seconds means the payment service could time out while fraud detection is constructing a valid response from a successful scoring call.
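The rule is mechanical enough to check in code, for example in a startup check or a configuration test. The sketch below is one possible way to do it; `TimeoutChain` and `Hop` are hypothetical names, and the 100ms overhead assigned to the payment service is an assumed value. Each hop records the timeout a service applies to its downstream call and the processing overhead that service adds itself.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for illustration; names and structure are assumptions.
final class TimeoutChain {

    /** A service in the chain: the timeout it sets on its downstream call, and its own processing overhead. */
    record Hop(String name, Duration timeout, Duration overhead) {}

    /**
     * Checks timeout(caller) > timeout(callee) + processing_overhead(callee)
     * for each adjacent pair. Hops are ordered outermost first.
     * Returns a "caller -> callee" label for every pair that breaks the rule.
     */
    static List<String> violations(List<Hop> hops) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i + 1 < hops.size(); i++) {
            Hop caller = hops.get(i);
            Hop callee = hops.get(i + 1);
            Duration required = callee.timeout().plus(callee.overhead());
            if (caller.timeout().compareTo(required) <= 0) {
                problems.add(caller.name() + " -> " + callee.name());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Hop fraud = new Hop("fraud", Duration.ofSeconds(3), Duration.ofMillis(200));
        // 5s > 3s + 200ms: the rule holds.
        System.out.println(violations(List.of(
            new Hop("payment", Duration.ofSeconds(5), Duration.ofMillis(100)), fraud)));
        // 3s is not greater than 3.2s: the pair is reported.
        System.out.println(violations(List.of(
            new Hop("payment", Duration.ofSeconds(3), Duration.ofMillis(100)), fraud)));
    }
}
```

The comparison is strict on purpose: a caller timeout exactly equal to the callee's timeout plus overhead still races the callee's response.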
The From-Scratch Timeout Wrapper
Before configuring any library, understand what a timeout does at the thread level.
// FROM SCRATCH - Timeout wrapper using CompletableFuture
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class TimeoutWrapper {

    private final ExecutorService executor;
    private final Duration timeout;

    public TimeoutWrapper(ExecutorService executor, Duration timeout) {
        this.executor = executor;
        this.timeout = timeout;
    }

    /**
     * Executes the given supplier with a timeout.
     * If the supplier does not complete within the timeout,
     * throws TimeoutException. The underlying thread continues
     * executing until the supplier completes on its own.
     *
     * This is the fundamental problem with timeout wrappers:
     * they do not cancel the work, they abandon the result.
     * The thread is still occupied.
     */
    public <T> T execute(Supplier<T> supplier) throws TimeoutException {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(supplier, executor);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled. CompletableFuture ignores the
            // mayInterruptIfRunning argument, so the worker is not interrupted.
            future.cancel(true);
            throw new TimeoutException(
                "Operation timed out after " + timeout.toMillis() + "ms");
        } catch (ExecutionException e) {
            throw new RuntimeException("Operation failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new RuntimeException("Operation interrupted", e);
        }
    }
}
What the Scratch Implementation Reveals
The from-scratch implementation exposes a critical limitation: timeouts in Java do not cancel the underlying work.
When future.get(timeout, unit) throws TimeoutException, the wrapper calls CompletableFuture.cancel(true). That call marks the future as cancelled, but it does not interrupt the executing thread: CompletableFuture ignores the mayInterruptIfRunning argument. Even with a plain Future from ExecutorService.submit(), where cancel(true) does set the interrupt flag, a thread blocked on a socket read (which it is, when waiting for an HTTP response) is unaffected. The InputStream.read() call does not check the interrupt flag. The thread remains blocked until:
- The remote server sends a response
- The socket’s read timeout fires (if configured)
- The operating system gives up on the connection (120+ seconds, and possibly never if the peer stays silent and TCP keepalive is off)
This is why socket-level timeouts (read timeout on the HTTP client) are mandatory. The timeout wrapper gives you a fast return to the caller, but it does not free the thread. Only the socket timeout frees the thread.
A timeout wrapper without socket timeouts is a lie. You get a TimeoutException in your calling code, but the thread is still blocked. Your thread pool is still filling up. You have hidden the problem from the caller without solving it.
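The effect can be observed without any network at all. The following is a minimal sketch, not a benchmark: a latch wait that ignores interrupts stands in for a socket read with no read timeout, and the class and method names are hypothetical. It times out the caller, cancels the future, and then asks the pool how many workers are still busy.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class for illustration only.
final class AbandonedThreadDemo {

    /** Returns how many pool threads are still busy right after the caller has timed out. */
    static int busyWorkersAfterTimeout(Duration timeout) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch neverAnswers = new CountDownLatch(1);
        try {
            CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> {
                awaitIgnoringInterrupts(neverAnswers); // stand-in for a blocked socket read
                return "late response";
            }, pool);
            try {
                future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // the caller moves on; the worker does not
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
            return pool.getActiveCount();
        } finally {
            neverAnswers.countDown(); // release the worker so the demo can exit
            pool.shutdown();
        }
    }

    // Blocks until the latch opens, swallowing interrupts the way a blocking socket read does.
    private static void awaitIgnoringInterrupts(CountDownLatch latch) {
        boolean interrupted = false;
        while (true) {
            try {
                latch.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // The caller gets control back after 100ms, yet one worker is still occupied.
        System.out.println(busyWorkersAfterTimeout(Duration.ofMillis(100)));
    }
}
```

The caller sees a fast timeout, but the single pool thread stays occupied until the simulated response arrives. In a real client, the socket read timeout is what ends that wait.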
Timeout Budget Propagation
In a service call chain, each service should know how much time remains in the overall budget. If the API gateway gives the payment service 8 seconds, and the payment service has already spent 3 seconds on fraud detection, only 5 seconds remain for the rest of the call chain.
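// FROM SCRATCH - Deadline propagation via HTTP header
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import org.springframework.stereotype.Component;

@Component
public class DeadlinePropagationFilter implements Filter {

    private static final String DEADLINE_HEADER = "X-Request-Deadline";
    private static final ThreadLocal<Instant> requestDeadline = new ThreadLocal<>();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain chain) throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String deadlineHeader = httpRequest.getHeader(DEADLINE_HEADER);

        // No usable deadline from upstream: set our own
        Instant deadline = Instant.now().plusSeconds(8);
        if (deadlineHeader != null) {
            try {
                // Absolute epoch-millisecond deadline; assumes service clocks are synchronized
                deadline = Instant.ofEpochMilli(Long.parseLong(deadlineHeader));
            } catch (NumberFormatException e) {
                // Malformed header: keep the local default instead of failing the request
            }
        }
        requestDeadline.set(deadline);

        try {
            chain.doFilter(request, response);
        } finally {
            // Always clear the ThreadLocal so pooled threads do not leak a stale deadline
            requestDeadline.remove();
        }
    }

    public static Duration remainingBudget() {
        Instant deadline = requestDeadline.get();
        if (deadline == null) {
            return Duration.ofSeconds(5); // conservative default
        }
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            return Duration.ZERO;
        }
        return remaining;
    }
}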
The calling code checks the remaining budget before making a downstream call:
// PRODUCTION - Using deadline propagation in the fraud detection client
import java.time.Duration;
import java.time.Instant;

import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestClient;

@Component
public class FraudDetectionClient {

    private final RestClient restClient;

    public FraudDetectionClient(@Qualifier("fraudDetectionRestClient") RestClient restClient) {
        this.restClient = restClient;
    }

    public FraudScore score(PaymentRequest request) {
        Duration remaining = DeadlinePropagationFilter.remainingBudget();
        if (remaining.compareTo(Duration.ofMillis(500)) < 0) {
            // Less than 500ms remaining in the budget.
            // Fraud detection p50 is 40ms, but under load it could be slower.
            // Making the call with insufficient budget wastes a thread.
            throw new InsufficientBudgetException(
                "Only " + remaining.toMillis() + "ms remaining, skipping fraud check");
        }
        return restClient.post()
            .uri("/api/fraud/score")
            // Forward the absolute deadline so the downstream service can budget too
            .header("X-Request-Deadline",
                String.valueOf(Instant.now().plus(remaining).toEpochMilli()))
            .body(request)
            .retrieve()
            .body(FraudScore.class);
    }
}
If the remaining budget is too small for a downstream call to complete, skip the call. Do not start work you cannot finish. The thread, the network bandwidth, and the downstream service’s resources are all wasted on a request whose result will be discarded.
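The same budget can also cap the timeout applied to the call itself, so a downstream call never gets more time than the chain has left. The sketch below is one way to express that decision as a pure function; `CallBudget`, `timeoutFor`, and the idea of a `reserve` held back for the caller's own response handling are assumptions for illustration, not part of the filter shown earlier.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper for illustration; names and the reserve concept are assumptions.
final class CallBudget {

    /**
     * Decides the timeout for one downstream call.
     *
     * @param remaining     time left in the overall request budget
     * @param configured    the client's normal timeout for this dependency
     * @param reserve       time held back for the caller's own processing after the call
     * @param minimumUseful below this, the call is not worth starting
     * @return the timeout to apply, or empty if the call should be skipped
     */
    static Optional<Duration> timeoutFor(Duration remaining, Duration configured,
                                         Duration reserve, Duration minimumUseful) {
        Duration available = remaining.minus(reserve);
        if (available.compareTo(minimumUseful) < 0) {
            return Optional.empty(); // do not start work that cannot finish
        }
        // Never exceed the configured timeout, never exceed what is left.
        return Optional.of(available.compareTo(configured) < 0 ? available : configured);
    }

    public static void main(String[] args) {
        Duration configured = Duration.ofSeconds(2);
        Duration reserve = Duration.ofMillis(200);
        Duration minimum = Duration.ofMillis(500);
        // 8s budget with 3s already spent: plenty left, use the configured 2s.
        System.out.println(timeoutFor(Duration.ofSeconds(5), configured, reserve, minimum));
        // 1.2s left: the call is capped at 1s.
        System.out.println(timeoutFor(Duration.ofMillis(1200), configured, reserve, minimum));
        // 400ms left: skip the call.
        System.out.println(timeoutFor(Duration.ofMillis(400), configured, reserve, minimum));
    }
}
```

Applying the returned value requires a client whose timeout can be set per request, which depends on the HTTP client in use.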
Testing Timeout Behavior
// PRODUCTION - Testcontainers test verifying timeout behavior
import static org.assertj.core.api.Assertions.assertThat;

import java.math.BigDecimal;
import java.time.Duration;
import java.util.Map;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.springframework.web.client.RestTemplate;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@Testcontainers
class TimeoutTest {

    @Container
    static GenericContainer<?> mockFraudService = new GenericContainer<>(
            DockerImageName.parse("wiremock/wiremock:latest"))
        .withExposedPorts(8080);

    @DynamicPropertySource
    static void configureProperties(DynamicPropertyRegistry registry) {
        registry.add("fraud.service.url", TimeoutTest::wireMockUrl);
    }

    @Autowired
    private TestRestTemplate restTemplate;

    @Test
    void fraudServiceTimeout_returnsErrorWithinBudget() {
        // Configure WireMock to delay 10 seconds (longer than our 2s timeout)
        // WireMock stub configuration done via HTTP API
        stubFraudServiceWithDelay(Duration.ofSeconds(10));

        long start = System.nanoTime();
        ResponseEntity<String> response = restTemplate.postForEntity(
            "/api/payments", samplePayment(), String.class);
        long elapsed = Duration.ofNanos(System.nanoTime() - start).toMillis();

        // The response should come back within ~2 seconds (the fraud timeout)
        // not 10 seconds (the WireMock delay)
        assertThat(elapsed).isLessThan(3000);
        // The response should indicate a timeout, not a success
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.GATEWAY_TIMEOUT);
    }

    // Uses the container's host rather than assuming localhost
    private static String wireMockUrl() {
        return "http://" + mockFraudService.getHost() + ":" + mockFraudService.getMappedPort(8080);
    }

    private void stubFraudServiceWithDelay(Duration delay) {
        // WireMock API call to create a stub with fixed delay
        new RestTemplate().postForEntity(
            wireMockUrl() + "/__admin/mappings",
            Map.of(
                "request", Map.of("method", "POST", "url", "/api/fraud/score"),
                "response", Map.of(
                    "status", 200,
                    "fixedDelayMilliseconds", delay.toMillis(),
                    "jsonBody", Map.of("score", 0.5, "approved", true)
                )
            ),
            String.class
        );
    }

    private PaymentRequest samplePayment() {
        return new PaymentRequest("user-1", BigDecimal.valueOf(100.00), "USD");
    }
}
This test proves that the payment service respects its fraud detection timeout. The WireMock container delays its response for 10 seconds. The payment service’s 2-second timeout fires. The test verifies that the response arrives within 3 seconds (2-second timeout plus margin) and returns an error status. If this test takes 10 seconds, your timeout is not working.