The Mechanics of Thread Pool Exhaustion

Thread pool exhaustion is the kill mechanism in most cascading failures. Understanding exactly how it happens, at the level of individual threads and queue entries, is prerequisite knowledge for every resilience pattern that follows.

How Spring Boot Manages Threads

Spring Boot’s embedded Tomcat server creates a thread pool at startup. The default configuration:

server:
  tomcat:
    threads:
      max: 200 # Maximum worker threads
      min-spare: 10 # Minimum idle threads kept alive
    max-connections: 8192 # Maximum concurrent connections (NIO)
    accept-count: 100 # Queue size when all threads are busy

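When a request arrives, Tomcat assigns it a worker thread from the pool. That thread is occupied for the entire duration of the request. It cannot serve another request until the current one completes. This is the thread-per-request model: the connector uses NIO to manage connections, but a servlet request still blocks its worker thread while it waits on downstream calls.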

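If all 200 threads are busy, new requests wait. Tomcat’s NIO connector keeps accepting connections up to max-connections (8192) and holds them until a worker thread frees up. Once that limit is reached, further connections wait in the operating system’s accept queue (capacity 100). If the accept queue is also full, the connection is refused or dropped. The client sees a connection reset or timeout.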

Little’s Law

The relationship between concurrency, throughput, and latency is described by Little’s Law:

L = λ * W

Where:

L = average number of concurrent requests (threads in use)
λ = average arrival rate (requests per second)
W = average time each request spends in the system (latency in seconds)

For the payment service under normal conditions:

λ = 100 requests/second
W = 0.1 seconds (100ms average response time)
L = 100 * 0.1 = 10 concurrent threads

Ten threads. The pool of 200 is 95% idle. Comfortable.

When fraud detection degrades:

λ = 100 requests/second (unchanged, clients keep sending)
W ≈ 5.1 seconds (fraud detection now takes about 5s instead of 50ms)
L = 100 * 5.1 = 510 concurrent threads

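The pool has 200 threads. You need 510. The deficit is not gradual: with 100 requests arriving every second and none completing for five seconds, the 190 idle threads are consumed in about two seconds.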
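The arithmetic above can be sketched in a few lines of Java. This is only an illustration of Little’s Law with the numbers from this section; the class and method names (LittlesLaw, concurrency, secondsUntilFull) are hypothetical, and the time-to-fill estimate assumes the worst case in which no request completes while the pool fills.

```java
/** Little's Law helpers: L = lambda * W. Names are illustrative only. */
final class LittlesLaw {

    /** Average number of requests in flight for a given arrival rate and latency. */
    static double concurrency(double arrivalsPerSecond, double latencySeconds) {
        return arrivalsPerSecond * latencySeconds;
    }

    /**
     * Seconds until the pool is full, assuming arrivals continue at the same
     * rate and no request completes in the meantime (worst-case assumption
     * for a stalled dependency).
     */
    static double secondsUntilFull(int maxThreads, double busyThreads, double arrivalsPerSecond) {
        return (maxThreads - busyThreads) / arrivalsPerSecond;
    }

    public static void main(String[] args) {
        double normal = concurrency(100, 0.1);   // healthy payment service
        double degraded = concurrency(100, 5.1); // fraud detection degraded
        System.out.println("normal concurrency:   " + Math.round(normal));
        System.out.println("degraded concurrency: " + Math.round(degraded));
        System.out.println("seconds until a pool of 200 is full: "
                + secondsUntilFull(200, normal, 100));
    }
}
```

The useful habit is to run this calculation for every dependency timeout you configure: the worst-case W is whatever your longest timeout allows, not the latency you observe on a good day.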

Detecting Exhaustion Before It Kills You

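Micrometer, which ships with Spring Boot Actuator, exposes Tomcat thread pool metrics once Tomcat’s MBean registry is enabled (server.tomcat.mbeanregistry.enabled=true; it is off by default). The component below adds a utilization gauge on top of them: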

// PRODUCTION - Custom metrics for thread pool monitoring
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class ThreadPoolMetrics {

    private final MeterRegistry registry;

    public ThreadPoolMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    @EventListener(ApplicationReadyEvent.class)
    public void registerMetrics() {
        // Spring Boot Actuator registers these when
        // server.tomcat.mbeanregistry.enabled=true (Prometheus names shown):
        // tomcat_threads_busy_threads - currently executing request threads
        // tomcat_threads_current_threads - total threads (busy + idle)
        // tomcat_threads_config_max_threads - configured maximum

        // Custom gauge: thread pool utilization as a percentage
        Gauge.builder("tomcat.threads.utilization", this, self -> {
            // find() returns null for a missing meter instead of throwing
            Gauge busy = self.registry.find("tomcat.threads.busy").gauge();
            Gauge max = self.registry.find("tomcat.threads.config.max").gauge();
            if (busy == null || max == null || max.value() <= 0) {
                return Double.NaN; // metrics unavailable, report "no data"
            }
            return (busy.value() / max.value()) * 100.0;
        }).description("Thread pool utilization percentage")
          .register(registry);
    }
}

The alert threshold that matters: tomcat_threads_busy_threads / tomcat_threads_config_max_threads > 0.7. At 70% utilization, you have approximately 60 threads left. If a dependency degrades, those 60 threads buy you seconds, not minutes. The alert gives your on-call engineer a chance to act before the pool fills.

# Prometheus alerting rule
groups:
  - name: thread_pool
    rules:
      - alert: ThreadPoolNearExhaustion
        expr: >
          tomcat_threads_busy_threads
          / tomcat_threads_config_max_threads
          > 0.7
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Thread pool utilization above 70% for 30s"
          description: "Service {{ $labels.instance }} is at {{ $value | humanizePercentage }} thread pool utilization"

Do not set this threshold at 90%. At 90%, only 20 threads remain. If a dependency stalls, they are gone almost at once, requests start queuing, and you have single-digit seconds before user-visible impact.

The Accept Queue Trap

When all 200 threads are busy, requests queue. Tomcat holds already-accepted connections (up to max-connections) until a thread frees up, and beyond that limit new connections wait in the accept queue, which has a default capacity of 100. Either way, the client’s request is accepted at the TCP level (the TCP handshake completes, the connection is established), but no thread is available to process it. The request sits in the queue, consuming a connection slot, contributing nothing.

From the client’s perspective, the connection was accepted. The client is now waiting for a response. The client’s timeout clock is running. If the client has a 5-second timeout and the request sits in the accept queue for 4 seconds before a thread becomes available, only 1 second remains for actual processing.

If a request sits in the accept queue for longer than the client’s timeout, the client gives up and sends a new request (or the user retries). That new request enters the queue behind the expired one. The thread that eventually picks up the expired request does the work, produces a response, and sends it to a client that already disconnected. The thread did real work for no value. This is wasted work, and it makes recovery harder because threads are spending time on requests that no one is waiting for.
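The timeout arithmetic in the last two paragraphs can be written down directly. The sketch below is not part of Tomcat or Spring; QueueBudget and its methods are hypothetical names, and it assumes the client timeout and the time spent queued are both known, which in practice requires the client to send a deadline or the server to record when the request arrived.

```java
import java.time.Duration;

/** Illustrative only: how much of the client's timeout is left after queuing. */
final class QueueBudget {

    /** Time left for actual processing once a thread picks the request up. */
    static Duration remaining(Duration clientTimeout, Duration queueWait) {
        Duration left = clientTimeout.minus(queueWait);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** True when the client gave up while the request was still queued. */
    static boolean isWastedWork(Duration clientTimeout, Duration queueWait) {
        return queueWait.compareTo(clientTimeout) >= 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(5);
        // 4 seconds in the queue leaves 1 second for processing
        System.out.println(remaining(timeout, Duration.ofSeconds(4)).toMillis() + " ms left");
        // 6 seconds in the queue: the client is gone, any processing is wasted
        System.out.println("wasted: " + isWastedWork(timeout, Duration.ofSeconds(6)));
    }
}
```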

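The fix is to set server.tomcat.accept-count to a small number and let excess connections be refused immediately rather than queued. Because the accept queue only comes into play once max-connections is reached, size max-connections deliberately as well; with the default of 8192, thousands of connections can wait for a thread before anything is refused: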

server:
  tomcat:
    accept-count: 10 # EXPLICIT AND INTENTIONAL
    # Small queue means connection refused under load
    # Connection refused is a fast, clear signal to the client
    # A long queue is a slow, ambiguous signal that wastes resources

A refused connection is a gift to the client. It takes milliseconds and the client can retry immediately, potentially against a different instance. A queued request that times out takes seconds and wastes resources on both sides.
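The same trade-off can be observed in plain java.util.concurrent. The sketch below is an analogy, not Tomcat’s connector: it uses a ThreadPoolExecutor with a small bounded queue and AbortPolicy to stand in for a small accept-count, and a latch to stand in for a stalled dependency. FailFastDemo and countRejected are hypothetical names chosen for this illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Analogy only: a bounded pool that rejects work instead of queuing it indefinitely. */
final class FailFastDemo {

    /** Submits tasks that all block on a "slow dependency" and counts immediate rejections. */
    static int countRejected(int threads, int queueCapacity, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch slowDependency = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            slowDependency.await(); // simulate a stalled downstream call
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // fast, unambiguous failure: the caller can react now
                }
            }
        } finally {
            slowDependency.countDown(); // release the blocked tasks
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 threads + queue of 1 can hold 3 blocked requests; the other 2 are rejected
        System.out.println("rejected: " + countRejected(2, 1, 5));
    }
}
```

With two threads and a queue of one, five stalled requests produce two immediate rejections. Enlarging the queue makes the rejections disappear, but only by converting them into requests that wait, which is the trap this section describes.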