Thread Pool Exhaustion Under Load

The Symptom

The ride-hailing platform migrated the rider matching service to Spring WebFlux. Performance improved 4x in staging. In production, after two days, the service starts intermittently freezing for 200-500ms. The freezes happen once every 30 seconds, affecting every request that arrives during the freeze window. CPU is at 12%. Memory is stable. No GC pauses. No error logs.

The Cause

A blocking JDBC call is executing on a Netty event loop thread. The event loop group has 4 threads (one per CPU core). When one thread blocks on JDBC for 200ms, 25% of the service’s request-handling capacity is gone. Every request assigned to that thread waits in its queue until the blocking call returns.
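The queueing effect can be reproduced without Netty. The sketch below is a JDK-only model, not the service's code: a single-thread executor stands in for one event loop thread, Thread.sleep stands in for the JDBC call, and the class and method names are illustrative.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class EventLoopStall {

    // Share of request-handling capacity lost while some loop threads are blocked.
    static double capacityLostPercent(int blockedThreads, int totalThreads) {
        return 100.0 * blockedThreads / totalThreads;
    }

    // How long (ms) a trivial task waits when it is queued on a single
    // "event loop" thread behind a blocking task of the given duration.
    static long queuedDelayMillis(long blockingMillis) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            long submittedAt = System.nanoTime();
            // Stand-in for the blocking JDBC call.
            eventLoop.execute(() -> sleep(blockingMillis));
            // A request with no blocking work that lands on the same thread.
            Callable<Long> probe = System::nanoTime;
            long startedAt = eventLoop.submit(probe).get();
            return TimeUnit.NANOSECONDS.toMillis(startedAt - submittedAt);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            eventLoop.shutdownNow();
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("capacity lost: " + capacityLostPercent(1, 4) + "%");
        System.out.println("probe waited about " + queuedDelayMillis(200) + " ms");
    }
}
```

The probe task does no work of its own, yet it cannot start until the sleeping task ahead of it finishes. That is the same head-of-line wait a request sees when it is assigned to a blocked event loop thread.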

// BOTTLENECK: Blocking JDBC call on the reactive event loop
@RestController
public class RideMatchingController {

    private final DriverRepository driverRepository;  // Spring Data JPA (blocking)
    private final ReactiveRedisTemplate<String, String> redisTemplate;

    @PostMapping("/api/rides/match")
    public Mono<DriverMatch> matchDriver(@RequestBody RideRequest request) {
        return redisTemplate.opsForGeo()
            .search("driver:locations",
                GeoReference.fromCoordinate(request.lng(), request.lat()),
                new Distance(5, Metrics.KILOMETERS))
            .collectList()
            .map(nearbyDrivers -> {
                // THIS LINE IS THE PROBLEM
                // driverRepository is Spring Data JPA - it blocks
                List<Driver> available = driverRepository
                    .findAvailableByIds(  // JDBC call on event loop thread!
                        nearbyDrivers.stream()
                            .map(r -> r.getContent().getName())
                            .toList()
                    );
                return selectBestMatch(available, request);
            });
    }
}

The code looks reactive because it starts with a Mono chain. But driverRepository.findAvailableByIds() is a Spring Data JPA method. JPA uses JDBC. JDBC is blocking. The .map() operator executes the provided function on the thread that emitted the upstream signal, which is a Netty event loop thread. A blocking call inside .map() blocks the event loop.
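The threading rule can be illustrated with the JDK alone. The sketch below is an analogy, not Reactor itself: CompletableFuture.thenApply, attached before the upstream completes, runs its function on the completing thread, much as .map() runs on the thread that emits the upstream signal. The thread name event-loop-0 and the class name are illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TransformThreadDemo {

    // Returns the name of the thread that ran a synchronous transform
    // attached to a future completed by the "event-loop-0" thread.
    static String threadRunningTransform() {
        ExecutorService eventLoop =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop-0"));
        CountDownLatch transformAttached = new CountDownLatch(1);
        try {
            CompletableFuture<String> upstream = CompletableFuture.supplyAsync(() -> {
                await(transformAttached);  // emit only after the transform is attached
                return "nearby-drivers";
            }, eventLoop);

            // Synchronous transform: no thread switch happens here.
            CompletableFuture<String> ranOn =
                upstream.thenApply(ignored -> Thread.currentThread().getName());

            transformAttached.countDown();
            return ranOn.join();
        } finally {
            eventLoop.shutdownNow();
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("transform ran on: " + threadRunningTransform());
    }
}
```

Anything the transform does, including a blocking call, occupies the thread that delivered the upstream value.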

The fix is not to avoid JPA. The fix is to offload the blocking call to a thread pool designed for blocking work.

The Baseline

Monitoring the event loop threads reveals the problem:

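# Netty event loop threads in TIMED_WAITING state (should be 0)
# Assumes per-thread instrumentation: Micrometer's built-in jvm_threads_states_threads carries only the state label
jvm_threads_states_threads{state="timed-waiting", thread_name=~"reactor-http-nio-.*"}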

# Reactor scheduler pending tasks (should be near 0)
reactor_scheduler_tasks_pending{scheduler="boundedElastic"}

Locust test with the blocking call on the event loop:

At 500 concurrent users:
  /api/rides/match  p50: 45ms  p99: 850ms  RPS: 420  Fail: 0.0%

  But every ~30 seconds, a burst of requests shows p99 > 2000ms
  The intermittent spike correlates with the JDBC call timing

The p99 of 850ms is acceptable, but the intermittent spikes to 2,000ms+ are not. They happen when the JDBC call coincides with a burst of incoming requests on the same event loop thread.

The Fix

Schedulers.boundedElastic() for Blocking Calls

Schedulers.boundedElastic() is a scheduler backed by a thread pool designed for blocking I/O. It creates threads on demand (up to a configurable maximum, default 10 × CPU cores), expires idle threads after 60 seconds, and bounds the task queue to 100,000 tasks per backing thread.
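For readers more familiar with java.util.concurrent, the same three limits can be approximated with a plain ThreadPoolExecutor. This is a rough JDK analogue under stated assumptions, not how Reactor implements the scheduler: it uses one shared queue rather than a queue per backing thread, and the class and method names are illustrative.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BlockingPoolSketch {

    static final int QUEUE_CAP = 100_000;      // queued task limit
    static final long IDLE_TTL_SECONDS = 60;   // idle thread lifetime

    // Thread cap: 10 x CPU cores, mirroring the default described above.
    static int threadCap() {
        return 10 * Runtime.getRuntime().availableProcessors();
    }

    static ThreadPoolExecutor newBlockingPool() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threadCap(), threadCap(),
            IDLE_TTL_SECONDS, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(QUEUE_CAP));
        // Let every thread expire when idle, so the pool shrinks back to zero.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newBlockingPool();
        pool.execute(() -> System.out.println(
            "blocking work on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```

Threads are created only as tasks arrive. Once the cap is reached, further tasks wait in the bounded queue, and a full queue rejects new submissions instead of growing without limit.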

// SCALED: Offload blocking JDBC to boundedElastic scheduler
@RestController
public class RideMatchingController {

    private final DriverRepository driverRepository;
    private final ReactiveRedisTemplate<String, String> redisTemplate;

    @PostMapping("/api/rides/match")
    public Mono<DriverMatch> matchDriver(@RequestBody RideRequest request) {
        return redisTemplate.opsForGeo()
            .search("driver:locations",
                GeoReference.fromCoordinate(request.lng(), request.lat()),
                new Distance(5, Metrics.KILOMETERS))
            .collectList()
            .flatMap(nearbyDrivers -> {
                List<String> driverIds = nearbyDrivers.stream()
                    .map(r -> r.getContent().getName())
                    .toList();

                // Offload the blocking JDBC call
                return Mono.fromCallable(() ->
                        driverRepository.findAvailableByIds(driverIds))
                    .subscribeOn(Schedulers.boundedElastic());
            })
            .map(available -> selectBestMatch(available, request));
    }
}

The key change: Mono.fromCallable() wraps the blocking call. .subscribeOn(Schedulers.boundedElastic()) moves the subscription (and therefore the blocking call) to a bounded elastic thread. The event loop thread is released immediately.
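The effect of the handoff can be sketched with the JDK alone. This is an analogy under stated assumptions, not the Reactor code above: a single-thread executor plays the event loop, a cached thread pool plays the bounded elastic scheduler, and the names OffloadSketch, blockingLookup, and driver-42 are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class OffloadSketch {

    // Returns how long (ms) a trivial task waited on the "event loop" thread
    // while a blocking lookup of the given duration ran on a separate pool.
    static long probeDelayWithOffload(long blockingMillis) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        ExecutorService blockingPool = Executors.newCachedThreadPool();
        try {
            long submittedAt = System.nanoTime();

            // Request 1: the event loop thread only schedules the blocking
            // lookup on the other pool and returns immediately.
            Callable<CompletableFuture<String>> request1 = () ->
                CompletableFuture.supplyAsync(
                    () -> blockingLookup(blockingMillis), blockingPool);
            CompletableFuture<String> lookup = eventLoop.submit(request1).get();

            // Request 2: trivial work queued on the same event loop thread.
            Callable<Long> probe = System::nanoTime;
            long startedAt = eventLoop.submit(probe).get();

            lookup.join();  // the offloaded result still arrives, just later
            return TimeUnit.NANOSECONDS.toMillis(startedAt - submittedAt);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            eventLoop.shutdownNow();
            blockingPool.shutdownNow();
        }
    }

    // Stand-in for the blocking JDBC call.
    private static String blockingLookup(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "driver-42";
    }

    public static void main(String[] args) {
        System.out.println("probe waited about " + probeDelayWithOffload(500) + " ms");
    }
}
```

The second request starts almost at once, because the event loop thread spent only the time needed to schedule the lookup, not the time needed to run it.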

The Better Fix: R2DBC

If the service is fully reactive, replace Spring Data JPA with Spring Data R2DBC:

// SCALED: Non-blocking database access with R2DBC
@Repository
public interface DriverReactiveRepository
        extends ReactiveCrudRepository<Driver, String> {

    @Query("SELECT * FROM drivers WHERE id IN (:ids) AND status = 'AVAILABLE'")
    Flux<Driver> findAvailableByIds(@Param("ids") List<String> ids);
}
// Generated SQL (same as JPA, but non-blocking execution):
// SELECT * FROM drivers WHERE id IN ($1,$2,$3,...) AND status = 'AVAILABLE'

// SCALED: Fully reactive controller with R2DBC
@RestController
public class RideMatchingController {

    private final DriverReactiveRepository driverRepository;
    private final ReactiveRedisTemplate<String, String> redisTemplate;

    @PostMapping("/api/rides/match")
    public Mono<DriverMatch> matchDriver(@RequestBody RideRequest request) {
        return redisTemplate.opsForGeo()
            .search("driver:locations",
                GeoReference.fromCoordinate(request.lng(), request.lat()),
                new Distance(5, Metrics.KILOMETERS))
            .map(result -> result.getContent().getName())
            .collectList()
            .flatMapMany(driverRepository::findAvailableByIds)
            .collectList()
            .map(available -> selectBestMatch(available, request));
    }
}

No .subscribeOn() needed. No thread pool handoff. The R2DBC driver uses the same event loop model as the Netty HTTP handler. The entire request, from HTTP accept to Redis lookup to PostgreSQL query to HTTP response, executes on event loop threads without blocking.

Monitoring Thread Pool Health

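// SCALED: Custom metrics for thread pool visibility
// Requires the reactor-core-micrometer module (Reactor 3.5+)
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.core.observability.micrometer.Micrometer;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

@Configuration
public class SchedulerMetrics {

    // Dedicated scheduler for blocking JDBC work, wrapped so that the tasks
    // it runs are recorded in the Micrometer registry. Pass this bean to
    // .subscribeOn(...) in place of Schedulers.boundedElastic().
    @Bean(destroyMethod = "dispose")
    public Scheduler jdbcScheduler(MeterRegistry registry) {
        Scheduler boundedElastic = Schedulers.newBoundedElastic(
            Runtime.getRuntime().availableProcessors() * 10,  // thread cap
            100_000,                                          // queued task cap
            "custom-be",                                      // thread name prefix
            60,                                               // idle thread TTL in seconds
            false);                                           // non-daemon threads
        return Micrometer.timedScheduler(boundedElastic, registry, "reactor");
    }
}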

Prometheus queries for thread pool health:

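# Event loop threads that are blocked (should ALWAYS be 0)
# thread_name is assumed to come from custom per-thread instrumentation; it is not a label on Micrometer's built-in metric
jvm_threads_states_threads{state="blocked", thread_name=~"reactor-http-nio-.*"}

# Bounded elastic pool utilization
# Metric and label names depend on how the scheduler is instrumented; confirm them against the scrape output
executor_pool_size_threads{name="boundedElastic"}
executor_active_threads{name="boundedElastic"}
executor_queued_tasks{name="boundedElastic"}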
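Where per-thread labels are not available from the metrics pipeline, the JVM's own thread MXBean gives a comparable view. A minimal sketch, assuming the event loop threads keep the reactor-http-nio- name prefix used in the queries above; the class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class EventLoopThreadStates {

    // Counts live threads whose name starts with namePrefix and that are
    // currently in the given state.
    static long countThreads(String namePrefix, Thread.State state) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long count = 0;
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            // A thread can exit between listing its id and reading its info,
            // in which case the entry is null.
            if (info != null
                    && info.getThreadName().startsWith(namePrefix)
                    && info.getThreadState() == state) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (Thread.State state : Thread.State.values()) {
            System.out.println(state + ": " + countThreads("reactor-http-nio-", state));
        }
    }
}
```

One caveat: a thread waiting inside a socket read typically reports RUNNABLE rather than BLOCKED or TIMED_WAITING, so thread state alone can miss a JDBC call that is waiting on the network.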

The Proof

Locust test comparing three approaches at 500 concurrent users:

1. Blocking JDBC on event loop (BOTTLENECK):
   p50: 45ms   p99: 850ms + intermittent 2000ms spikes
   RPS: 420    Fail: 0.0%

2. JDBC offloaded to boundedElastic (SCALED):
   p50: 52ms   p99: 380ms (no spikes)
   RPS: 680    Fail: 0.0%

3. R2DBC (fully reactive, SCALED):
   p50: 38ms   p99: 210ms
   RPS: 1240   Fail: 0.0%

Delta (option 1 → option 3):
  p99:  850ms → 210ms   (4x improvement, no intermittent spikes)
  RPS:  420 → 1240      (3x throughput)

Option 2 (boundedElastic) is a reasonable interim fix. It eliminates the intermittent spikes and improves throughput by about 62% (420 → 680 RPS). Option 3 (R2DBC) is the correct long-term solution for a fully reactive service. It eliminates the thread pool handoff entirely.
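The headline ratios follow directly from the Locust numbers above. A small worked check, using only figures quoted in this section; the class and method names are illustrative.

```java
import java.util.Locale;

final class LoadTestDelta {

    // How many times larger the first value is than the second.
    static double ratio(double larger, double smaller) {
        return larger / smaller;
    }

    // Relative increase from before to after, in percent.
    static double percentIncrease(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // p99 latency, option 1 vs option 3: 850ms vs 210ms
        System.out.println(String.format(Locale.ROOT,
            "p99 reduction: %.2fx", ratio(850, 210)));              // 4.05x
        // Throughput, option 3 vs option 1: 1240 RPS vs 420 RPS
        System.out.println(String.format(Locale.ROOT,
            "RPS gain (R2DBC): %.2fx", ratio(1240, 420)));          // 2.95x
        // Throughput, option 1 to option 2: 420 RPS to 680 RPS
        System.out.println(String.format(Locale.ROOT,
            "RPS gain (boundedElastic): +%.1f%%", percentIncrease(420, 680)));  // +61.9%
    }
}
```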

The thread pool is no longer a bottleneck. The connection pool (CH4-S1) and thread pool (this section) are correctly sized. The next bottleneck is the work the system does repeatedly that it should do once: caching, starting in Chapter 5.