Diagnosing Connection Pool Exhaustion Under Load
The Symptom
The telemetry ingestion service starts returning HTTP 500 errors as load reaches 800 requests per second. The MongoDB server is not under stress: CPU is at 40% and WiredTiger cache usage is at 60%. The application log shows:
com.mongodb.MongoWaitQueueFullException: Timeout waiting for a pooled item after 5000 MILLISECONDS
The database is healthy. The application is starving for connections.
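The failure mode can be reproduced in miniature. The sketch below is a toy model of a bounded pool with a wait timeout, built on a plain `Semaphore`; it is not the MongoDB driver's implementation, and the class and method names (`BoundedPoolModel`, `checkOut`, `checkIn`) are made up for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of a bounded connection pool (hypothetical names, not driver code).
final class BoundedPoolModel {
    private final Semaphore permits;
    private final long maxWaitMillis;

    BoundedPoolModel(int maxSize, long maxWaitMillis) {
        this.permits = new Semaphore(maxSize, true); // fair: waiters are served FIFO
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Returns true if a connection was checked out before the wait timeout elapsed. */
    boolean checkOut() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a connection to the pool. */
    void checkIn() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedPoolModel pool = new BoundedPoolModel(2, 50);
        System.out.println(pool.checkOut()); // true
        System.out.println(pool.checkOut()); // true
        System.out.println(pool.checkOut()); // false: pool exhausted, wait timed out
        pool.checkIn();
        System.out.println(pool.checkOut()); // true: a connection was returned
    }
}
```

The third checkout fails even though nothing is wrong with the "server" in this model. The only thing missing is a free slot in the pool, which is the same situation the log line describes.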
The Cause
The default maxPoolSize is 100. At 800 req/sec with an average operation duration of 15ms, the pool needs only 12 connections in steady state (800 × 0.015 = 12). But during a WiredTiger checkpoint, operation latency spikes to 200ms for 2-3 seconds. During that window:
$$\text{requiredConnections} = 800 \times 0.200 = 160$$
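The pool has 100 connections, so it is 60 short. At 200ms per operation, 100 connections complete at most 100 / 0.200 = 500 operations per second, so roughly 300 of the 800 requests arriving each second enter the wait queue. With maxWaitTime at 5 seconds, as the log shows, those requests accumulate rapidly.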
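The arithmetic in this section is an application of Little's Law (concurrency = arrival rate × time in system). A minimal sketch follows, using the numbers from the text. The class and method names are hypothetical, and the model assumes every request holds exactly one connection for the full operation latency.

```java
// Little's Law sketch: concurrency = arrival rate x time each request holds a connection.
final class PoolSizing {

    /** Connections needed for ratePerSec requests that each hold a connection for latencyMs. */
    static long requiredConnections(long ratePerSec, long latencyMs) {
        // Integer ceiling division: a fractional connection rounds up to a whole one.
        return (ratePerSec * latencyMs + 999) / 1000;
    }

    /** Requests per second that a fully used pool can complete at the given latency. */
    static long poolThroughput(long poolSize, long latencyMs) {
        return poolSize * 1000 / latencyMs;
    }

    /** Requests per second that find no free connection and join the wait queue. */
    static long backlogGrowth(long ratePerSec, long poolSize, long latencyMs) {
        return Math.max(0, ratePerSec - poolThroughput(poolSize, latencyMs));
    }

    public static void main(String[] args) {
        System.out.println(requiredConnections(800, 15));  // 12  (steady state)
        System.out.println(requiredConnections(800, 200)); // 160 (during a checkpoint)
        System.out.println(poolThroughput(100, 200));      // 500
        System.out.println(backlogGrowth(800, 100, 200));  // 300
    }
}
```

Note the two different quantities: 60 is the shortfall in concurrent connections (160 needed, 100 available), while 300 is the rate per second at which the wait queue grows during the spike.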
The Benchmark
// k6 test: connection pool exhaustion detection
import http from 'k6/http';
import { Trend, Rate } from 'k6/metrics';

// Custom metrics: request latency (tracked as a time value) and the share of failed requests.
const latency = new Trend('req_latency', true);
const errors = new Rate('error_rate');

export const options = {
  scenarios: {
    ramp_up: {
      // Open model: k6 drives the arrival rate regardless of how slowly the service responds.
      executor: 'ramping-arrival-rate',
      startRate: 100,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 1000,
      // Each stage ramps linearly to its target rate (requests per second).
      stages: [
        { duration: '1m', target: 100 },
        { duration: '1m', target: 300 },
        { duration: '1m', target: 500 },
        { duration: '1m', target: 800 },
        { duration: '1m', target: 1000 },
      ],
    },
  },
};

export default function () {
  // Spread writes across 10,000 sensor IDs: sensor-00000 through sensor-09999.
  const sensorId = `sensor-${String(Math.floor(Math.random() * 10000)).padStart(5, '0')}`;
  const res = http.post(`${__ENV.BASE_URL}/api/telemetry/ingest`, JSON.stringify({
    sensorId: sensorId,
    timestamp: new Date().toISOString(),
    temperature: 20 + Math.random() * 15,
    humidity: 40 + Math.random() * 30,
  }), { headers: { 'Content-Type': 'application/json' } });

  latency.add(res.timings.duration);
  errors.add(res.status !== 201); // anything other than 201 Created counts as an error
}
Results with maxPoolSize=100:
| Load (req/sec) | p50 | p95 | p99 | Error rate |
|---|---|---|---|---|
| 100 | 8ms | 15ms | 42ms | 0% |
| 300 | 9ms | 18ms | 55ms | 0% |
| 500 | 12ms | 45ms | 280ms | 0.1% |
| 800 | 15ms | 320ms | 2,800ms | 4.2% |
| 1000 | 18ms | 1,200ms | 5,000ms | 12.8% |
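The inflection point is 500 req/sec: p99 jumps from 55ms to 280ms and the first errors appear. This is also the rate at which a 200ms checkpoint spike needs all 100 connections (500 × 0.200 = 100). Below it, the pool handles the load. Above it, wait queue time dominates the latency.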
The Fix
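// Pool sized for peak throughput: 160 connections needed during a checkpoint, plus headroom.
// Metrics is assumed to be Micrometer's global registry facade.
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.event.ConnectionCheckOutFailedEvent;
import com.mongodb.event.ConnectionCheckedOutEvent;
import com.mongodb.event.ConnectionPoolListener;
import io.micrometer.core.instrument.Metrics;
import java.util.concurrent.TimeUnit;

MongoClientSettings settings = MongoClientSettings.builder()
    .applyConnectionString(new ConnectionString("mongodb://mongo-primary:27017"))
    .applyToConnectionPoolSettings(builder -> builder
        .maxSize(200)
        .minSize(30)
        // Fail fast: a request waits at most 2 seconds for a connection.
        .maxWaitTime(2, TimeUnit.SECONDS)
        .maxConnectionIdleTime(5, TimeUnit.MINUTES)
        // Count checkouts and failed checkouts so pool pressure shows up in metrics.
        .addConnectionPoolListener(new ConnectionPoolListener() {
            @Override
            public void connectionCheckedOut(ConnectionCheckedOutEvent event) {
                Metrics.counter("mongodb.pool.checkout").increment();
            }

            @Override
            public void connectionCheckOutFailed(ConnectionCheckOutFailedEvent event) {
                Metrics.counter("mongodb.pool.checkout.failed",
                    "reason", event.getReason().name()).increment();
            }
        })
    )
    .build();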
The Proof
Results with maxPoolSize=200:
| Load (req/sec) | p50 | p95 | p99 | Error rate |
|---|---|---|---|---|
| 100 | 8ms | 14ms | 38ms | 0% |
| 300 | 8ms | 16ms | 45ms | 0% |
| 500 | 9ms | 18ms | 52ms | 0% |
| 800 | 11ms | 25ms | 85ms | 0% |
| 1000 | 14ms | 42ms | 180ms | 0.02% |
p99 at 1,000 req/sec dropped from 5,000ms to 180ms. Error rate dropped from 12.8% to 0.02%.
The Trade-off
200 connections means 200 TCP sockets to the MongoDB server, each consuming approximately 1 MB of memory on the server side (for the connection’s thread stack, input buffer, and authentication state). At 200 connections, that is about 200 MB of MongoDB server memory dedicated to connection management. On a server with 32 GB RAM and 24 GB allocated to WiredTiger cache, 200 MB is acceptable. On a smaller instance, it may not be.

The pool size applies per application instance. If multiple instances each open 200 connections, the server sees the sum of all of them, and both the connection count and the memory cost scale with the number of instances. MongoDB’s default maxIncomingConnections is 65,536, but practical limits are lower due to memory and file descriptor constraints.
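A quick way to sanity-check a pool size against the server is to multiply it out across application instances. The sketch below uses the approximate figure of 1 MB per connection from the text; the class name, method names, instance count, and memory budget are illustrative assumptions, and the model ignores the driver's monitoring connections.

```java
// Back-of-the-envelope server-side cost of client connection pools (hypothetical helper).
final class ConnectionBudget {
    // Assumption taken from the text: roughly 1 MB of server memory per connection.
    static final long MB_PER_CONNECTION = 1;

    /** Worst-case connections the server sees if every instance fills its pool. */
    static long totalConnections(int appInstances, int maxPoolSize) {
        return (long) appInstances * maxPoolSize;
    }

    /** Approximate server memory, in MB, consumed by those connections. */
    static long serverMemoryMb(int appInstances, int maxPoolSize) {
        return totalConnections(appInstances, maxPoolSize) * MB_PER_CONNECTION;
    }

    /** True if the worst-case connection memory stays within the given budget in MB. */
    static boolean fitsBudget(int appInstances, int maxPoolSize, long budgetMb) {
        return serverMemoryMb(appInstances, maxPoolSize) <= budgetMb;
    }

    public static void main(String[] args) {
        System.out.println(serverMemoryMb(1, 200));        // 200
        System.out.println(serverMemoryMb(10, 200));       // 2000
        System.out.println(fitsBudget(10, 200, 1024));     // false: 2,000 MB exceeds a 1 GB budget
    }
}
```

With ten instances, the same maxPoolSize of 200 costs roughly 2 GB on the server in the worst case, which is why the pool size should be chosen together with the expected instance count.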