Connection Reuse and Persistent Connections

The main chapter demonstrated that connection reuse reduces per-request overhead from 4.2ms to 0.58ms (about a 7.2x improvement). This section covers the engineering of connection pools: how to size them, when to expire connections, how to warm them on startup, and how to handle the edge cases that cause production incidents.

Keep-Alive Mechanics

HTTP/1.1 connections are persistent by default (Connection: keep-alive is implicit). The connection stays open for subsequent requests unless either side sends Connection: close. But “open” does not mean “usable forever.” Both client and server have idle timeouts:

Connection lifecycle:
  1. Client opens connection (TCP + TLS handshake)
  2. Client sends request, server responds
  3. Connection is idle (waiting for next request)
  4. If idle time > server timeout: server closes connection (FIN)
  5. If idle time > client timeout: client closes connection (FIN)
  6. If client sends on closed connection: RST → client retries on new connection

Race condition: Client sends request at the exact moment server closes.
Result: "Connection reset by peer" error. Client must retry (automatic retry is only safe for idempotent requests).

The critical configuration: client idle timeout must be shorter than server idle timeout. If the server closes first, the client discovers a dead connection only when it tries to send:

// SLOW: Client timeout > Server timeout (causes connection reset errors)
//   Server keep-alive timeout: 60s
//   Client keep-alive timeout: 120s
//   → Between 60-120s idle: client thinks connection is alive, server has closed
//   → Next request gets RST, client retries (adding latency for that request)

// FAST: Client timeout < Server timeout (client proactively closes before server)
//   Server keep-alive timeout: 60s
//   Client keep-alive timeout: 55s (5s safety margin)
//   → Client closes idle connections at 55s
//   → Server never needs to send RST
//   → No surprise connection resets

// Apache HttpClient 5 configuration:
public class KeepAliveConfig {

    public CloseableHttpClient createClient() {
        ConnectionConfig connectionConfig = ConnectionConfig.custom()
            .setConnectTimeout(Timeout.ofSeconds(2))
            .setSocketTimeout(Timeout.ofSeconds(5))
            .setTimeToLive(TimeValue.ofMinutes(10))      // Max connection age
            .setValidateAfterInactivity(TimeValue.ofSeconds(2)) // Check before reuse
            .build();

        PoolingHttpClientConnectionManager connectionManager =
            PoolingHttpClientConnectionManagerBuilder.create()
                .setDefaultConnectionConfig(connectionConfig)
                .setMaxConnTotal(200)           // Total pool size
                .setMaxConnPerRoute(50)         // Per-host limit
                .build();

        return HttpClients.custom()
            .setConnectionManager(connectionManager)
            .setKeepAliveStrategy((response, context) -> {
                // Parse the Keep-Alive response header if present.
                // MessageSupport is in org.apache.hc.core5.http.message,
                // HttpHeaders and HeaderElement in org.apache.hc.core5.http.
                Iterator<HeaderElement> it =
                    MessageSupport.iterate(response, HttpHeaders.KEEP_ALIVE);
                while (it.hasNext()) {
                    HeaderElement he = it.next();
                    if ("timeout".equalsIgnoreCase(he.getName()) && he.getValue() != null) {
                        try {
                            long timeout = Long.parseLong(he.getValue());
                            // Stay 5s under the server timeout, never below 1s
                            return TimeValue.ofSeconds(Math.max(timeout - 5, 1));
                        } catch (NumberFormatException ignore) {
                            // Malformed value: fall through to the default
                        }
                    }
                }
                // Default: 55s (assuming server default of 60s)
                return TimeValue.ofSeconds(55);
            })
            // Background thread evicts connections idle for longer than 55s
            .evictIdleConnections(TimeValue.ofSeconds(55))
            .build();
    }
}
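The margin rule can be checked in isolation from any HTTP library. The following is a minimal, library-free sketch: `KeepAliveMargin` and its method names are hypothetical, and the 5-second margin and 55-second default are taken from the configuration above. It parses a Keep-Alive header value such as `timeout=60, max=1000` and returns a client idle timeout that stays under the server's.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper: derives a client idle timeout from a Keep-Alive header value. */
final class KeepAliveMargin {
    private static final Duration DEFAULT_CLIENT_TIMEOUT = Duration.ofSeconds(55);
    private static final long SAFETY_MARGIN_SECONDS = 5;

    /** Extracts the "timeout" parameter (seconds) from e.g. "timeout=60, max=1000". */
    static Optional<Long> serverTimeoutSeconds(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        for (String part : headerValue.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("timeout")) {
                try {
                    return Optional.of(Long.parseLong(kv[1].trim()));
                } catch (NumberFormatException e) {
                    return Optional.empty(); // malformed value: caller uses the default
                }
            }
        }
        return Optional.empty();
    }

    /** Server timeout minus the safety margin, never below one second. */
    static Duration clientIdleTimeout(String headerValue) {
        return serverTimeoutSeconds(headerValue)
            .map(s -> Duration.ofSeconds(Math.max(s - SAFETY_MARGIN_SECONDS, 1)))
            .orElse(DEFAULT_CLIENT_TIMEOUT);
    }
}
```

With these assumptions, `clientIdleTimeout("timeout=60, max=1000")` yields 55 seconds, and a missing or malformed header falls back to the 55-second default. Very short server timeouts are clamped to one second, so they leave little or no margin and are better handled by disabling reuse for that host.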

Connection Pool Sizing

Pool size determines concurrency capacity. Too small: requests queue waiting for connections. Too large: connections sit idle, wasting server resources and file descriptors.

// Pool sizing formula for the content platform:
//
// Required connections = ceil(requests_per_second * avg_latency_seconds)   (Little's Law)
//
// Article service → Search service:
//   Requests: 833/s (50,000/min)
//   Avg latency: 12ms = 0.012s
//   Required connections: 833 * 0.012 = 10
//   With headroom (2x for bursts): 20 connections
//
// Article service → Recommendation service:
//   Requests: 833/s
//   Avg latency: 8ms = 0.008s
//   Required connections: 833 * 0.008 = 7
//   With headroom: 14 connections
//
// Article service → Analytics service:
//   Requests: 833/s (fire-and-forget)
//   Avg latency: 3ms = 0.003s
//   Required connections: 833 * 0.003 = 3
//   With headroom: 6 connections
//
// Article service → Image service:
//   Requests: 833/s
//   Avg latency: 5ms = 0.005s
//   Required connections: 833 * 0.005 = 5
//   With headroom: 10 connections
//
// Total pool: 50 connections across all downstream services
// With HTTP/2 multiplexing (100 streams/connection): 5 connections total
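The sizing arithmetic above can be reproduced with a few lines of code. This is an illustrative sketch only: `PoolSizing` and its methods are hypothetical names, and the request rate, latencies, and 2x headroom factor are the figures quoted in the comments above.

```java
/** Hypothetical helper illustrating the sizing arithmetic; not part of any library. */
final class PoolSizing {

    /** Little's Law: average in-flight requests = arrival rate * average latency, rounded up. */
    static int requiredConnections(double requestsPerSecond, double avgLatencySeconds) {
        return (int) Math.ceil(requestsPerSecond * avgLatencySeconds);
    }

    /** Applies a burst headroom factor on top of the steady-state requirement. */
    static int withHeadroom(double requestsPerSecond, double avgLatencySeconds, double factor) {
        return (int) Math.ceil(requiredConnections(requestsPerSecond, avgLatencySeconds) * factor);
    }

    public static void main(String[] args) {
        double rps = 833;
        String[] services = {"search", "recommendation", "analytics", "image"};
        double[] latencySeconds = {0.012, 0.008, 0.003, 0.005};

        int total = 0;
        for (int i = 0; i < services.length; i++) {
            int sized = withHeadroom(rps, latencySeconds[i], 2.0);
            total += sized;
            System.out.println(services[i] + ": " + sized);
        }
        System.out.println("total: " + total); // prints "total: 50"
    }
}
```

Average latency gives the average concurrency only. If latency is heavily skewed, sizing from a high percentile instead of the mean is a more conservative choice.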

@Configuration  // Required for Spring to process the @Bean methods below
public class ConnectionPoolConfig {

    // HTTP/1.1 pool (when downstream does not support HTTP/2):
    @Bean
    public CloseableHttpClient http1Client() {
        PoolingHttpClientConnectionManager cm =
            PoolingHttpClientConnectionManagerBuilder.create()
                .setMaxConnTotal(50)
                .setMaxConnPerRoute(20)  // Per downstream service
                .build();

        return HttpClients.custom()
            .setConnectionManager(cm)
            .build();
    }

    // HTTP/2 pool (Reactor Netty, used by Spring WebFlux WebClient):
    @Bean
    public ConnectionProvider http2ConnectionProvider() {
        return ConnectionProvider.builder("content-platform")
            .maxConnections(50)                         // Total connections
            .pendingAcquireMaxCount(200)                // Queue size when pool full
            .pendingAcquireTimeout(Duration.ofSeconds(3)) // Wait time for connection
            .maxIdleTime(Duration.ofSeconds(55))        // Evict idle connections
            .maxLifeTime(Duration.ofMinutes(10))        // Force rotation
            .evictInBackground(Duration.ofSeconds(30))  // Cleanup interval
            .metrics(true)                             // Expose pool metrics
            .build();
    }
}

Max Connection Lifetime: Why Connections Must Die

Persistent connections create a problem with DNS-based load balancing. When a service scales from 3 to 6 pods, existing connections remain pinned to the original 3 pods. New pods receive no traffic until old connections close:

Timeline of scaling event:
  t=0:   3 pods, 50 connections spread evenly (~17 per pod)
  t=1:   Scale to 6 pods. New pods created.
  t=2:   50 existing connections still point to original 3 pods
         New pods have 0 connections
  t=60:  Still 0 connections to new pods (keep-alive keeps old alive)
  t=600: Max lifetime reached, connections rotate to all 6 pods

Without max lifetime: new pods get traffic only when old connections happen to close (errors, idle eviction, or old pods restarting)
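The timeline can be reproduced with a toy model. The sketch below is a deliberate simplification with hypothetical names: every connection opens at t=0, round-robin assignment stands in for DNS resolution, and traffic is steady enough that no connection is ever closed for being idle. It isolates the effect of max lifetime and ignores the churn from errors and idle eviction that a real pool would see.

```java
import java.util.Arrays;

/** Toy model of connection pinning during a scale-out; names and rules are illustrative. */
final class ScalingSimulation {

    /**
     * Connections per pod at time atSeconds. All connections open at t=0 against
     * initialPods; the service scales to scaledPods at t=1. A connection is only
     * re-resolved once it reaches maxLifetimeSeconds (0 disables rotation).
     */
    static int[] connectionsPerPod(int connections, int initialPods, int scaledPods,
                                   long maxLifetimeSeconds, long atSeconds) {
        int[] perPod = new int[scaledPods];
        boolean rotated = maxLifetimeSeconds > 0 && atSeconds >= maxLifetimeSeconds;
        int visiblePods = rotated ? scaledPods : initialPods;
        for (int c = 0; c < connections; c++) {
            perPod[c % visiblePods]++; // round-robin stands in for DNS resolution
        }
        return perPod;
    }

    public static void main(String[] args) {
        // 10-minute max lifetime, observed at t=60s: new pods still idle
        System.out.println(Arrays.toString(connectionsPerPod(50, 3, 6, 600, 60)));   // [17, 17, 16, 0, 0, 0]
        // Same pool at t=600s: connections have rotated across all six pods
        System.out.println(Arrays.toString(connectionsPerPod(50, 3, 6, 600, 600)));  // [9, 9, 8, 8, 8, 8]
        // No max lifetime, one hour later: still pinned to the original pods
        System.out.println(Arrays.toString(connectionsPerPod(50, 3, 6, 0, 3600)));   // [17, 17, 16, 0, 0, 0]
    }
}
```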

The fix: set a maximum connection lifetime. After this duration, the client closes the connection regardless of activity and opens a new one, which resolves to potentially different IPs:

// FAST: Connection max lifetime for load balancing compatibility
public class RotatingConnectionPool {

    // OkHttp configuration:
    public OkHttpClient createClient() {
        ConnectionPool pool = new ConnectionPool(
            50,                        // maxIdleConnections
            55, TimeUnit.SECONDS       // keepAliveDuration (idle timeout, below the 60s server timeout)
        );

        return new OkHttpClient.Builder()
            .connectionPool(pool)
            .connectTimeout(2, TimeUnit.SECONDS)
            .readTimeout(5, TimeUnit.SECONDS)
            // OkHttp does not have max lifetime natively.
            // Use an interceptor to track connection age:
            .addNetworkInterceptor(new ConnectionAgeInterceptor(
                Duration.ofMinutes(10)))  // Rotate after 10 minutes
            .build();
    }

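    // Custom interceptor that retires a connection once it exceeds max age.
    // It records when each connection was first seen and, past maxAge, adds
    // "Connection: close" to the request. This relies on OkHttp not returning
    // a connection to the pool after such a request.
    // Needs java.util.Collections, java.util.Map, java.util.WeakHashMap, okhttp3.Request.
    static class ConnectionAgeInterceptor implements Interceptor {
        private final Duration maxAge;
        // Weak keys let entries disappear once OkHttp discards the connection
        private final Map<Connection, Long> firstSeenNanos =
            Collections.synchronizedMap(new WeakHashMap<>());

        ConnectionAgeInterceptor(Duration maxAge) {
            this.maxAge = maxAge;
        }

        @Override
        public Response intercept(Chain chain) throws IOException {
            Request request = chain.request();
            Connection connection = chain.connection();
            if (connection != null) {
                long now = System.nanoTime();
                Long firstSeen = firstSeenNanos.putIfAbsent(connection, now);
                if (firstSeen != null && now - firstSeen > maxAge.toNanos()) {
                    // This is the last request on the connection
                    request = request.newBuilder()
                        .header("Connection", "close")
                        .build();
                }
            }
            return chain.proceed(request);
        }
    }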

    // Reactor Netty (Spring WebFlux) has built-in max lifetime:
    public ConnectionProvider reactorNettyProvider() {
        return ConnectionProvider.builder("rotating-pool")
            .maxConnections(50)
            .maxLifeTime(Duration.ofMinutes(10))  // Forces rotation
            .maxIdleTime(Duration.ofSeconds(55))
            .build();
    }
}

Choosing Max Lifetime

// Max lifetime selection criteria:
//
// Too short (< 1 min):
//   - Excessive handshake overhead
//   - TLS session cache less effective
//   - Metrics show high connection creation rate
//
// Too long (> 30 min):
//   - Scaling events take too long to rebalance
//   - Deployment traffic shift is slow
//   - Stale connections accumulate
//
// Recommended:
//   - Standard services: 5-10 minutes
//   - Services behind DNS load balancing: 2-5 minutes
//   - Services with frequent scaling: 1-2 minutes
//   - Stable services with client-side LB: 30 minutes
//
// Content platform choice: 10 minutes
//   Scaling events are rare (2-3x daily)
//   Deployments use rolling update (drain, not DNS switch)
//   10 minutes gives full rebalance within one rotation cycle
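The handshake side of this trade-off is easy to quantify. The sketch below uses hypothetical names and assumes a steady state in which each pooled connection is replaced exactly once per lifetime. The 50-connection pool and 833 requests/s are the figures used earlier in this section.

```java
import java.time.Duration;

/** Hypothetical helper estimating how often max lifetime forces a new handshake. */
final class LifetimeTradeoff {

    /** Steady-state reconnects per second: each connection is replaced once per lifetime. */
    static double reconnectsPerSecond(int poolSize, Duration maxLifetime) {
        return poolSize / (double) maxLifetime.getSeconds();
    }

    /** Fraction of requests that pay a handshake at the given request rate. */
    static double handshakeFraction(int poolSize, Duration maxLifetime, double requestsPerSecond) {
        return reconnectsPerSecond(poolSize, maxLifetime) / requestsPerSecond;
    }

    public static void main(String[] args) {
        for (long minutes : new long[] {1, 10, 30}) {
            Duration lifetime = Duration.ofMinutes(minutes);
            System.out.printf("%d min: %.3f reconnects/s, %.4f%% of requests%n",
                minutes,
                reconnectsPerSecond(50, lifetime),
                100 * handshakeFraction(50, lifetime, 833));
        }
    }
}
```

Under these assumptions a 10-minute lifetime costs roughly 0.08 reconnects per second, or about 0.01% of requests, and a 1-minute lifetime about ten times that. Handshake overhead only becomes significant well below one minute, so the upper bound is usually set by how quickly scaling events need to rebalance.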

Connection Warm-Up on Startup

The worst P99 latency spikes occur immediately after deployment. A fresh pod has zero established connections. The first N requests each pay full connection establishment cost:

// SLOW: Cold start without warm-up
// First 50 requests (filling the pool) each take 4.2ms extra
// P99 spike for first 5 seconds after deployment: 180ms
// Users see timeout errors during rolling deploy

// FAST: Pre-establish connections during startup, before accepting traffic
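@Component
public class ConnectionPoolWarmer {

    private static final Logger log = LoggerFactory.getLogger(ConnectionPoolWarmer.class);

    private final WebClient searchClient;
    private final WebClient recommendationClient;
    private final WebClient analyticsClient;
    private final WebClient imageClient;
    private final ReadinessController readiness;

    // Constructor injection; assumes the four WebClient beans are named after
    // these parameters (or carry matching @Qualifier annotations)
    public ConnectionPoolWarmer(WebClient searchClient,
                                WebClient recommendationClient,
                                WebClient analyticsClient,
                                WebClient imageClient,
                                ReadinessController readiness) {
        this.searchClient = searchClient;
        this.recommendationClient = recommendationClient;
        this.analyticsClient = analyticsClient;
        this.imageClient = imageClient;
        this.readiness = readiness;
    }

    // Runs on ApplicationReadyEvent. The pod stays out of rotation until
    // markReady() flips the custom readiness endpoint.
    @EventListener(ApplicationReadyEvent.class)
    public void warmConnectionPools() {
        log.info("Warming connection pools to downstream services");
        long start = System.nanoTime();

        List<CompletableFuture<Void>> warmups = List.of(
            warmService(searchClient, "search-service", "/health", 5),
            warmService(recommendationClient, "recommendation-service", "/health", 3),
            warmService(analyticsClient, "analytics-service", "/health", 2),
            warmService(imageClient, "image-service", "/health", 3)
        );

        try {
            // Wait for all warmups to complete (or time out after 10s)
            CompletableFuture.allOf(warmups.toArray(new CompletableFuture[0]))
                .orTimeout(10, TimeUnit.SECONDS)
                .join();
        } catch (CompletionException e) {
            // A slow or failed warm-up should not keep the pod out of rotation forever
            log.warn("Connection pool warm-up did not finish cleanly: {}", e.getMessage());
        } finally {
            readiness.markReady();
        }

        long elapsed = (System.nanoTime() - start) / 1_000_000;
        log.info("Connection pool warm-up completed in {}ms", elapsed);
    }

    private CompletableFuture<Void> warmService(
            WebClient client, String name, String healthPath, int connections) {
        // Issue the requests concurrently. Sequential requests would keep reusing
        // one pooled connection instead of opening `connections` of them.
        // (Over HTTP/2, concurrent requests may still share a connection.)
        return Flux.range(0, connections)
            .flatMap(i -> client.get()
                .uri(healthPath)
                .retrieve()
                .toBodilessEntity()
                .timeout(Duration.ofSeconds(3))
                .doOnError(e -> log.warn("Warm-up request {} to {} failed: {}",
                    i, name, e.getMessage()))
                .onErrorResume(e -> Mono.empty()), connections)
            .then()
            .doOnSuccess(v -> log.debug("Finished warm-up of up to {} connections to {}",
                connections, name))
            .toFuture();
    }
}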

Kubernetes Readiness Probe Integration

The warm-up must complete before the pod receives traffic. Use a readiness probe that only passes after connection pools are warm:

// Readiness endpoint that gates on warm-up completion
@RestController
public class ReadinessController {

    private final AtomicBoolean warmedUp = new AtomicBoolean(false);

    @GetMapping("/health/ready")
    public ResponseEntity<String> readiness() {
        if (!warmedUp.get()) {
            // Return 503 until warm-up completes
            // Kubernetes will not route traffic to this pod
            return ResponseEntity.status(503).body("warming up");
        }
        return ResponseEntity.ok("ready");
    }

    // Called by ConnectionPoolWarmer after successful warm-up
    public void markReady() {
        warmedUp.set(true);
    }
}

# Kubernetes deployment with warm-up-aware readiness probe:
spec:
  containers:
    - name: article-service
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5   # Allow JVM startup
        periodSeconds: 2         # Check every 2s during warmup
        failureThreshold: 30     # Failures before a Ready pod is marked NotReady (30 x 2s = 60s); readiness failures never restart the pod
        successThreshold: 1
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        failureThreshold: 3

Detecting Stale Connections

Connections can become stale without either side knowing: a firewall silently drops the connection, an intermediate load balancer times out, or a server process crashes without sending FIN. The client must validate connections before reuse:

// Apache HttpClient 5: Validate connections that have been idle > 2 seconds
ConnectionConfig config = ConnectionConfig.custom()
    .setValidateAfterInactivity(TimeValue.ofSeconds(2))
    .build();

// How validation works:
// 1. Connection has been idle for 3 seconds
// 2. Client wants to reuse it for a new request
// 3. Client checks: is the socket still connected? (non-blocking read)
// 4a. Socket returns data or EOF → connection is stale, discard
// 4b. Socket returns "no data available" → connection is alive, reuse

// The trade-off:
// - Validation adds ~50us per validated request (negligible)
// - Without validation: ~1% of requests hit stale connections
//   Each stale hit costs: detect RST + open new connection = 5-10ms
// - Net savings: 1% * 5ms = 50us average saved per request
//   Cost: at most 100% * 50us = 50us per request
//   (only connections idle longer than the threshold are checked,
//    so a busy pool pays less than this upper bound)
//   → Break-even at 1% stale rate (common in cloud environments)
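The break-even point above follows from comparing two expected costs. The sketch below is only that arithmetic, with hypothetical names. It assumes the worst case in which every request pays the validation cost.

```java
/** Hypothetical helper for the validation trade-off; all values are in microseconds. */
final class StaleCheckTradeoff {

    /** Expected extra latency per request when stale connections are only found on send. */
    static double costWithoutValidationMicros(double staleRate, double stalePenaltyMicros) {
        return staleRate * stalePenaltyMicros;
    }

    /** Stale rate at which validating every reuse costs the same as not validating. */
    static double breakEvenStaleRate(double validationCostMicros, double stalePenaltyMicros) {
        return validationCostMicros / stalePenaltyMicros;
    }

    public static void main(String[] args) {
        // 50us validation cost, 5ms (5000us) penalty per stale hit
        System.out.println(breakEvenStaleRate(50, 5_000));  // 1% stale rate
        // With the 10ms penalty from the upper end of the range, break-even drops to 0.5%
        System.out.println(breakEvenStaleRate(50, 10_000));
    }
}
```

Average latency is not the whole picture. Validation also moves the cost away from the unlucky 1% of requests, which is what matters for tail latency.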

HTTP/2 Multiplexing: One Connection, Many Streams

HTTP/2 changes the pooling math. A single TCP connection multiplexes many concurrent streams. The server advertises its limit via SETTINGS_MAX_CONCURRENT_STREAMS, and 100 is a common value. For the content platform:

// HTTP/2 connection requirements:
//
// Traffic: 833 requests/s to search service
// Avg latency: 12ms
// Concurrent requests: 833 * 0.012 = 10 (average)
// Peak concurrent: ~30 (P99 burst)
// HTTP/2 streams per connection: 100 (server-configured)
//
// Required connections: ceil(30 / 100) = 1
// With redundancy (connection close/GOAWAY handling): 2-3
//
// Compare to HTTP/1.1: needed 20 connections for same throughput
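The HTTP/2 requirement above reduces to a ceiling division plus a floor for redundancy. A minimal sketch, with hypothetical names and the figures from the comments above as inputs:

```java
/** Hypothetical helper for HTTP/2 connection counts. */
final class Http2PoolSizing {

    /**
     * Connections needed to carry peakConcurrentRequests when each connection
     * multiplexes up to maxConcurrentStreams, with a floor of minConnections
     * for redundancy during GOAWAY or connection close.
     */
    static int requiredConnections(int peakConcurrentRequests, int maxConcurrentStreams,
                                   int minConnections) {
        if (maxConcurrentStreams <= 0) {
            throw new IllegalArgumentException("maxConcurrentStreams must be positive");
        }
        int forCapacity = (peakConcurrentRequests + maxConcurrentStreams - 1) / maxConcurrentStreams;
        return Math.max(forCapacity, minConnections);
    }

    public static void main(String[] args) {
        System.out.println(requiredConnections(30, 100, 1)); // prints 1
        System.out.println(requiredConnections(30, 100, 2)); // prints 2
    }
}
```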

// Spring WebFlux WebClient with HTTP/2:
@Bean
public WebClient searchServiceClient() {
    HttpClient nettyClient = HttpClient.create(
            ConnectionProvider.builder("search-h2")
                .maxConnections(5)         // 5 connections * 100 streams = 500 concurrent
                .maxLifeTime(Duration.ofMinutes(10))
                .maxIdleTime(Duration.ofSeconds(55))
                .build()
        )
        .protocol(HttpProtocol.H2)         // Force HTTP/2
        .option(ChannelOption.TCP_NODELAY, true)
        .responseTimeout(Duration.ofSeconds(5));

    return WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(nettyClient))
        .baseUrl("https://search-service.default.svc.cluster.local:8443")
        .build();
}

GOAWAY Handling

HTTP/2 servers send GOAWAY frames when they want to gracefully close a connection (during shutdown, after max requests, or after max age). The client must handle GOAWAY without failing in-flight requests:

// GOAWAY behavior:
// 1. Server sends GOAWAY with last-stream-id
// 2. Client MUST NOT send new requests on this connection
// 3. Client CAN finish requests with stream-id <= last-stream-id
// 4. Requests with stream-id > last-stream-id were not processed and can be retried safely
// 5. Client opens new connection for subsequent requests
//
// Reactor Netty handles this automatically:
// - In-flight requests complete normally
// - New requests go to a fresh connection
// - No application-level error handling needed
//
// Potential latency spike: the first request after GOAWAY
// establishes a new connection (TCP + TLS = 1-2ms in datacenter)
//
// Mitigation: maintain minimum 2 connections per service
// When one receives GOAWAY, the other handles traffic while replacement connects
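The GOAWAY rules above amount to splitting in-flight streams by stream ID. The sketch below is not how Reactor Netty implements it. It is a minimal illustration with hypothetical names of the decision a client makes for each in-flight stream when GOAWAY arrives.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of how a client sorts in-flight streams on GOAWAY. */
final class GoAwayTriage {
    /** Streams the server may still complete on the old connection. */
    final List<Integer> mayFinish;
    /** Streams the server did not process; safe to retry on a new connection. */
    final List<Integer> retryOnNewConnection;

    private GoAwayTriage(List<Integer> mayFinish, List<Integer> retryOnNewConnection) {
        this.mayFinish = Collections.unmodifiableList(mayFinish);
        this.retryOnNewConnection = Collections.unmodifiableList(retryOnNewConnection);
    }

    static GoAwayTriage onGoAway(List<Integer> inFlightStreamIds, int lastStreamId) {
        List<Integer> finish = new ArrayList<>();
        List<Integer> retry = new ArrayList<>();
        for (int id : inFlightStreamIds) {
            if (id <= lastStreamId) {
                finish.add(id);   // at or below last-stream-id: let it complete
            } else {
                retry.add(id);    // above last-stream-id: never processed by the server
            }
        }
        return new GoAwayTriage(finish, retry);
    }
}
```

For example, with client streams 1, 3, 5, 7, and 9 in flight and a GOAWAY carrying last-stream-id 5, streams 1, 3, and 5 may finish on the old connection while 7 and 9 are retried on a new one.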

End-to-End Benchmark: Pool Configuration Impact

# Locust load test comparing connection pool strategies
from locust import HttpUser, task, between, events
import time

class WarmPoolUser(HttpUser):
    """Simulates article service with properly configured connection pool"""
    wait_time = between(0.05, 0.2)
    host = "http://article-service:8080"

    def on_start(self):
        """Warm the connection pool before generating load"""
        for _ in range(5):
            self.client.get("/health")

    @task
    def render_article(self):
        self.client.get("/api/articles/random/render",
                       name="GET /api/articles/:id/render")

class ColdPoolUser(HttpUser):
    """Simulates poorly configured client that creates new connections"""
    wait_time = between(0.05, 0.2)
    host = "http://article-service:8080"

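    @task
    def render_article(self):
        import requests
        # Force a new connection on every request (worst case): a fresh
        # Session has an empty pool, so each call pays the full handshake
        start = time.perf_counter()
        response_length = 0
        exception = None
        try:
            with requests.Session() as s:
                # timeout=5 is an assumed value so hung connections surface as errors
                r = s.get(f"{self.host}/api/articles/random/render", timeout=5)
            response_length = len(r.content)
            if not r.ok:
                exception = Exception(f"HTTP {r.status_code}")
        except requests.RequestException as e:
            # Record connection errors and timeouts as failures instead of crashing the task
            exception = e
        events.request.fire(
            request_type="GET",
            name="GET /api/articles/:id/render (cold)",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=response_length,
            exception=exception,
        )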

# Results at 500 RPS sustained for 5 minutes:
#
# Warm pool (reused connections):
#   P50:  14ms
#   P95:  22ms
#   P99:  38ms
#   Errors: 0%
#   Connections created: 20 (initial) + 4 (rotation)
#
# Cold pool (new connection per request):
#   P50:  18ms
#   P95:  52ms
#   P99:  180ms (handshake spikes)
#   Errors: 0.3% (connection timeouts)
#   Connections created: 150,000 (one per request)
#
# Connection reuse removed the handshake-driven P99 spikes.
# The 4.7x P99 improvement (180ms -> 38ms) came from removing handshake variance.

Connection pooling is not optional for high-throughput services. In the benchmark, the warm-pool client opened 24 connections in total (20 at startup plus 4 from rotation) and reused them for every request, against 150,000 connections for the cold client. The P99 improvement from 180ms to 38ms directly translates to better user experience on article load and higher crawl rates from search engines that respect latency budgets.