surviving the spike

100k Concurrent Connections: Memory, File Descriptors, and Redis Pub/Sub

11 min read Chapter 36 of 66

The Symptom

The ride-hailing platform’s SSE driver location service runs on 4 pods. Each pod holds 12,000 connections. At 48,000 total connections, the service handles Friday evening peak. Then the product team launches in a second city. The connection count climbs to 65,000. Pod 3 crashes with java.io.IOException: Too many open files. Pod 2 follows 40 seconds later. The remaining pods absorb the reconnecting clients from the crashed pods, climb to 32,000 connections each, and run out of memory. All four pods restart within 90 seconds. 65,000 riders lose their driver tracking.

The pod’s file descriptor limit was 65,536 (the Linux default). Each SSE connection holds one file descriptor. The pod also needs file descriptors for Redis connections, log files, JVM internals, and the Netty event loop. At 12,000 SSE connections, the pod was at 12,400 total file descriptors. At 17,000 connections (after absorbing a crashed pod’s clients), it crossed the limit.

The Cause

Three resources constrain concurrent connection count: file descriptors, memory, and CPU. For SSE connections streaming driver locations, memory is the binding constraint.

Memory per SSE connection

Each SSE connection in Spring WebFlux holds:

Component                          Memory
Netty Channel + pipeline           ~2.5KB
Reactor Flux subscription          ~1.2KB
SSE output buffer (default 8KB)    ~8.0KB
DriverLocation object              ~0.3KB
Redis subscription reference       ~0.1KB
Micrometer metrics tags            ~0.2KB
                            Total: ~12.3KB

The SSE output buffer dominates. Netty allocates an 8KB direct byte buffer per channel for outgoing data. For 100,000 connections:

$$100{,}000 \times 12.3\text{KB} = 1.23\text{GB}$$

Add JVM overhead (metaspace, GC metadata, thread stacks for the Netty event loop):

Figure: Memory allocation breakdown per pod. SSE connections dominate at 1.23GB; JVM overhead, Redis, Netty, and GC headroom bring the total to 2.3GB.

SSE connections consume over half of that 2.3GB budget. Running at 100% of it leaves no room for allocation spikes, so a burst during garbage collection can trigger an OOM kill. The 30% headroom is not optional. At a 70% utilization target, holding 100,000 connections on a single pod requires about 3.3GB of allocated memory (2.3GB / 0.7).
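The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not code from the service: the class and method names are hypothetical, the 12.3KB per-connection figure and the 2.3GB total come from the text, and decimal units (1GB = 1,000,000KB) are assumed to match the formula above.

```java
// Hypothetical helper: names are illustrative, figures come from the tables in this chapter.
final class MemoryBudget {
    // Per-connection memory from the component table (KB, decimal units).
    static final double PER_CONNECTION_KB = 12.3;

    /** Memory held by the SSE connections alone, in GB (1 GB = 1,000,000 KB). */
    static double connectionMemoryGb(int connections) {
        return connections * PER_CONNECTION_KB / 1_000_000.0;
    }

    /** Memory to allocate so that expected usage sits at the target utilization. */
    static double allocationGb(double expectedUsageGb, double targetUtilization) {
        return expectedUsageGb / targetUtilization;
    }

    public static void main(String[] args) {
        double connections = connectionMemoryGb(100_000); // 1.23 GB of connection state
        double allocation = allocationGb(2.3, 0.70);      // roughly 3.29 GB to allocate
        System.out.printf("connections: %.2f GB, allocate: %.2f GB%n", connections, allocation);
    }
}
```

Changing the connection count or the utilization target shows how quickly the allocation grows, which is the argument for several smaller pods.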

The practical limit: run 5 pods at 20,000 connections each rather than 2 pods at 50,000 each. Smaller pods are cheaper to replace, faster to drain during deployments, and less damaging when one crashes.

File descriptors

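Linux processes commonly start with a soft file descriptor limit of 1,024; the hard limit on these pods is 65,536. Each SSE connection consumes one file descriptor, and the pod needs additional descriptors for Redis connections, log files, JVM internals, and the Netty event loop.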

Figure: File descriptor allocation. 20,424 total FDs are needed, 98% of them for SSE connections; the default soft limit of 1,024 is far below the requirement.

The default soft limit of 1,024 file descriptors will fail the pod before it reaches 1,000 SSE connections. The hard limit of 65,536 provides enough headroom for 20,000 connections, but the soft limit must be raised explicitly. SSE connections account for 98% of all file descriptors; every other source combined is a rounding error. Raise both limits to 131,072, the target-state value:

# /etc/security/limits.conf (on the container base image)
*    soft    nofile    131072
*    hard    nofile    131072

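The Kubernetes pod security context has no field for nofile. It can only set sysctls, such as the TCP listen backlog that also matters at this connection count (net.core.somaxconn is classed as an unsafe sysctl, so the kubelet must allow it with --allowed-unsafe-sysctls):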

securityContext:
  sysctls:
    - name: net.core.somaxconn
      value: "32768"

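The same limits can be baked into the image. Note that limits.conf is read by PAM at login, so a process started directly by the container runtime inherits the runtime's nofile limit instead (for example, docker run --ulimit nofile=131072:131072):

# In the Dockerfile
RUN echo "* soft nofile 131072" >> /etc/security/limits.conf && \
    echo "* hard nofile 131072" >> /etc/security/limits.conf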
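Whichever mechanism sets the limit, it is worth checking what the JVM actually sees at runtime. The following is a minimal sketch, assuming a HotSpot-style JDK on a Unix-like host where the `com.sun.management.UnixOperatingSystemMXBean` interface is available; the `FdHeadroom` class and its `utilization` method are hypothetical names for illustration.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical diagnostic: reports how close the process is to its descriptor limit.
final class FdHeadroom {
    /** Fraction of the descriptor limit in use; can exceed 1.0 if demand is above the limit. */
    static double utilization(long open, long max) {
        return max <= 0 ? 1.0 : (double) open / max;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount(); // descriptors currently open
            long max = unix.getMaxFileDescriptorCount();   // limit as seen by this process
            System.out.printf("fds open=%d max=%d utilization=%.1f%%%n",
                open, max, 100 * utilization(open, max));
        } else {
            System.out.println("File descriptor counts are not exposed on this platform");
        }
    }
}
```

Exporting the same two values as gauges and alerting well below 100% would have flagged the incident before the first Too many open files error.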

The Baseline

Current state before scaling work:

Metric                    Value
Pods                      4
Connections per pod       12,000 (peak)
Total connections         48,000
Memory per pod            1.5GB allocated, 1.2GB used
File descriptor limit     65,536 (hard)
File descriptors used     12,400
Redis Pub/Sub channels    ~8,000 (unique drivers being tracked)
Pod restart recovery      60-90 seconds

Target state:

Metric                    Value
Pods                      5
Connections per pod       20,000 (target, 25,000 limit)
Total connections         100,000
Memory per pod            3.5GB allocated
File descriptor limit     131,072
Redis Pub/Sub channels    ~15,000
Pod restart recovery      < 10 seconds (graceful drain)

The Fix

Redis Pub/Sub for cross-instance broadcasting

When a driver sends a location update, it hits one pod. The 500 riders tracking that driver are distributed across all 5 pods. Redis Pub/Sub broadcasts the update:

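// SCALED: Driver location ingestion with Redis Pub/Sub broadcast
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class DriverLocationIngestController {

    private final ReactiveRedisTemplate<String, DriverLocation> redisTemplate;
    private final Counter published;

    public DriverLocationIngestController(
            ReactiveRedisTemplate<String, DriverLocation> redisTemplate,
            MeterRegistry meterRegistry) {
        this.redisTemplate = redisTemplate;
        // One untagged counter: a per-driver tag would create a time series per driver ID
        this.published = meterRegistry.counter("driver.location.published");
    }

    @PostMapping("/api/drivers/location")
    public Mono<Void> updateLocation(@RequestBody DriverLocation location) {
        String channel = "driver:location:" + location.driverId();

        // convertAndSend emits the number of subscribers that received the message
        return redisTemplate.convertAndSend(channel, location)
            .doOnSuccess(receivers -> published.increment())
            .then();
    }
}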

Each pod subscribes to Redis channels for the drivers its local riders are tracking:

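// SCALED: Per-pod Redis subscription management
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.data.redis.connection.ReactiveSubscription;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.stereotype.Service;
import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.FluxSink;

@Service
public class DriverLocationSubscriptionManager {

    private final ReactiveRedisTemplate<String, DriverLocation> redisTemplate;
    private final ConcurrentHashMap<String, Disposable> subscriptions =
        new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Set<FluxSink<DriverLocation>>> listeners =
        new ConcurrentHashMap<>();

    public DriverLocationSubscriptionManager(
            ReactiveRedisTemplate<String, DriverLocation> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public Flux<DriverLocation> subscribe(String driverId) {
        return Flux.create(sink -> {
            // compute() serializes add and remove for the same driver, so a new
            // listener cannot be orphaned by a concurrent last-listener cleanup
            listeners.compute(driverId, (id, existing) -> {
                Set<FluxSink<DriverLocation>> sinks = existing != null
                    ? existing
                    : ConcurrentHashMap.<FluxSink<DriverLocation>>newKeySet();
                sinks.add(sink);
                // Subscribe to the Redis channel if this is the first listener for this driver
                subscriptions.computeIfAbsent(id, this::openRedisSubscription);
                return sinks;
            });

            sink.onDispose(() -> listeners.computeIfPresent(driverId, (id, sinks) -> {
                sinks.remove(sink);
                if (!sinks.isEmpty()) return sinks;
                // Last listener gone: unsubscribe from Redis and drop the map entry
                Disposable sub = subscriptions.remove(id);
                if (sub != null) sub.dispose();
                return null;
            }));
        });
    }

    // Fans each Redis message out to every local sink registered for the driver
    private Disposable openRedisSubscription(String driverId) {
        return redisTemplate.listenToChannel("driver:location:" + driverId)
            .map(ReactiveSubscription.Message::getMessage)
            .subscribe(location -> {
                Set<FluxSink<DriverLocation>> sinks = listeners.get(driverId);
                if (sinks != null) {
                    sinks.forEach(s -> s.next(location));
                }
            });
    }
}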

The subscription manager tracks how many local riders are watching each driver. When the first rider on this pod starts tracking driver-123, the pod subscribes to driver:location:driver-123 on Redis. When the last rider stops tracking, the pod unsubscribes. This prevents subscribing to channels nobody on this pod cares about.
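The first-listener and last-listener rule can be isolated from Spring and Redis to make it easier to reason about. The sketch below is an assumption-laden illustration: `RefCountedChannels` is a hypothetical class, and the two counters merely stand in for real Redis subscribe and unsubscribe calls. It uses `ConcurrentHashMap.compute` so that the count change and the subscribe or unsubscribe decision happen atomically per driver.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, JDK-only model of per-driver reference counting.
final class RefCountedChannels {
    private final ConcurrentHashMap<String, Integer> listenerCounts = new ConcurrentHashMap<>();
    final AtomicInteger redisSubscribes = new AtomicInteger();   // stands in for a real subscribe
    final AtomicInteger redisUnsubscribes = new AtomicInteger(); // stands in for a real unsubscribe

    void addListener(String driverId) {
        listenerCounts.compute(driverId, (id, count) -> {
            if (count == null) {
                redisSubscribes.incrementAndGet(); // first local listener for this driver
                return 1;
            }
            return count + 1;
        });
    }

    void removeListener(String driverId) {
        listenerCounts.computeIfPresent(driverId, (id, count) -> {
            if (count == 1) {
                redisUnsubscribes.incrementAndGet(); // last local listener left
                return null;                         // returning null removes the entry
            }
            return count - 1;
        });
    }

    int activeChannels() {
        return listenerCounts.size();
    }
}
```

Two riders tracking driver-123 on the same pod produce one subscribe; only the second disconnect produces the unsubscribe.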

The fan-out problem

One driver location update fans out to every rider tracking that driver. During a concert let-out, 500 riders might be tracking each driver in the area. A single driver update publishes to Redis once. Redis delivers it to each of the 5 pods, and each pod pushes it to its ~100 local riders tracking that driver.

The math:

1 driver update → 1 Redis publish
1 Redis publish → 5 pod deliveries (1 per subscriber)
5 pod deliveries → 500 SSE writes (100 per pod)

Rate: 3 updates/second/driver × 200 active drivers in zone
     = 600 Redis publishes/second
     = 3,000 pod deliveries/second
     = 300,000 SSE writes/second across all pods

300,000 SSE writes per second is the peak fan-out during a concert let-out. Each write is ~200 bytes. Total throughput: 60MB/s across 5 pods, 12MB/s per pod. Netty handles this on 4 event loop threads.
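The fan-out figures can be reproduced with a small calculation, which also makes it easy to try other zone sizes. This is a sketch only: `FanOutEstimate` and its method names are hypothetical, the inputs are the numbers quoted above, and decimal megabytes are assumed.

```java
// Hypothetical calculator for the fan-out chain: publish -> pod delivery -> SSE write.
final class FanOutEstimate {
    static long redisPublishesPerSec(int updatesPerSecPerDriver, int activeDrivers) {
        return (long) updatesPerSecPerDriver * activeDrivers;
    }

    static long podDeliveriesPerSec(long publishesPerSec, int pods) {
        return publishesPerSec * pods; // every pod is subscribed to every driver in the zone
    }

    static long sseWritesPerSec(long podDeliveriesPerSec, int localRidersPerDriver) {
        return podDeliveriesPerSec * localRidersPerDriver;
    }

    static double megabytesPerSecPerPod(long sseWritesPerSec, int bytesPerWrite, int pods) {
        return sseWritesPerSec * (double) bytesPerWrite / 1_000_000.0 / pods;
    }

    public static void main(String[] args) {
        long publishes = redisPublishesPerSec(3, 200);       // 600
        long deliveries = podDeliveriesPerSec(publishes, 5); // 3,000
        long writes = sseWritesPerSec(deliveries, 100);      // 300,000
        double perPod = megabytesPerSecPerPod(writes, 200, 5);
        System.out.println(publishes + " publishes/s, " + deliveries + " deliveries/s, "
            + writes + " writes/s, " + perPod + " MB/s per pod");
    }
}
```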

Redis Pub/Sub limitation: messages are fire-and-forget. If a pod is slow processing messages, Redis does not buffer them. For location updates, a missed message means the rider’s map shows a position that is 300ms stale. The next update corrects it. For ride acceptance messages, this is unacceptable, which is why ride acceptance uses a separate WebSocket connection with application-level acknowledgments and a database-backed retry queue.

Connection draining during deployments

A rolling deployment terminates pods one at a time. Each pod holds 20,000 SSE connections. If Kubernetes kills the pod immediately, 20,000 riders lose their driver tracking and reconnect to the remaining pods, spiking their connection count by 5,000 each.

Graceful draining sends a custom SSE event telling clients to reconnect before the pod terminates:

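// SCALED: Graceful connection draining on shutdown
import jakarta.annotation.PreDestroy; // javax.annotation.PreDestroy on Spring Boot 2.x
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.stereotype.Component;
import reactor.core.publisher.FluxSink;

@Component
public class ConnectionDrainer {

    private static final Logger log = LoggerFactory.getLogger(ConnectionDrainer.class);

    // A concurrent set gives O(1) add/remove; CopyOnWriteArrayList would copy
    // a 20,000-entry array on every connect and disconnect
    private final Set<FluxSink<ServerSentEvent<String>>> activeSinks =
        ConcurrentHashMap.newKeySet();

    public void registerSink(FluxSink<ServerSentEvent<String>> sink) {
        activeSinks.add(sink);
        sink.onDispose(() -> activeSinks.remove(sink));
    }

    @PreDestroy
    public void drainConnections() {
        log.info("Draining {} SSE connections", activeSinks.size());

        ServerSentEvent<String> reconnectEvent = ServerSentEvent.<String>builder()
            .event("reconnect")
            .data("server-shutdown")
            .retry(Duration.ofMillis(100))
            .build();

        // Send reconnect event to all connections
        activeSinks.forEach(sink -> {
            try {
                sink.next(reconnectEvent);
            } catch (Exception e) {
                // Connection already closed
            }
        });

        // Wait for clients to reconnect to other pods
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // Close remaining connections
        activeSinks.forEach(FluxSink::complete);
        log.info("Connection drain complete");
    }
}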

The client handles the reconnect event:

source.addEventListener("reconnect", () => {
  source.close();
  // Small random delay to prevent thundering herd
  const delay = Math.random() * 2000;
  setTimeout(() => connectToDriver(driverId), delay);
});

The Math.random() * 2000 spreads reconnections over 2 seconds. Without jitter, 20,000 clients reconnect simultaneously, overwhelming the remaining pods. The retry(Duration.ofMillis(100)) in the SSE event is a fallback: if the client does not handle the reconnect event, the browser’s built-in EventSource reconnection fires 100ms after the server closes the stream.
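The effect of the jitter window can be estimated with a short simulation. This is an illustrative sketch, not a measurement: `ReconnectJitter` is a hypothetical class, the 100ms bucket size and the fixed seed are arbitrary assumptions, and a window of 1ms stands in for the no-jitter case.

```java
import java.util.Random;

// Hypothetical simulation: how many clients reconnect in each time bucket.
final class ReconnectJitter {
    /** Counts reconnects per bucket when each client waits a uniform delay in [0, windowMs). */
    static int[] histogram(int clients, int windowMs, int bucketMs, long seed) {
        Random random = new Random(seed);
        int[] buckets = new int[(windowMs + bucketMs - 1) / bucketMs];
        for (int i = 0; i < clients; i++) {
            int delayMs = random.nextInt(windowMs);
            buckets[delayMs / bucketMs]++;
        }
        return buckets;
    }

    /** Largest number of reconnects landing in any single bucket. */
    static int peak(int[] buckets) {
        int max = 0;
        for (int count : buckets) {
            max = Math.max(max, count);
        }
        return max;
    }

    public static void main(String[] args) {
        int withJitter = peak(histogram(20_000, 2_000, 100, 42L));
        int withoutJitter = peak(histogram(20_000, 1, 100, 42L));
        System.out.println("peak per 100ms with jitter: " + withJitter
            + ", without jitter: " + withoutJitter);
    }
}
```

With a 2-second window the 20,000 reconnects spread over twenty 100ms buckets of about 1,000 each, instead of all landing in one.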

Kubernetes deployment manifest

# SCALED: Kubernetes Deployment for SSE driver location service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: driver-location-sse
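spec:
  replicas: 5
  selector:
    matchLabels:
      app: driver-location-sse
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # Never reduce capacity during deploy
  template:
    metadata:
      labels:
        app: driver-location-sse # Must match the selector above and the PodDisruptionBudget
    spec: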
      terminationGracePeriodSeconds: 30
      containers:
        - name: sse-service
          image: ridehail/driver-location-sse:latest
          resources:
            requests:
              memory: "2.5Gi"
              cpu: "2"
            limits:
              memory: "3.5Gi"
              cpu: "4"
          env:
            - name: JAVA_OPTS
              value: >-
                -Xmx2g -Xms2g
                -XX:MaxDirectMemorySize=512m
                -XX:+UseZGC
                -XX:+ZGenerational
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "curl -s -X POST localhost:8080/actuator/drain && sleep 10"
      initContainers:
        - name: sysctl-init
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              sysctl -w net.core.somaxconn=32768
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"
          securityContext:
            privileged: true
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: driver-location-sse-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: driver-location-sse

Key configuration decisions:

  1. maxUnavailable: 0 ensures no capacity reduction during deploys. The new pod starts, passes readiness, and absorbs connections before the old pod drains.

  2. terminationGracePeriodSeconds: 30 gives 30 seconds for the whole shutdown: the preStop hook (drain call plus a 10-second sleep), then the in-process drain (send reconnect events, wait 5 seconds, close connections), then JVM shutdown.

  3. -XX:+UseZGC -XX:+ZGenerational uses ZGC for sub-millisecond GC pauses. With 20,000 live connections on a pod, a stop-the-world GC pause would freeze all 20,000 streams at once. ZGC’s concurrent collection avoids this. Generational mode requires JDK 21 or later.

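  4. -XX:MaxDirectMemorySize=512m caps Netty’s direct byte buffers. Without this, the direct memory ceiling defaults to roughly the maximum heap size (2GB here), so heap plus direct buffers could exceed the 3.5Gi container limit and the pod would be OOM-killed outside the JVM heap.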

  5. PodDisruptionBudget: minAvailable: 4 prevents cluster operations (node drain, scaling) from killing more than one pod at a time. At 5 replicas, one pod down means 20,000 connections redistribute to 4 pods, which puts each at the 25,000-per-pod limit. Two pods down means 40,000 connections redistribute to 3 pods, about 33,000 each, which is well over that limit.

  6. initContainers sysctl raises somaxconn (TCP connection backlog) and expands the ephemeral port range. The default somaxconn of 128 (on kernels before 5.4; 4,096 since) causes connection drops during reconnection storms.
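The PodDisruptionBudget reasoning in item 5 is a one-line calculation, sketched here under the assumption that clients rebalance evenly across surviving pods. `DisruptionMath` and `perSurvivingPod` are hypothetical names for illustration.

```java
// Hypothetical helper: load per pod after some replicas are lost.
final class DisruptionMath {
    /** Connections each surviving pod holds after podsDown pods fail and clients rebalance evenly. */
    static int perSurvivingPod(int totalConnections, int replicas, int podsDown) {
        int survivors = replicas - podsDown;
        if (survivors <= 0) {
            throw new IllegalArgumentException("no surviving pods");
        }
        return (int) Math.ceil((double) totalConnections / survivors);
    }

    public static void main(String[] args) {
        int limit = 25_000; // per-pod connection limit from the target-state table
        for (int down = 0; down <= 2; down++) {
            int load = perSurvivingPod(100_000, 5, down);
            System.out.println(down + " pods down: " + load + " per pod, within limit: " + (load <= limit));
        }
    }
}
```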

Locust simulation: 100k SSE connections

# load-tests/sse_100k_locustfile.py
import json

import requests
import sseclient  # the sseclient-py package: SSEClient(response).events()
from locust import User, task, between


class SSEUser(User):
    wait_time = between(60, 120)  # Pause after a stream ends before reconnecting

    def on_start(self):
        self.driver_id = f"driver-{self.environment.runner.user_count % 500}"
        self.connect()

    def report(self, length=0, exception=None):
        # Pushed events have no request/response pair, so response_time stays 0
        self.environment.events.request.fire(
            request_type="SSE",
            name="/sse/driver/location",
            response_time=0,
            response_length=length,
            exception=exception,
            context={},
        )

    def connect(self):
        response = None
        try:
            url = f"{self.host}/api/sse/drivers/{self.driver_id}/location"
            response = requests.get(url, stream=True, timeout=120)
            response.raise_for_status()
            for event in sseclient.SSEClient(response).events():
                if event.event == "location":
                    json.loads(event.data)  # Raises if the payload is not valid JSON
                    self.report(length=len(event.data))
                elif event.event == "reconnect":
                    break  # Server is draining; the task loop reconnects
        except Exception as e:
            self.report(exception=e)
        finally:
            if response is not None:
                response.close()

    @task
    def maintain_connection(self):
        # Reconnect if disconnected
        self.connect()

Run with distributed Locust across 10 workers:

# Master
locust -f load-tests/sse_100k_locustfile.py \
    --master \
    --host=http://sse-service.ridehail.svc.cluster.local:8080

# Workers (10 instances, each manages 10k connections)
locust -f load-tests/sse_100k_locustfile.py \
    --worker \
    --master-host=locust-master

10 workers, each holding 10,000 SSE connections, simulating 100,000 concurrent riders tracking 500 unique drivers.

The Proof

Resource consumption at 100,000 concurrent SSE connections across 5 pods:

Metric                  Per Pod       Total (5 pods)
SSE connections         20,000        100,000
Memory used             2.1GB         10.5GB
Memory allocated        3.5GB         17.5GB
CPU (avg)               1.8 cores     9 cores
CPU (peak fan-out)      3.2 cores     16 cores
File descriptors        20,450        102,250
Redis Pub/Sub channels  15,000        15,000 (shared)
Event delivery p99      38ms          38ms

Comparison with the polling baseline:

Metric                    Polling (25k RPS)        SSE (100k conn)    Delta
CPU                       12 cores                 9 cores            -25%
Memory                    8GB (JVM + conn pool)    17.5GB             +119%
Bandwidth                 2.8TB/month              180GB/month        -94%
Update latency (p99)      2,000ms                  38ms               -98%
Active data efficiency    15%                      100%               +567%

SSE uses more memory because it holds connections open. It uses less CPU because there is no per-request overhead. It uses far less bandwidth because updates only send when data changes. The update latency drops from 2 seconds (poll interval) to 38ms (Redis Pub/Sub propagation + SSE write).

The memory increase is the trade. 17.5GB of RAM costs $35/month on cloud infrastructure. The 94% bandwidth reduction saves $280/month. The 98% latency improvement is worth more than both: riders see drivers moving in real time, not jumping between positions every 2 seconds.

Deployment drain test: terminate one pod during peak load. Before drain logic, 20,000 connections reconnect within 100ms and spike the remaining pods by 5,000 connections each, a thundering herd. After drain logic with jitter, connections redistribute over 2 seconds. No pod exceeds 22,000 connections during the transition. No rider sees more than 200ms of tracking interruption.