100k Concurrent Connections: Memory, File Descriptors, and Redis Pub/Sub
100k Concurrent Connections: Memory, File Descriptors, and Redis Pub/Sub
The Symptom
The ride-hailing platform’s SSE driver location service runs on 4 pods. Each pod holds 12,000 connections. At 48,000 total connections, the service handles Friday evening peak. Then the product team launches in a second city. The connection count climbs to 65,000. Pod 3 crashes with java.io.IOException: Too many open files. Pod 2 follows 40 seconds later. The remaining pods absorb the reconnecting clients from the crashed pods, climb to 32,000 connections each, and run out of memory. All four pods restart within 90 seconds. 65,000 riders lose their driver tracking.
The pod’s file descriptor limit was 65,536 (the Linux default). Each SSE connection holds one file descriptor. The pod also needs file descriptors for Redis connections, log files, JVM internals, and the Netty event loop. At 12,000 SSE connections, the pod was at 12,400 total file descriptors. At 17,000 connections (after absorbing a crashed pod’s clients), it crossed the limit.
The Cause
Three resources constrain concurrent connection count: file descriptors, memory, and CPU. For SSE connections streaming driver locations, memory is the binding constraint.
Memory per SSE connection
Each SSE connection in Spring WebFlux holds:
Component Memory
Netty Channel + pipeline ~2.5KB
Reactor Flux subscription ~1.2KB
SSE output buffer (default 8KB) ~8.0KB
DriverLocation object ~0.3KB
Redis subscription reference ~0.1KB
Micrometer metrics tags ~0.2KB
Total: ~12.3KB
The SSE output buffer dominates. Netty allocates an 8KB direct byte buffer per channel for outgoing data. For 100,000 connections:
$$100{,}000 \times 12.3\text{KB} = 1.23\text{GB}$$
Add JVM overhead (metaspace, GC metadata, thread stacks for the Netty event loop):
SSE connections consume over half the pod’s memory budget. The 30% GC headroom is not optional — without it, a minor allocation spike during garbage collection will trigger an OOM kill. At 70% utilization target, each pod needs 3.3GB allocated to safely hold 20,000 connections.
To hold 100,000 connections on a single pod, the pod needs at least 2.3GB of heap. Running at 100% capacity with no headroom guarantees OOM on the next GC pause. Target 70% utilization: 100,000 connections requires 3.3GB allocated memory.
The practical limit: run 5 pods at 20,000 connections each rather than 2 pods at 50,000 each. Smaller pods are cheaper to replace, faster to drain during deployments, and less damaging when one crashes.
File descriptors
Linux processes have a default file descriptor limit of 1,024 (soft) and 65,536 (hard). Each SSE connection consumes one file descriptor. The pod needs additional descriptors for:
The default Linux soft limit of 1,024 file descriptors will kill the pod at just 974 SSE connections. The hard limit of 65,536 provides enough headroom, but the soft limit must be raised explicitly. SSE connections account for 98% of all file descriptors — every other source combined is a rounding error.
The soft limit of 1,024 will kill the pod at 974 SSE connections. The hard limit of 65,536 provides headroom for 20,000 connections. Set the soft limit to match the hard limit:
# /etc/security/limits.conf (on the container base image)
* soft nofile 131072
* hard nofile 131072
Or in the Kubernetes pod security context:
securityContext:
sysctls:
- name: net.core.somaxconn
value: "32768"
With the container runtime setting:
# In the Dockerfile
RUN echo "* soft nofile 131072" >> /etc/security/limits.conf && \
echo "* hard nofile 131072" >> /etc/security/limits.conf
The Baseline
Current state before scaling work:
Metric Value
Pods 4
Connections per pod 12,000 (peak)
Total connections 48,000
Memory per pod 1.5GB allocated, 1.2GB used
File descriptor limit 65,536 (hard)
File descriptors used 12,400
Redis Pub/Sub channels ~8,000 (unique drivers being tracked)
Pod restart recovery 60-90 seconds
Target state:
Metric Value
Pods 5
Connections per pod 20,000 (target, 25,000 limit)
Total connections 100,000
Memory per pod 3.5GB allocated
File descriptor limit 131,072
Redis Pub/Sub channels ~15,000
Pod restart recovery < 10 seconds (graceful drain)
The Fix
Redis Pub/Sub for cross-instance broadcasting
When a driver sends a location update, it hits one pod. The 500 riders tracking that driver are distributed across all 5 pods. Redis Pub/Sub broadcasts the update:
// SCALED: Driver location ingestion with Redis Pub/Sub broadcast
@RestController
public class DriverLocationIngestController {
private final ReactiveRedisTemplate<String, DriverLocation> redisTemplate;
private final MeterRegistry meterRegistry;
@PostMapping("/api/drivers/location")
public Mono<Void> updateLocation(@RequestBody DriverLocation location) {
String channel = "driver:location:" + location.driverId();
return redisTemplate.convertAndSend(channel, location)
.doOnSuccess(receivers ->
meterRegistry.counter("driver.location.published",
"driver", location.driverId()).increment()
)
.then();
}
}
Each pod subscribes to Redis channels for the drivers its local riders are tracking:
// SCALED: Per-pod Redis subscription management
@Service
public class DriverLocationSubscriptionManager {
private final ReactiveRedisTemplate<String, DriverLocation> redisTemplate;
private final ConcurrentHashMap<String, Disposable> subscriptions =
new ConcurrentHashMap<>();
private final ConcurrentHashMap<String, Set<FluxSink<DriverLocation>>> listeners =
new ConcurrentHashMap<>();
public Flux<DriverLocation> subscribe(String driverId) {
return Flux.create(sink -> {
listeners.computeIfAbsent(driverId, k -> ConcurrentHashMap.newKeySet())
.add(sink);
// Subscribe to Redis channel if this is the first listener for this driver
subscriptions.computeIfAbsent(driverId, k -> {
String channel = "driver:location:" + driverId;
return redisTemplate.listenToChannel(channel)
.map(ReactiveSubscription.Message::getMessage)
.subscribe(location -> {
Set<FluxSink<DriverLocation>> sinks = listeners.get(driverId);
if (sinks != null) {
sinks.forEach(s -> s.next(location));
}
});
});
sink.onDispose(() -> {
Set<FluxSink<DriverLocation>> sinks = listeners.get(driverId);
if (sinks != null) {
sinks.remove(sink);
if (sinks.isEmpty()) {
listeners.remove(driverId);
Disposable sub = subscriptions.remove(driverId);
if (sub != null) sub.dispose();
}
}
});
});
}
}
The subscription manager tracks how many local riders are watching each driver. When the first rider on this pod starts tracking driver-123, the pod subscribes to driver:location:driver-123 on Redis. When the last rider stops tracking, the pod unsubscribes. This prevents subscribing to channels nobody on this pod cares about.
The fan-out problem
One driver location update triggers fan-out to all riders tracking that driver. During a concert let-out, 500 riders might track drivers in the same area. A single driver update publishes to Redis once. Redis delivers it to all 5 pods. Each pod pushes it to its ~100 local riders tracking that driver.
The math:
1 driver update → 1 Redis publish
1 Redis publish → 5 pod deliveries (1 per subscriber)
5 pod deliveries → 500 SSE writes (100 per pod)
Rate: 3 updates/second/driver × 200 active drivers in zone
= 600 Redis publishes/second
= 3,000 pod deliveries/second
= 300,000 SSE writes/second across all pods
300,000 SSE writes per second is the peak fan-out during a concert let-out. Each write is ~200 bytes. Total throughput: 60MB/s across 5 pods, 12MB/s per pod. Netty handles this on 4 event loop threads.
Redis Pub/Sub limitation: messages are fire-and-forget. If a pod is slow processing messages, Redis does not buffer them. For location updates, a missed message means the rider’s map shows a position that is 300ms stale. The next update corrects it. For ride acceptance messages, this is unacceptable, which is why ride acceptance uses a separate WebSocket connection with application-level acknowledgments and a database-backed retry queue.
Connection draining during deployments
A rolling deployment terminates pods one at a time. Each pod holds 20,000 SSE connections. If Kubernetes kills the pod immediately, 20,000 riders lose their driver tracking and reconnect to the remaining pods, spiking their connection count by 5,000 each.
Graceful draining sends a custom SSE event telling clients to reconnect before the pod terminates:
// SCALED: Graceful connection draining on shutdown
@Component
public class ConnectionDrainer {
private final List<FluxSink<ServerSentEvent<String>>> activeSinks =
new CopyOnWriteArrayList<>();
public void registerSink(FluxSink<ServerSentEvent<String>> sink) {
activeSinks.add(sink);
sink.onDispose(() -> activeSinks.remove(sink));
}
@PreDestroy
public void drainConnections() {
log.info("Draining {} SSE connections", activeSinks.size());
ServerSentEvent<String> reconnectEvent = ServerSentEvent.<String>builder()
.event("reconnect")
.data("server-shutdown")
.retry(Duration.ofMillis(100))
.build();
// Send reconnect event to all connections
activeSinks.forEach(sink -> {
try {
sink.next(reconnectEvent);
} catch (Exception e) {
// Connection already closed
}
});
// Wait for clients to reconnect to other pods
try {
Thread.sleep(5000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
// Close remaining connections
activeSinks.forEach(FluxSink::complete);
log.info("Connection drain complete");
}
}
The client handles the reconnect event:
source.addEventListener("reconnect", () => {
source.close();
// Small random delay to prevent thundering herd
const delay = Math.random() * 2000;
setTimeout(() => connectToDriver(driverId), delay);
});
The Math.random() * 2000 spreads reconnections over 2 seconds. Without jitter, 20,000 clients reconnect simultaneously, overwhelming the remaining pods. The retry(Duration.ofMillis(100)) in the SSE event is a fallback: if the client does not handle the reconnect event, the browser’s built-in EventSource reconnection fires after 100ms.
Kubernetes deployment manifest
# SCALED: Kubernetes Deployment for SSE driver location service
apiVersion: apps/v1
kind: Deployment
metadata:
name: driver-location-sse
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Never reduce capacity during deploy
template:
spec:
terminationGracePeriodSeconds: 30
containers:
- name: sse-service
image: ridehail/driver-location-sse:latest
resources:
requests:
memory: "2.5Gi"
cpu: "2"
limits:
memory: "3.5Gi"
cpu: "4"
env:
- name: JAVA_OPTS
value: >-
-Xmx2g -Xms2g
-XX:MaxDirectMemorySize=512m
-XX:+UseZGC
-XX:+ZGenerational
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- "curl -s -X POST localhost:8080/actuator/drain && sleep 10"
initContainers:
- name: sysctl-init
image: busybox:1.36
command:
- sh
- -c
- |
sysctl -w net.core.somaxconn=32768
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
securityContext:
privileged: true
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: driver-location-sse-pdb
spec:
minAvailable: 4
selector:
matchLabels:
app: driver-location-sse
Key configuration decisions:
-
maxUnavailable: 0ensures no capacity reduction during deploys. The new pod starts, passes readiness, and absorbs connections before the old pod drains. -
terminationGracePeriodSeconds: 30gives 30 seconds for the drain sequence: send reconnect events, wait 5 seconds, close connections, JVM shutdown. -
-XX:+UseZGC -XX:+ZGenerationaluses ZGC for sub-millisecond GC pauses. With 20,000 live objects (one per connection), a stop-the-world GC pause would freeze all 20,000 streams. ZGC’s concurrent collection avoids this. -
-XX:MaxDirectMemorySize=512mcaps Netty’s direct byte buffers. Without this, Netty allocates direct memory without bound and the pod OOMs outside the JVM heap. -
PodDisruptionBudget: minAvailable: 4prevents cluster operations (node drain, scaling) from killing more than one pod at a time. At 5 replicas, one pod down means 20,000 connections redistribute to 4 pods. Two pods down means 40,000 connections redistribute to 3 pods, which is too close to the 25,000-per-pod limit. -
initContainerssysctl raisessomaxconn(TCP connection backlog) and expands the ephemeral port range. The defaultsomaxconnof 128 causes connection drops during reconnection storms.
Locust simulation: 100k SSE connections
# load-tests/sse_100k_locustfile.py
import json
import sseclient
import requests
from locust import User, task, between, events
from locust.runners import MasterRunner
class SSEUser(User):
wait_time = between(60, 120) # Stay connected, reconnect on failure
def on_start(self):
self.driver_id = f"driver-{self.environment.runner.user_count % 500}"
self.connect()
def connect(self):
try:
url = f"{self.host}/api/sse/drivers/{self.driver_id}/location"
response = requests.get(url, stream=True, timeout=120)
self.client_sse = sseclient.SSEClient(response)
for event in self.client_sse.events():
if event.event == 'location':
data = json.loads(event.data)
events.request.fire(
request_type="SSE",
name="/sse/driver/location",
response_time=0,
response_length=len(event.data),
exception=None,
context={}
)
elif event.event == 'reconnect':
break
except Exception as e:
events.request.fire(
request_type="SSE",
name="/sse/driver/location",
response_time=0,
response_length=0,
exception=e,
context={}
)
@task
def maintain_connection(self):
# Reconnect if disconnected
self.connect()
Run with distributed Locust across 10 workers:
# Master
locust -f load-tests/sse_100k_locustfile.py \
--master \
--host=http://sse-service.ridehail.svc.cluster.local:8080
# Workers (10 instances, each manages 10k connections)
locust -f load-tests/sse_100k_locustfile.py \
--worker \
--master-host=locust-master
10 workers, each holding 10,000 SSE connections, simulating 100,000 concurrent riders tracking 500 unique drivers.
The Proof
Resource consumption at 100,000 concurrent SSE connections across 5 pods:
Metric Per Pod Total (5 pods)
SSE connections 20,000 100,000
Memory used 2.1GB 10.5GB
Memory allocated 3.5GB 17.5GB
CPU (avg) 1.8 cores 9 cores
CPU (peak fan-out) 3.2 cores 16 cores
File descriptors 20,450 102,250
Redis Pub/Sub channels 15,000 15,000 (shared)
Event delivery p99 38ms 38ms
Comparison with the polling baseline:
| Metric | Polling (25k RPS) | SSE (100k conn) | Delta |
|---|---|---|---|
| CPU | 12 cores | 9 cores | -25% |
| Memory | 8GB (JVM + conn pool) | 17.5GB | +119% |
| Bandwidth | 2.8TB/month | 180GB/month | -94% |
| Update latency (p99) | 2,000ms | 38ms | -98% |
| Active data efficiency | 15% | 100% | +567% |
SSE uses more memory because it holds connections open. It uses less CPU because there is no per-request overhead. It uses far less bandwidth because updates only send when data changes. The update latency drops from 2 seconds (poll interval) to 38ms (Redis Pub/Sub propagation + SSE write).
The memory increase is the trade. 17.5GB of RAM costs $35/month on cloud infrastructure. The 94% bandwidth reduction saves $280/month. The 98% latency improvement is worth more than both: riders see drivers moving in real-time, not jumping between positions every 2 seconds.
Deployment drain test: terminate one pod during peak load. Before drain logic, 20,000 connections reconnect simultaneously within 100ms, spike the remaining pods by 5,000 connections each in a thundering herd. After drain logic with jitter, connections redistribute over 2 seconds. No pod exceeds 22,000 connections during the transition. No rider sees more than 200ms of tracking interruption.