surviving the spike

WebSockets vs SSE: Architecture and the Right Tool for Each Feature

Chapter 35 of 66

The Symptom

The team implementing real-time driver tracking opens a pull request that adds WebSocket support for the rider app’s map view. The driver’s location streams to the rider over a WebSocket connection. The code works in development. In staging, behind an AWS Application Load Balancer, 30% of connections fail to establish. The ALB’s idle timeout is 60 seconds (the default), and it applies to WebSocket connections too. Riders whose drivers sit at a red light for 61 seconds, sending no location updates, lose their connection. The reconnection logic, hand-written because WebSocket has no built-in reconnection, has a bug that doubles the connection count on retry.

The same feature, implemented with SSE, would have worked through the ALB without configuration changes. The browser’s EventSource API reconnects automatically after the retry interval. The idle timeout stops mattering once the server writes periodic comment lines to the stream as keep-alives.

The team chose WebSocket because they assumed real-time requires WebSocket. The feature is server-to-client only. WebSocket is the wrong tool.

The Cause

WebSocket and SSE solve different problems. The confusion comes from treating them as interchangeable “real-time” protocols. They are not.

WebSocket is a full-duplex protocol. After an HTTP upgrade handshake, the connection switches from HTTP to the WebSocket protocol. Both sides send frames (text or binary) at any time. The protocol has no built-in reconnection, no event typing, no last-event-id replay. The application must implement all of these. WebSocket connections require explicit proxy support because the protocol is not HTTP after the initial handshake.

SSE is an HTTP response that never ends. The server sets Content-Type: text/event-stream and writes events to the response body. The browser’s EventSource API parses the stream, fires event listeners, and handles reconnection automatically. SSE supports event types, event IDs, and retry intervals as first-class protocol features. Because it is standard HTTP, proxies, CDNs, and load balancers pass it through without upgrade support. The one thing to check is that no intermediary buffers the response body, because buffering delays events.
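To make the wire format concrete, here is a minimal sketch of how one event could be serialized into a text/event-stream body. The names formatSseEvent, formatSseComment, and SseEvent are illustrative and not part of any library. The field names (id, event, retry, data) and the blank-line terminator come from the SSE format itself.

```typescript
// Hypothetical helper types and functions, for illustration only.
interface SseEvent {
  id?: string;
  event?: string;
  retryMs?: number;
  data: string;
}

// Serializes one event using text/event-stream framing.
function formatSseEvent(e: SseEvent): string {
  const lines: string[] = [];
  if (e.id !== undefined) lines.push(`id: ${e.id}`);
  if (e.event !== undefined) lines.push(`event: ${e.event}`);
  if (e.retryMs !== undefined) lines.push(`retry: ${e.retryMs}`);
  // Each line of a multi-line payload needs its own "data:" field.
  for (const line of e.data.split("\n")) lines.push(`data: ${line}`);
  // A blank line terminates the event.
  return lines.join("\n") + "\n\n";
}

// A comment line starts with ":" and is ignored by EventSource.
function formatSseComment(text: string): string {
  return `: ${text}\n\n`;
}

const wire = formatSseEvent({
  id: "1700000000000",
  event: "location",
  retryMs: 5000,
  data: JSON.stringify({ lat: 12.97, lng: 77.59 }),
});
```

The resulting string is what the browser’s parser sees: three metadata lines, one data line, and a blank line that tells EventSource to dispatch the event.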

The protocol comparison for the ride-hailing platform:

Characteristic             WebSocket                  SSE
Direction                  Full-duplex                Server → Client
Protocol after handshake   WebSocket (not HTTP)       HTTP
Binary data                Yes                        No (text only, base64 for binary)
Auto reconnection          No (manual)                Yes (built-in)
Last-event-id replay       No (manual)                Yes (built-in)
Event types                No (manual framing)        Yes (event: field)
Proxy/CDN support          Requires upgrade support   Works everywhere
Connection overhead        ~12KB/conn                 ~8KB/conn
Per-message overhead       2-14 bytes framing         ~20 bytes (data: ...\n\n)

WebSocket wins on per-message overhead (2-14 bytes vs ~20 bytes for SSE framing). SSE wins on everything else for server-to-client streaming.
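The per-message figures can be worked out from the framing rules. The sketch below assumes RFC 6455 header sizes for WebSocket and ASCII field values for SSE. The function names are illustrative. The SSE cost depends on which optional fields each event carries, so the table’s ~20 bytes should be read as a rough middle value, not a constant.

```typescript
// WebSocket frame header size per RFC 6455: 2 base bytes, an extended
// length field for payloads over 125 bytes, and a 4-byte masking key on
// client-to-server frames.
function wsHeaderBytes(payloadBytes: number, maskedByClient: boolean): number {
  let header = 2;
  if (payloadBytes > 65535) header += 8;
  else if (payloadBytes > 125) header += 2;
  if (maskedByClient) header += 4;
  return header;
}

// SSE framing for a single-line payload: "data: " plus the terminating
// "\n\n", plus optional "id: ...\n" and "event: ...\n" lines.
// Assumes ASCII values, so string length equals byte length.
function sseFramingBytes(id?: string, event?: string): number {
  let bytes = "data: ".length + "\n\n".length;
  if (id !== undefined) bytes += `id: ${id}\n`.length;
  if (event !== undefined) bytes += `event: ${event}\n`.length;
  return bytes;
}

const smallServerFrame = wsHeaderBytes(100, false);   // server-to-client, small payload
const largeClientFrame = wsHeaderBytes(70000, true);  // client-to-server, large payload
const bareSseEvent = sseFramingBytes();               // data line only
const fullSseEvent = sseFramingBytes("1700000000000", "location");
```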

The Baseline

The ride-hailing platform has four real-time features. Map each to the correct protocol:

Driver location to rider (SSE). The rider watches a driver move on a map. Data flows server-to-client only. The rider never sends data on this connection. Reconnection matters because mobile networks are unreliable. The Last-Event-ID header lets the server replay missed events on reconnection.

Ride acceptance by driver (WebSocket). The server sends a ride request to the driver. The driver sends accept or reject back. This is bidirectional. Latency matters because a slow accept means the rider waits. WebSocket’s lower per-message overhead and full-duplex nature fit this use case.

Surge pricing to rider (SSE). The surge multiplier updates every 30-60 seconds during peak. Server-to-client only. SSE.

Driver-rider chat (WebSocket). Both sides send messages. Bidirectional. WebSocket.

The decision framework: if the client never sends data on this connection, use SSE. If the client sends data, use WebSocket.
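The rule is simple enough to write down as code. This is a sketch of the decision, not a library: the RealtimeFeature type and the chooseProtocol function are hypothetical names, and the feature list restates the four features above.

```typescript
type Protocol = "SSE" | "WebSocket";

// Hypothetical description of a real-time feature.
interface RealtimeFeature {
  name: string;
  clientSendsOnConnection: boolean;
}

// The chapter's rule: client never sends on this connection -> SSE.
function chooseProtocol(feature: RealtimeFeature): Protocol {
  return feature.clientSendsOnConnection ? "WebSocket" : "SSE";
}

const features: RealtimeFeature[] = [
  { name: "Driver location to rider", clientSendsOnConnection: false },
  { name: "Ride acceptance by driver", clientSendsOnConnection: true },
  { name: "Surge pricing to rider", clientSendsOnConnection: false },
  { name: "Driver-rider chat", clientSendsOnConnection: true },
];

const mapping = features.map((f) => `${f.name}: ${chooseProtocol(f)}`);
```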

The Fix

SSE implementation: Driver location streaming

The Spring WebFlux SSE controller returns a Flux<ServerSentEvent>:

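// SCALED: SSE endpoint with event IDs and retry configuration
import java.time.Duration;

import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
@RequestMapping("/api/sse")
public class DriverLocationSseController {

    private final DriverLocationService locationService;

    public DriverLocationSseController(DriverLocationService locationService) {
        this.locationService = locationService;
    }

    @GetMapping(value = "/drivers/{driverId}/location",
                produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<DriverLocation>> streamLocation(
            @PathVariable String driverId,
            @RequestHeader(value = "Last-Event-ID", required = false)
                String lastEventId) {

        Flux<DriverLocation> locationStream = locationService
            .locationStream(driverId);

        // Replay missed events if client reconnects with Last-Event-ID
        if (lastEventId != null) {
            try {
                long lastTimestamp = Long.parseLong(lastEventId);
                locationStream = locationService
                    .locationsSince(driverId, lastTimestamp)
                    .concatWith(locationStream);
            } catch (NumberFormatException e) {
                // Malformed ID: skip the replay and serve the live stream only
            }
        }

        // Send heartbeat every 15 seconds to prevent proxy timeouts
        Flux<ServerSentEvent<DriverLocation>> heartbeat = Flux
            .interval(Duration.ofSeconds(15))
            .map(tick -> ServerSentEvent.<DriverLocation>builder()
                .comment("heartbeat")
                .build()
            );

        // Event ID is the location timestamp, so it doubles as a replay cursor
        Flux<ServerSentEvent<DriverLocation>> data = locationStream
            .map(loc -> ServerSentEvent.<DriverLocation>builder()
                .id(String.valueOf(loc.timestamp()))
                .event("location")
                .data(loc)
                .retry(Duration.ofSeconds(5))
                .build()
            );

        return Flux.merge(data, heartbeat);
    }
}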

Key details:

  1. Last-Event-ID replay. When the browser reconnects after a network drop, it sends the last received event ID in a header. The server replays events since that timestamp, then resumes the live stream with concatWith. The rider’s map does not show a gap.

  2. Heartbeat comments. SSE comments (lines starting with :) are ignored by the EventSource API but keep the connection alive through proxies that close idle connections. The 15-second interval is shorter than the typical 60-second proxy timeout.

  3. Retry directive. retry(Duration.ofSeconds(5)) tells the browser to wait 5 seconds before reconnecting after a disconnect. This is a suggestion; the browser may apply backoff.
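The replay in the first point depends on the server keeping recent events somewhere. The chapter does not show how locationsSince is implemented, so the following is only a sketch of one possible approach: a bounded in-memory buffer keyed by a numeric event ID. It is written in TypeScript for brevity, and ReplayBuffer and StoredEvent are hypothetical names.

```typescript
interface StoredEvent<T> {
  id: number;
  data: T;
}

// Keeps the most recent `capacity` events so a reconnecting client can
// be caught up from its Last-Event-ID.
class ReplayBuffer<T> {
  private events: StoredEvent<T>[] = [];

  constructor(private readonly capacity: number) {}

  // IDs are assumed to be increasing (the chapter uses the timestamp).
  append(id: number, data: T): void {
    this.events.push({ id, data });
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Everything strictly newer than the client's Last-Event-ID.
  since(lastEventId: string | null): StoredEvent<T>[] {
    if (lastEventId === null) return [];           // first connection
    if (!/^\d+$/.test(lastEventId)) return [];     // malformed header: no replay
    const last = Number(lastEventId);
    return this.events.filter((e) => e.id > last);
  }
}

const buffer = new ReplayBuffer<string>(3);
buffer.append(100, "a");
buffer.append(200, "b");
buffer.append(300, "c");
buffer.append(400, "d"); // evicts id 100
const missed = buffer.since("200").map((e) => e.id);
```

A bounded buffer means a client that was offline longer than the buffer covers still sees a gap. For a map view that is usually acceptable, because only the latest position matters.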

The client-side implementation uses the browser’s EventSource:

// SCALED: EventSource with automatic reconnection
function trackDriver(
  driverId: string,
  onLocation: (loc: DriverLocation) => void,
) {
  const source = new EventSource(`/api/sse/drivers/${driverId}/location`);

  source.addEventListener("location", (event: MessageEvent) => {
    const location: DriverLocation = JSON.parse(event.data);
    onLocation(location);
  });

  source.onerror = () => {
    // EventSource automatically reconnects with Last-Event-ID header
    // No manual reconnection logic needed
    console.warn(`SSE connection interrupted for driver ${driverId}`);
  };

  return () => source.close();
}

The EventSource API handles reconnection. When the connection drops, the browser waits the retry interval, then sends a new request with the Last-Event-ID header set to the last received event’s ID. The application code does not implement reconnection logic. The protocol handles it.
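What the browser tracks on the application’s behalf can be illustrated with a simplified stream parser. This is a sketch of the idea, not the browser’s implementation: it assumes LF line endings only and updates the last event ID as soon as the id field is seen. SseParser and ParsedEvent are hypothetical names.

```typescript
interface ParsedEvent {
  type: string;
  data: string;
  lastEventId: string;
}

class SseParser {
  lastEventId = "";               // sent back as Last-Event-ID on reconnect
  retryMs: number | null = null;  // reconnect delay requested by the server
  private pending = "";           // incomplete trailing line from the last chunk
  private dataLines: string[] = [];
  private eventType = "";

  // Accepts any chunk of the stream and returns the events it completes.
  feed(chunk: string): ParsedEvent[] {
    const out: ParsedEvent[] = [];
    this.pending += chunk;
    const lines = this.pending.split("\n");
    this.pending = lines.pop() ?? "";
    for (const line of lines) {
      if (line === "") {
        // Blank line: dispatch if any data was collected, then reset.
        if (this.dataLines.length > 0) {
          out.push({
            type: this.eventType || "message",
            data: this.dataLines.join("\n"),
            lastEventId: this.lastEventId,
          });
        }
        this.dataLines = [];
        this.eventType = "";
        continue;
      }
      if (line.startsWith(":")) continue; // comment, e.g. a heartbeat
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1);
      switch (field) {
        case "data": this.dataLines.push(value); break;
        case "event": this.eventType = value; break;
        case "id": this.lastEventId = value; break;
        case "retry":
          if (/^\d+$/.test(value)) this.retryMs = Number(value);
          break;
      }
    }
    return out;
  }
}

const parser = new SseParser();
const events = parser.feed(
  ': heartbeat\n\nid: 42\nevent: location\nretry: 5000\ndata: {"lat":1}\n\n',
);
```

After this input the parser holds lastEventId "42" and retryMs 5000, which are the two values the browser needs in order to wait and then reconnect with the right header.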

WebSocket implementation: Ride acceptance

The driver app connects via WebSocket for ride acceptance. Spring WebFlux’s WebSocketHandler manages the bidirectional stream:

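// SCALED: WebSocket handler with structured message protocol
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.socket.WebSocketHandler;
import org.springframework.web.reactive.socket.WebSocketMessage;
import org.springframework.web.reactive.socket.WebSocketSession;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Component
public class RideAcceptanceHandler implements WebSocketHandler {

    private static final Logger log =
        LoggerFactory.getLogger(RideAcceptanceHandler.class);

    private final RideMatchingService matchingService;
    private final ObjectMapper objectMapper;
    private final MeterRegistry meterRegistry;
    // A gauge, not a counter: counters only go up, open connections go both
    // ways. No per-driver tag, to keep metric cardinality bounded.
    private final AtomicInteger openConnections;

    public RideAcceptanceHandler(RideMatchingService matchingService,
                                 ObjectMapper objectMapper,
                                 MeterRegistry meterRegistry) {
        this.matchingService = matchingService;
        this.objectMapper = objectMapper;
        this.meterRegistry = meterRegistry;
        this.openConnections = meterRegistry.gauge(
            "ws.ride.connections", new AtomicInteger(0));
    }

    @Override
    public Mono<Void> handle(WebSocketSession session) {
        String driverId = extractDriverId(session);
        openConnections.incrementAndGet();

        // Outgoing: ride requests to driver
        Flux<WebSocketMessage> outgoing = matchingService
            .rideRequestsForDriver(driverId)
            .map(request -> {
                RideMessage msg = new RideMessage(
                    "ride_request",
                    request.rideId(),
                    toJson(request)
                );
                return session.textMessage(toJson(msg));
            })
            .doOnNext(msg ->
                meterRegistry.counter("ws.ride.sent").increment());

        // Incoming: driver responses (accept/reject)
        Mono<Void> incoming = session.receive()
            .map(WebSocketMessage::getPayloadAsText)
            .map(this::parseResponse)
            .filter(response ->
                "accept".equals(response.action())
                || "reject".equals(response.action()))
            .flatMap(response -> {
                meterRegistry.counter("ws.ride.responses",
                    "action", response.action()).increment();
                return matchingService.processDriverResponse(
                    driverId, response);
            })
            .then();

        // Ping every 30 seconds to detect dead connections
        Flux<WebSocketMessage> pings = Flux.interval(Duration.ofSeconds(30))
            .map(tick -> session.pingMessage(
                factory -> factory.wrap(new byte[0])));

        return Mono.zip(
            session.send(Flux.merge(outgoing, pings)),
            incoming
        ).then()
        .doFinally(signal -> {
            log.info("WebSocket closed for driver {}: {}",
                driverId, signal);
            openConnections.decrementAndGet();
        });
    }

    // Jackson throws checked exceptions, which lambdas in map() cannot
    // propagate, so wrap them in an unchecked exception.
    private String toJson(Object value) {
        try {
            return objectMapper.writeValueAsString(value);
        } catch (JsonProcessingException e) {
            throw new UncheckedIOException(e);
        }
    }

    private RideResponse parseResponse(String text) {
        try {
            return objectMapper.readValue(text, RideResponse.class);
        } catch (JsonProcessingException e) {
            throw new UncheckedIOException(e);
        }
    }

    private String extractDriverId(WebSocketSession session) {
        String driverId = session.getHandshakeInfo()
            .getHeaders().getFirst("X-Driver-Id");
        if (driverId == null || driverId.isBlank()) {
            throw new IllegalStateException(
                "Missing X-Driver-Id header in WebSocket handshake");
        }
        return driverId;
    }
}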

The handler differs from SSE in three ways:

  1. Bidirectional streams. session.receive() handles incoming messages from the driver. session.send() pushes ride requests to the driver. Both run concurrently via Mono.zip.

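  2. Manual ping/pong. Ping and pong frames are part of the WebSocket protocol, but nothing sends them on a schedule. The handler sends a ping every 30 seconds so that idle proxies keep the connection open and a dead connection eventually surfaces as a failed send. Enforcing a pong deadline is additional application code that this handler omits. SSE heartbeats play the same role with less machinery: a comment line written to the response.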

  3. No automatic reconnection. If the WebSocket connection drops, the driver app must reconnect manually:

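// Client-side WebSocket with manual reconnection
class RideAcceptanceSocket {
  private ws: WebSocket | null = null;
  private reconnectTimer: ReturnType<typeof setTimeout> | null = null;
  private closedByApp = false;
  private reconnectDelay = 1000;
  private maxReconnectDelay = 30000;

  connect(driverId: string) {
    // Tear down any pending retry and previous socket first, so a retry
    // can never leave two live connections for the same driver.
    this.teardown();
    this.closedByApp = false;

    // driverId is not sent here. In a browser, the WebSocket API cannot set
    // custom headers, so this assumes the X-Driver-Id header the server
    // reads is added upstream (for example by an authenticating gateway).
    const ws = new WebSocket("wss://api.ridehail.com/ws/rides");
    this.ws = ws;

    ws.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === "ride_request") {
        this.handleRideRequest(msg.payload);
      }
    };

    ws.onclose = () => {
      if (this.closedByApp) return; // intentional close: do not retry
      this.reconnectTimer = setTimeout(
        () => this.connect(driverId),
        this.reconnectDelay,
      );
      this.reconnectDelay = Math.min(
        this.reconnectDelay * 2,
        this.maxReconnectDelay,
      );
    };

    ws.onopen = () => {
      this.reconnectDelay = 1000; // Reset backoff on successful connect
    };
  }

  // Stop for good, e.g. when the driver goes offline
  close() {
    this.closedByApp = true;
    this.teardown();
  }

  sendResponse(rideId: string, action: "accept" | "reject") {
    this.ws?.send(JSON.stringify({ rideId, action }));
  }

  private teardown() {
    if (this.reconnectTimer !== null) {
      clearTimeout(this.reconnectTimer);
      this.reconnectTimer = null;
    }
    if (this.ws) {
      this.ws.onclose = null; // the old socket must not schedule a retry
      this.ws.close();
      this.ws = null;
    }
  }

  private handleRideRequest(request: RideRequest) {
    // Show ride request UI to driver
  }
}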

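This reconnection logic is what SSE gives you for free. For WebSocket, it is application code that must be written, tested, and maintained. Exponential backoff keeps a restarting instance from being hammered by immediate retries. Adding random jitter to each delay goes further and stops clients that disconnected together from reconnecting in lockstep.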
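One way to spread reconnects out is to randomize each delay. The sketch below uses the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling. The function name and its parameters are illustrative. The 1-second base and 30-second cap mirror the values in the client class above.

```typescript
// Full jitter: a random delay between 0 and the exponential ceiling.
// `random` is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number, // 0 for the first retry
  baseMs: number = 1000,
  maxMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" source of 0.5, the delays are half the ceiling.
const half = () => 0.5;
const delays = [0, 1, 2, 3, 10].map((n) => reconnectDelayMs(n, 1000, 30000, half));
```

With real randomness, a thousand drivers dropped by the same instance restart pick a thousand different delays instead of all retrying at the 1-second mark.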

Connection lifecycle comparison

The SSE connection lifecycle:

[Figure: SSE connection lifecycle sequence diagram showing automatic reconnection with Last-Event-ID after a network interruption, with the server replaying missed events]

The SSE lifecycle shows the key advantage: after a network interruption, the browser reconnects automatically and sends the Last-Event-ID header. The server uses this to replay any events the client missed during the disconnection. No application-level reconnection code is needed — the EventSource API handles it natively.

The WebSocket connection lifecycle:

[Figure: WebSocket connection lifecycle sequence diagram showing bidirectional communication, manual reconnection with exponential backoff, and no built-in event replay]

The WebSocket lifecycle highlights the tradeoff: full-duplex communication enables the driver to send ride acceptance responses, but reconnection is entirely manual. After a network interruption, the application must implement exponential backoff, re-establish the connection, and track which messages were missed. There is no equivalent of Last-Event-ID — the application must maintain its own state.

The SSE reconnection carries the last event ID. The server replays missed events. The WebSocket reconnection starts fresh. The application must track what the driver has seen and re-request missed ride offers.
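The state the WebSocket client has to keep can be small. The following is a sketch of one possible scheme, assuming the server numbers its messages and understands a resume frame. Neither the seq field nor the "resume" message type appears in the chapter’s RideMessage, and ResumeTracker is a hypothetical name.

```typescript
// Hypothetical message shape: the server attaches an increasing seq.
interface SequencedMessage {
  seq: number;
  type: string;
  payload: string;
}

// Client-side stand-in for what Last-Event-ID does in SSE.
class ResumeTracker {
  private lastSeq = 0;

  // True if the message is new; false if it is a redelivered duplicate.
  accept(msg: SequencedMessage): boolean {
    if (msg.seq <= this.lastSeq) return false;
    this.lastSeq = msg.seq;
    return true;
  }

  // First frame to send after every reconnect, so the server knows
  // where to resume from.
  resumeFrame(): string {
    return JSON.stringify({ type: "resume", lastSeq: this.lastSeq });
  }
}

const tracker = new ResumeTracker();
const first = tracker.accept({ seq: 1, type: "ride_request", payload: "{}" });
const second = tracker.accept({ seq: 2, type: "ride_request", payload: "{}" });
const duplicate = tracker.accept({ seq: 2, type: "ride_request", payload: "{}" });
const frame = tracker.resumeFrame();
```

The server side still needs a per-driver buffer of recent offers to answer the resume frame. That is the part SSE’s Last-Event-ID convention standardizes and WebSocket leaves to each application.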

The Proof

Feature-to-protocol mapping after implementation:

Feature             Protocol    Reconnect          Keep-alive          Proxy support
Driver location     SSE         Automatic          Comment heartbeat   Native HTTP
Surge pricing       SSE         Automatic          Comment heartbeat   Native HTTP
Ride acceptance     WebSocket   Manual + backoff   Ping/pong           Requires upgrade
Driver-rider chat   WebSocket   Manual + backoff   Ping/pong           Requires upgrade

The ALB WebSocket failure from the symptom section: resolved by switching driver location from WebSocket to SSE. The ALB’s 60-second idle timeout is irrelevant because SSE heartbeat comments fire every 15 seconds. No ALB configuration change was needed.

Connection establishment success rate:

Protocol    Success Rate    Avg Setup Time
SSE         99.7%           120ms
WebSocket   97.2%           340ms

The 2.5-point gap in success rate comes from corporate proxies and mobile networks that block the HTTP upgrade. SSE works through these because it never leaves HTTP. For driver location, that gap meant about 1,250 more riders out of every 50,000 could not track their driver over WebSocket than over SSE. Moving the feature to SSE closed the gap.
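The arithmetic behind that figure, worked through from the success rates in the table for 50,000 riders. The function name is illustrative.

```typescript
// Riders who fail to connect at a given success rate (in percent).
function ridersUnableToConnect(successRatePct: number, riders: number): number {
  return Math.round((riders * (100 - successRatePct)) / 100);
}

const riders = 50000;
const failedOverWebSocket = ridersUnableToConnect(97.2, riders);
const failedOverSse = ridersUnableToConnect(99.7, riders);
const additionalFailures = failedOverWebSocket - failedOverSse;
```

WebSocket leaves 1,400 riders without a connection and SSE leaves 150. The difference is the 1,250 cited above.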

Chapter 12-S2 covers what happens when 50,000 SSE connections become 100,000: memory limits, file descriptor configuration, Redis Pub/Sub fan-out, and Kubernetes resource management.