HTTP/2 Multiplexing and the End of Connection Limits

The main chapter showed HTTP/2 eliminating head-of-line blocking at the application layer. This section examines the mechanics: how frames interleave, how HPACK compresses repetitive headers, how flow control prevents fast senders from overwhelming slow receivers, and why server push was a good idea that failed in practice.

Frame Interleaving on the Wire

In HTTP/1.1, a response must complete before the next request can use the connection. In HTTP/2, the server can interleave DATA frames from different streams in any order, while frames within a single stream stay in order. The client reassembles each stream independently:

// Wire capture: HTTP/2 serving article list + recommendation API concurrently
//
// Frame 1: HEADERS  stream=1  :status=200, content-type=application/json
// Frame 2: HEADERS  stream=3  :status=200, content-type=application/json
// Frame 3: DATA     stream=1  [first 16KB of article list response]
// Frame 4: DATA     stream=3  [first 16KB of recommendations response]
// Frame 5: DATA     stream=1  [next 16KB of article list] END_STREAM
// Frame 6: DATA     stream=3  [remaining recommendations] END_STREAM
//
// Neither response waits for the other to finish; both progress in parallel
// Client reassembles stream 1 frames into article list
// Client reassembles stream 3 frames into recommendations

// HTTP/1.1 equivalent requires 2 connections or serial responses:
// Conn 1: [---article list response (full)---][---next request---]
// Conn 2: [---recommendations response (full)---]
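
The interleaving in the capture can be reproduced with a toy model. The sketch below is a minimal illustration, not a real HTTP/2 stack: the function names (split_into_frames, interleave, reassemble) and the 20KB recommendations body are assumptions made for the example, and frames are plain tuples rather than encoded bytes.

```python
from collections import defaultdict
from itertools import zip_longest

MAX_FRAME_SIZE = 16_384  # SETTINGS_MAX_FRAME_SIZE default


def split_into_frames(stream_id, body, max_frame=MAX_FRAME_SIZE):
    """Cut one response body into (stream_id, chunk, end_stream) DATA frames."""
    chunks = [body[i:i + max_frame] for i in range(0, len(body), max_frame)] or [b""]
    return [(stream_id, chunk, i == len(chunks) - 1) for i, chunk in enumerate(chunks)]


def interleave(*per_stream_frames):
    """Round-robin across streams; order within each stream is preserved."""
    wire = []
    for round_ in zip_longest(*per_stream_frames):
        wire.extend(frame for frame in round_ if frame is not None)
    return wire


def reassemble(wire):
    """Receiver side: append each chunk to the buffer for its stream id."""
    buffers = defaultdict(bytearray)
    for stream_id, chunk, _end_stream in wire:
        buffers[stream_id] += chunk
    return {sid: bytes(buf) for sid, buf in buffers.items()}


article_list = b"A" * (37 * 1024)      # 37KB body on stream 1
recommendations = b"R" * (20 * 1024)   # hypothetical 20KB body on stream 3

wire = interleave(split_into_frames(1, article_list),
                  split_into_frames(3, recommendations))
bodies = reassemble(wire)
print([(sid, len(chunk), end) for sid, chunk, end in wire])
```

The 37KB body on stream 1 is cut into three frames, and both bodies come back byte-for-byte intact even though their frames alternate on the wire.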

The frame maximum size defaults to 16384 bytes (16KB). For the content platform’s 37KB minified article list response, this means 3 DATA frames per response. With 50 concurrent article list requests, 150 DATA frames interleave with frames from other streams:

// Server-side frame scheduling (simplified Netty H2 behavior):
// 1. Check all streams with pending data
// 2. Apply priority/weight to determine send order
// 3. Send up to SETTINGS_MAX_FRAME_SIZE bytes per stream per round
// 4. Move to next stream
// 5. Repeat until all streams drained or connection window exhausted

// Netty's default scheduling: weighted fair queuing
// Stream with weight 256 gets 2x bandwidth of stream with weight 128
// Content platform: article-list API at weight 220, analytics at weight 32
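
The weight arithmetic can be sketched in a few lines. This is a deliberately simplified proportional split, assuming a fixed byte budget per scheduling round. It is not Netty's actual distributor, and weighted_round is a hypothetical helper name.

```python
def weighted_round(pending, weights, budget):
    """Split `budget` bytes across streams that still have data, by weight.

    pending: {stream_id: bytes still to send}
    weights: {stream_id: weight in 1..256}
    Returns {stream_id: bytes granted this round}.
    """
    active = {sid: w for sid, w in weights.items() if pending.get(sid, 0) > 0}
    total_weight = sum(active.values())
    grants = {}
    for sid, weight in active.items():
        share = budget * weight // total_weight
        grants[sid] = min(share, pending[sid])  # never grant more than is pending
    return grants


grants = weighted_round(
    pending={1: 1_000_000, 3: 1_000_000},
    weights={1: 256, 3: 128},
    budget=48 * 1024,
)
print(grants)
```

With weights 256 and 128 and a 48KB budget, stream 1 is granted 32768 bytes and stream 3 is granted 16384 bytes, which is the 2x ratio described above.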

HPACK Header Compression

HTTP/1.1 headers are plain text, repeated on every request. A typical content platform API request sends:

GET /api/articles?page_size=50 HTTP/1.1
Host: api.contentplatform.com
Accept: application/json
Accept-Encoding: gzip, deflate, br
Authorization: Bearer eyJhbGciOiJSUzI1NiIs... (800+ bytes)
Cookie: session=abc123; preferences=dark-mode; consent=granted
User-Agent: Mozilla/5.0 (Linux; Android 14) AppleWebKit/537.36...
X-Request-ID: 7f8a9b2c-3d4e-5f6a-7b8c-9d0e1f2a3b4c
X-Correlation-ID: trace-2024-abc-def

Total: approximately 1,200 bytes per request. Across 15 resources on a page load, that is 18KB of headers alone, most of it identical between requests.

HPACK compresses headers using two indexing tables (string literals can additionally be Huffman-coded):

// HPACK Static Table (61 pre-defined headers)
// Index 1:  :authority
// Index 2:  :method GET
// Index 3:  :method POST
// Index 8:  :status 200
// Index 16: accept-encoding: gzip, deflate
// Index 31: content-type
// ... 61 entries total

// HPACK Dynamic Table (connection-scoped, shared between requests)
// Entries added as headers are transmitted
// Default max size: 4096 bytes (SETTINGS_HEADER_TABLE_SIZE)

// First request: full header values transmitted, added to dynamic table
// Second request: reference by index (1 byte instead of 800+ bytes for Auth)

// Compression ratio for content platform headers:
// Request 1:  1,200 bytes -> 1,180 bytes (static table refs only)
// Request 2:  1,200 bytes ->   68 bytes  (dynamic table refs for repeated headers;
//                                          per-request values such as X-Request-ID stay literal)
// Request 3+: 1,200 bytes ->   42 bytes  (stable dynamic table)
//
// 96.5% header compression after warmup
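
The warmup effect can be estimated with a rough size model. The sketch below is an approximation under stated assumptions: it ignores the static table, Huffman coding, and eviction, and it treats the dynamic table as a simple set. The helper names, the three-header request, and the `volatile` parameter are all hypothetical. Only the integer and literal size rules follow RFC 7541.

```python
def int_len(value, prefix_bits):
    """Bytes needed for an HPACK prefix-coded integer (RFC 7541 section 5.1)."""
    limit = (1 << prefix_bits) - 1
    if value < limit:
        return 1
    value -= limit
    extra = 1
    while value >= 128:
        value >>= 7
        extra += 1
    return 1 + extra


def literal_size(name, value):
    """Literal field with a new name, no Huffman coding (RFC 7541 section 6.2)."""
    return (1 + int_len(len(name), 7) + len(name)
            + int_len(len(value), 7) + len(value))


def request_size(headers, dynamic_table, volatile=("x-request-id",)):
    """Estimate encoded bytes for one request and update the dynamic table."""
    total = 0
    for name, value in headers:
        if (name, value) in dynamic_table:
            total += 1  # indexed field: one byte while the index is below 127
        else:
            total += literal_size(name, value)
            if name not in volatile:  # per-request values are not worth indexing
                dynamic_table.add((name, value))
    return total


def headers_for(request_id):
    return [
        ("authorization", "Bearer " + "x" * 800),  # stand-in for the 800+ byte token
        ("accept", "application/json"),
        ("x-request-id", request_id),
    ]


table = set()
first = request_size(headers_for("req-1"), table)
second = request_size(headers_for("req-2"), table)
print(first, second)
```

The second request shrinks to one byte per repeated header plus the literal for the request ID, which is why a unique per-request header sets the floor on compressed size.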

The dynamic table has critical performance implications. If the table is too small, entries get evicted, forcing retransmission. For the content platform’s 800-byte Authorization header:

// SETTINGS_HEADER_TABLE_SIZE tuning:
//
// Default (4096 bytes): Fits only about 4 entries of ~800 bytes before eviction
//   (each entry costs name + value + 32 bytes of overhead)
// Problem: With 15 unique header combinations, eviction thrashing occurs
//
// Recommended (8192 bytes): Fits all common header combinations
// Result: Stable compression after first page load

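// Spring Boot exposes only server.http2.enabled as a property;
// the HPACK table size is set programmatically on Reactor Netty:

// Imports use Spring Boot 2.x/3.x package names
import org.springframework.boot.web.embedded.netty.NettyReactiveWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HpackConfig {

    @Bean
    public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> hpackCustomizer() {
        return factory -> factory.addServerCustomizers(httpServer ->
            httpServer.http2Settings(settings ->
                settings.headerTableSize(8192)       // SETTINGS_HEADER_TABLE_SIZE
                    .maxHeaderListSize(16384)        // SETTINGS_MAX_HEADER_LIST_SIZE
            )
        );
    }
}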
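
The eviction behavior behind the table-size recommendation can be simulated directly. The sketch below assumes six distinct 800-byte Authorization values arriving on one connection (a hypothetical workload). The DynamicTable class is illustrative, but the entry-size rule (name + value + 32 bytes) and the oldest-first eviction follow RFC 7541 section 4.

```python
from collections import deque

ENTRY_OVERHEAD = 32  # RFC 7541 section 4.1: size = len(name) + len(value) + 32


class DynamicTable:
    def __init__(self, max_size=4096):
        self.max_size = max_size
        self.entries = deque()  # newest entry at the left, as in HPACK
        self.size = 0
        self.evictions = 0

    def add(self, name, value):
        entry_size = len(name) + len(value) + ENTRY_OVERHEAD
        # Evict the oldest entries until the new one fits
        while self.entries and self.size + entry_size > self.max_size:
            old_name, old_value = self.entries.pop()
            self.size -= len(old_name) + len(old_value) + ENTRY_OVERHEAD
            self.evictions += 1
        # An entry larger than the whole table empties it and is not stored
        if entry_size <= self.max_size:
            self.entries.appendleft((name, value))
            self.size += entry_size


def fill(table, count=6):
    for i in range(count):
        token = str(i) * 800  # hypothetical distinct 800-byte token per entry
        table.add("authorization", token)
    return table


small = fill(DynamicTable(4096))
large = fill(DynamicTable(8192))
print(len(small.entries), small.evictions, len(large.entries), large.evictions)
```

Each entry here costs 845 bytes, so the 4096-byte table holds four of them and evicts twice, while the 8192-byte table keeps all six.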

Flow Control

HTTP/2 implements flow control at two levels: connection-level and stream-level. Each level maintains a window size representing how many bytes the sender is allowed to transmit before receiving a WINDOW_UPDATE frame:

// Flow control windows:
//
// Connection window: Total bytes allowed across ALL streams
// Stream window:     Bytes allowed for THIS stream
//
// Effective allowed = min(connection_window, stream_window)
//
// Default: 65,535 bytes (64KB) for both
//
// Problem for content platform:
//   50 concurrent streams * 37KB average response = 1,850KB needed
//   Connection window of 64KB means constant WINDOW_UPDATE ping-pong
//   Each WINDOW_UPDATE is only 13 bytes (9-byte frame header + 4-byte increment);
//   the cost is the extra round trips spent waiting for it

// Window too small: throughput limited by WINDOW_UPDATE latency
// Window too large: risk of overwhelming client (mobile with limited buffer)

// Optimal connection window for content platform:
// Target: sustain 50 concurrent responses without stalling
// Calculation: 50 streams * 16KB frame * 2 frames in flight = 1,600KB
// Setting: 2MB connection window (2,097,152 bytes)
// Stream window: 256KB per stream (sufficient for largest single response)
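
The two-level accounting can be made concrete with a small sender-side model. This is a sketch under simplifying assumptions: only DATA payload is counted, WINDOW_UPDATE frames are applied by hand, and the FlowControl class and its method names are hypothetical rather than taken from any library.

```python
class FlowControl:
    """Sender-side view of HTTP/2 flow-control windows (DATA frames only)."""

    def __init__(self, connection_window=65_535, stream_window=65_535):
        self.connection_window = connection_window
        self.initial_stream_window = stream_window
        self.stream_windows = {}

    def sendable(self, stream_id, wanted):
        """Bytes that may be sent now: min(wanted, connection, stream)."""
        stream_window = self.stream_windows.setdefault(
            stream_id, self.initial_stream_window)
        return max(0, min(wanted, self.connection_window, stream_window))

    def send(self, stream_id, wanted):
        """Send as much as both windows allow and debit both."""
        n = self.sendable(stream_id, wanted)
        self.connection_window -= n
        self.stream_windows[stream_id] -= n
        return n

    def window_update(self, stream_id, increment):
        """Stream id 0 credits the connection window, any other id its stream."""
        if stream_id == 0:
            self.connection_window += increment
        else:
            current = self.stream_windows.get(stream_id, self.initial_stream_window)
            self.stream_windows[stream_id] = current + increment


fc = FlowControl()                       # protocol defaults: 65,535 bytes each
stream_ids = range(1, 100, 2)            # 50 client-initiated (odd) stream ids
sent = [fc.send(sid, 37 * 1024) for sid in stream_ids]
print(sum(sent), sent.count(0))
```

With default windows, the first stream sends its full 37KB, the second gets the remaining 27,647 bytes of the connection window, and the other 48 streams send nothing until a WINDOW_UPDATE for stream 0 arrives.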

Configuration:

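@Bean
public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> flowControlCustomizer() {
    return factory -> factory.addServerCustomizers(httpServer ->
        httpServer.http2Settings(settings ->
            // SETTINGS_INITIAL_WINDOW_SIZE is the per-stream receive window this
            // server advertises, i.e. how much a client may send on each stream.
            // How fast responses flow to the client is bounded by the windows
            // the client advertises. The connection-level window is not set via
            // SETTINGS; it grows only through WINDOW_UPDATE on stream 0.
            settings.initialWindowSize(262144)  // 256KB per stream
        )
    );
}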

// Client-side (for internal services calling other services):
// java.net.http.HttpClient negotiating HTTP/2
HttpClient httpClient = HttpClient.newBuilder()
    .version(HttpClient.Version.HTTP_2)
    .build();

// Note: the HttpClient builder has no setters for H2 window sizes. The JDK
// reads JVM-wide system properties instead (jdk.httpclient.windowsize,
// jdk.httpclient.connectionWindowSize).
// For internal services needing per-client control, use Netty directly or gRPC

Server Push (Deprecated)

HTTP/2 originally included server push: the server sends resources before the client requests them. The content platform could push CSS and critical JS alongside the HTML response:

// Server push concept (DEPRECATED):
// Client requests: GET /articles
// Server pushes: /css/main.css, /js/critical.js (without waiting for client)
//
// Why it failed:
// 1. Redundant pushes: Server pushes resources the client already has cached
//    Wasted bandwidth on repeated visits (80%+ of content platform traffic)
//
// 2. Priority inversion: Pushed resources compete with requested resources
//    Large pushed CSS blocks small critical API response
//
// 3. Client cancellation: Client must RST_STREAM to cancel unwanted push
//    By then, bytes already on the wire
//
// 4. No good heuristic: Server cannot know client cache state
//    Cache digest proposals never standardized
//
// Chrome disabled server push by default in 2022 (Chrome 106)
// Replacement: 103 Early Hints (informational response with Link headers)

The replacement pattern for the content platform:

// 103 Early Hints: tell the browser to preload resources while server computes
@GetMapping("/articles")
public ResponseEntity<ArticleListResponse> getArticles() {
    // The 103 Early Hints response goes out before this handler finishes,
    // so the browser fetches CSS/JS in parallel with server processing.
    //
    // The handler does not emit the 103 itself; the Nginx/CDN layer does:
    // HTTP/1.1 103 Early Hints
    // Link: </css/main.css>; rel=preload; as=style
    // Link: </js/critical.js>; rel=preload; as=script
    //
    // ... server processes for 45ms ...
    //
    // HTTP/1.1 200 OK
    // Content-Type: application/json
    // [article list response body]

    ArticleListResponse response = articleService.getArticleList(50);
    return ResponseEntity.ok(response);
}
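
The shape of the 103 response head is simple enough to build by hand. The sketch below only formats the text shown in the comments above. The helper names preload_link and early_hints are hypothetical, and a real deployment would let the proxy or CDN emit this response rather than application code.

```python
def preload_link(path, as_type):
    """Format one Link header value for a preload hint."""
    return f"<{path}>; rel=preload; as={as_type}"


def early_hints(resources):
    """Build an HTTP/1.1-style 103 response head for (path, as_type) pairs."""
    lines = ["HTTP/1.1 103 Early Hints"]
    lines += [f"Link: {preload_link(path, as_type)}" for path, as_type in resources]
    return "\r\n".join(lines) + "\r\n\r\n"


hints = early_hints([("/css/main.css", "style"), ("/js/critical.js", "script")])
print(hints)
```

The final 200 response follows on the same connection once the server has finished processing.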

Stream Prioritization

HTTP/2 originally used a dependency tree model for stream priorities (RFC 7540). This proved too complex for implementations to support consistently; RFC 9113 deprecated it, and RFC 9218 Extensible Priorities defines the replacement:

// RFC 9218 Priority signals:
// Priority: u=N     (urgency, 0-7, lower = more important)
// Priority: u=N, i  (incremental: can be processed as received)
//
// Content platform priority mapping:
// u=0: HTML document (critical rendering path)
// u=1: CSS, critical JS (render-blocking)
// u=2: Article list API, visible images (above the fold)
// u=3: Font files (needed for text rendering)
// u=4: Non-critical JS, below-fold images
// u=5: Recommendation API, analytics
// u=6: Prefetch for next page
// u=7: Background sync, beacon

// Spring Boot response with priority hint:
@GetMapping("/api/articles")
public ResponseEntity<ArticleListResponse> getArticles() {
    ArticleListResponse articles = articleService.getArticleList(50);
    return ResponseEntity.ok()
        .header("Priority", "u=2")
        .body(articles);
}

@GetMapping("/api/recommendations")
public ResponseEntity<RecommendationResponse> getRecommendations() {
    RecommendationResponse recs = recommendationService.getForUser(currentUser());
    return ResponseEntity.ok()
        .header("Priority", "u=5, i")  // lower urgency, incremental
        .body(recs);
}
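
On the receiving side (a CDN or proxy, for example), the Priority value has to be parsed before it can influence scheduling. The sketch below handles only the u and i members in their simplest textual forms and is not a full Structured Fields parser. parse_priority and the pending map are hypothetical names. The defaults (urgency 3, non-incremental) and the rule that out-of-range values are ignored come from RFC 9218.

```python
VALID_URGENCY = set("01234567")


def parse_priority(header_value):
    """Return (urgency, incremental) from a Priority header value."""
    urgency, incremental = 3, False  # RFC 9218 defaults
    for item in header_value.split(","):
        key, _, val = item.strip().partition("=")
        key, val = key.strip(), val.strip()
        if key == "u" and val in VALID_URGENCY:
            urgency = int(val)
        elif key == "i":
            if val in ("", "?1"):      # bare "i" means true
                incremental = True
            elif val == "?0":
                incremental = False
    return urgency, incremental


pending = {
    "/api/recommendations": "u=5, i",
    "/api/articles": "u=2",
    "/fonts/inter-0.woff2": "",        # no header: default urgency 3
}
send_order = sorted(pending, key=lambda path: parse_priority(pending[path])[0])
print(send_order)
```

Sorting by urgency puts the article list first, the font second, and the recommendations last, matching the priority mapping above.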

Benchmark: HTTP/1.1 vs HTTP/2 Under Load

Locust test simulating the content platform’s page load pattern:

# locust_protocol_comparison.py
from locust import HttpUser, task, between, events
import time

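class ContentPlatformPageLoad(HttpUser):
    """Simulates a full page load: 15 resources per page view.

    Note: HttpUser wraps python-requests, which speaks HTTP/1.1 only and
    issues these calls one after another. The HTTP/2 run needs an
    HTTP/2-capable client behind a custom User class.
    """
    wait_time = between(1, 3)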

    @task
    def full_page_load(self):
        """Load all resources a content page needs."""
        start = time.time()

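        # Critical path (blocks rendering)
        # The HTML document brings the total to the 15 resources stated above
        self.client.get("/articles", name="html-document")
        self.client.get("/api/articles?page_size=50", name="article-list")
        self.client.get("/css/main.css", name="css-main")
        self.client.get("/js/critical.js", name="js-critical")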

        # Above the fold
        self.client.get("/api/recommendations", name="recommendations")
        for i in range(3):
            self.client.get(f"/images/hero-{i}.webp", name="hero-image")

        # Fonts
        for i in range(4):
            self.client.get(f"/fonts/inter-{i}.woff2", name="font")

        # Below the fold
        self.client.get("/js/analytics.js", name="js-analytics")
        self.client.get("/api/user/state", name="user-state")
        self.client.get("/js/lazy.js", name="js-lazy")

        elapsed = time.time() - start
        events.request.fire(
            request_type="PAGE",
            name="full-page-load",
            response_time=elapsed * 1000,
            response_length=0,
            exception=None,
            context={}
        )

Results at 500 concurrent users, 80ms simulated RTT:

HTTP/1.1 (6 connections per user, TLS 1.2):
  Page load P50:      1,240ms
  Page load P99:      3,890ms
  Requests/sec:       4,200
  Active connections: 3,000
  Server CPU:         72%
  Server memory:      3.1GB (connection buffers)

HTTP/2 (1 connection per user, TLS 1.3):
  Page load P50:        380ms
  Page load P99:        820ms
  Requests/sec:       12,600
  Active connections:    500
  Server CPU:           45%
  Server memory:      1.2GB

Improvement:
  P50 page load: 3.3x faster
  P99 page load: 4.7x faster
  Throughput:    3.0x higher
  Connections:   6.0x fewer
  Memory:        2.6x less

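The P99 improvement (4.7x) exceeds the P50 improvement (3.3x) because HTTP/1.1’s connection contention causes queuing: under load, the 7th request waits for a free connection. This waiting time is highly variable, inflating tail latency. The two runs also differ in TLS version (1.2 vs 1.3), so some of the gap may come from the shorter TLS 1.3 handshake rather than from multiplexing alone.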
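
The queuing effect can be illustrated with a back-of-the-envelope model. The sketch below assumes every request costs exactly one round trip and ignores server time, bandwidth, and handshakes, so it gives a lower bound on round-trip cost rather than a prediction of the measured numbers. The function names are hypothetical.

```python
import math

RTT_MS = 80  # simulated round-trip time from the benchmark


def http1_rounds(resources, connections=6):
    """Sequential rounds needed when only `connections` requests run at once."""
    return math.ceil(resources / connections)


def http1_wait_rounds(request_number, connections=6):
    """Rounds the Nth request (1-based) waits before it gets a connection."""
    return (request_number - 1) // connections


rounds = http1_rounds(15)
print(f"HTTP/1.1: {rounds} rounds, at least {rounds * RTT_MS}ms of round trips")
print(f"HTTP/2:   1 round, at least {RTT_MS}ms of round trips")
```

In this model the first six requests start immediately, the 7th waits one round, and the 15th waits two, so a 15-resource page needs three rounds on HTTP/1.1 against one on a multiplexed connection.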

Max Concurrent Streams Tuning

The server controls how many streams a client can open simultaneously via SETTINGS_MAX_CONCURRENT_STREAMS:

// Too low: client forced to queue requests
//   (the protocol sets no initial limit; 100 is the RFC's recommended
//   minimum and a common server default)
// Effect: reintroduces HTTP/1.1-style waiting for "free" streams
//
// Too high (unlimited): risk of resource exhaustion under attack
// Effect: single client opens 10,000 streams, exhausts server memory
//
// Content platform calculation:
// Typical page: 15 resources
// Prefetch next page: 10 resources
// Background sync: 5 requests
// Safety margin: 2x
// Target: (15 + 10 + 5) * 2 = 60
//
// But: HTTP/2 connections are often shared across browser tabs
// User with 4 tabs: 60 * 4 = 240
// Setting: 250 max concurrent streams

// Protection against abuse:
// Rate limit: max 1000 new streams per second per connection
// If exceeded: GOAWAY frame, force new connection

@Bean
public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> streamLimitCustomizer() {
    return factory -> factory.addServerCustomizers(httpServer ->
        httpServer.http2Settings(settings ->
            settings.maxConcurrentStreams(250)
        )
    );
}
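
Both the sizing arithmetic and the abuse guard can be sketched briefly. The code below is an illustration under assumptions: rounding up to a multiple of 50 is a guess at how 240 became 250, and StreamRateLimiter is a hypothetical sliding-window counter, not a feature of Reactor Netty or Spring Boot.

```python
import math
from collections import deque


def stream_limit(page=15, prefetch=10, background=5, safety=2, tabs=4):
    """(page + prefetch + background) * safety margin * tabs sharing a connection."""
    return (page + prefetch + background) * safety * tabs


def round_up(value, step):
    return math.ceil(value / step) * step


class StreamRateLimiter:
    """Allow at most `max_new_streams` stream openings per sliding window."""

    def __init__(self, max_new_streams=1000, window_seconds=1.0):
        self.max_new_streams = max_new_streams
        self.window_seconds = window_seconds
        self.opened = deque()  # timestamps of recently opened streams

    def allow(self, now):
        """Record a new stream at time `now` (seconds); False means send GOAWAY."""
        while self.opened and now - self.opened[0] >= self.window_seconds:
            self.opened.popleft()
        if len(self.opened) >= self.max_new_streams:
            return False
        self.opened.append(now)
        return True


print(stream_limit(), round_up(stream_limit(), 50))
```

A caller would check allow() for each new HEADERS frame that opens a stream and close the connection with GOAWAY on the first False.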

When HTTP/2 Hurts

HTTP/2 is not universally faster. Scenarios where it underperforms HTTP/1.1:

// Scenario 1: Single large download (video, large file)
// HTTP/1.1: Simple byte stream, no framing overhead
// HTTP/2:   Every 16KB wrapped in 9-byte frame header
// Overhead: 9 / 16384 = 0.055% (negligible, but no multiplexing benefit either)
// Verdict: No benefit, no harm

// Scenario 2: TLS certificate chain > 16KB (multiple large certs in chain)
// HTTP/1.1: Send cert, done. Subsequent requests on same connection.
// HTTP/2:   Must complete TLS before SETTINGS exchange
//           Large cert chains delay SETTINGS, delay all streams
// Mitigation: Use ECDSA certs (smaller than RSA), OCSP stapling

// Scenario 3: Lossy network (2%+ packet loss)
// HTTP/1.1: 6 connections, loss on one does not affect others
// HTTP/2:   1 connection, any TCP loss blocks ALL streams
// Impact at 2% loss, 80ms RTT:
//   HTTP/1.1 page load P99: 2,100ms
//   HTTP/2   page load P99: 2,800ms (TCP HOL blocking)
// Solution: HTTP/3 (QUIC eliminates TCP-level HOL blocking)
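
The framing overhead from Scenario 1 is easy to check for a concrete download. The sketch below assumes a hypothetical 1GB file sent in maximum-size frames with default settings. framing_overhead is an illustrative helper and expects a positive payload size.

```python
import math

FRAME_HEADER_BYTES = 9
MAX_FRAME_SIZE = 16_384


def framing_overhead(payload_bytes, max_frame=MAX_FRAME_SIZE):
    """Return (frame count, header bytes, header bytes as % of payload)."""
    frames = math.ceil(payload_bytes / max_frame)
    overhead = frames * FRAME_HEADER_BYTES
    return frames, overhead, overhead / payload_bytes * 100


frames, overhead, percent = framing_overhead(1024 ** 3)  # hypothetical 1GB download
print(frames, overhead, round(percent, 3))
```

For 1GB that comes to 65,536 frames and 589,824 bytes of frame headers, about 0.055% of the payload.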

The content platform’s traffic profile (many small API calls, 15+ resources per page, global users) is the ideal HTTP/2 workload. The 3.3x P50 improvement was confirmed in production: median page load dropped from 1.2s to 360ms after enabling HTTP/2 at the CDN edge.