CPU Throttling: The Silent Latency Killer

The main chapter showed CFS bandwidth control throttling the article service for 40ms per burst, creating P99 spikes at 180ms. This section goes deeper: how CFS period and quota interact, why the default 100ms period is wrong for latency-sensitive services, how to read throttling statistics from the cgroup filesystem, and the concrete Kubernetes configuration that eliminated throttling for the content platform.

CFS Bandwidth Control Internals

The Linux Completely Fair Scheduler uses two parameters for CPU bandwidth enforcement:

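cpu.cfs_period_us:  The time window (microseconds). Default: 100000 (100ms).
cpu.cfs_quota_us:   CPU time allowed per period (microseconds).
(These are the cgroup v1 file names. cgroup v2 stores both values in one file, cpu.max, as "quota period".)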

Relationship to Kubernetes limits:
  limits.cpu: "2"   → quota = 2 × period = 200000us (200ms per 100ms period)
  limits.cpu: "500m" → quota = 0.5 × period = 50000us (50ms per 100ms period)
  limits.cpu: "4"   → quota = 4 × period = 400000us (400ms per 100ms period)

Key rule: Quota is consumed by ALL threads in the cgroup combined.
  8 threads each using 25ms of CPU in a period = 200ms quota consumed.
  1 thread using 200ms of CPU in a period = 200ms quota consumed.
  Both hit the same limit.
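The conversion and the key rule can be sketched in a few lines. The class and method names are illustrative and not part of any Kubernetes or JVM API; the sketch assumes the limit is expressed in millicores ("2" = 2000, "500m" = 500).

```java
// Hypothetical helper: converts a Kubernetes CPU limit into a CFS quota.
final class CfsQuota {

    /** Default CFS period in microseconds (100ms). */
    static final long DEFAULT_PERIOD_US = 100_000;

    /** Quota in microseconds for a limit given in millicores. */
    static long quotaUs(long limitMillicores, long periodUs) {
        return limitMillicores * periodUs / 1000;
    }

    /** CPU time charged to the cgroup: every running thread draws from the same budget. */
    static long consumedUs(int threads, long cpuUsPerThread) {
        return threads * cpuUsPerThread;
    }

    public static void main(String[] args) {
        System.out.println(quotaUs(2000, DEFAULT_PERIOD_US)); // limits.cpu: "2"
        System.out.println(quotaUs(500, DEFAULT_PERIOD_US));  // limits.cpu: "500m"
        // 8 threads x 25ms and 1 thread x 200ms are charged the same amount
        System.out.println(consumedUs(8, 25_000) == consumedUs(1, 200_000));
    }
}
```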

This is the core problem for the JVM. The JVM runs many threads concurrently: application threads, GC threads, JIT compiler threads, Netty I/O threads. During a GC pause, all GC threads run at once, each on its own core if the host has cores to spare. Quota drains at a rate proportional to the number of threads running, not to elapsed wall-clock time: 25 busy threads burn 25ms of quota per millisecond.

Symptom: P99 Spikes at Low Average CPU

The article service processes search queries. Average response time is 12ms. Average CPU is 35%. Operations sees no problem.

Then the P99 SLO breach alert fires. P99 has risen from 30ms to 180ms. No deployment. No traffic change. No dependency slowdown.

Cause: GC Bursts Exhausting CFS Quota

Timeline of a single P99 spike:

  t=0ms:     Period starts. Quota: 200ms available.
  t=0-18ms:  Application threads serve 4 requests (8 threads × 18ms = 144ms quota)
  t=18ms:    G1GC triggers Young Collection
  t=18-38ms: GC pause. 25 ParallelGCThreads active (JVM saw 32 host cores)
             CPU consumed: 25 threads × 20ms = 500ms
             But quota remaining was only 56ms (200ms - 144ms)
             Quota exhausted at t=20.2ms (56ms / 25 threads = 2.2ms into GC)

  t=20.2ms:  CFS THROTTLES the cgroup. All threads frozen.
             GC is mid-pause. Application threads frozen. I/O threads frozen.

  t=100ms:   New period starts. 200ms quota refilled.
  t=100ms:   GC resumes with remaining work (~18ms of 20ms pause remaining)
             25 threads × 18ms = 450ms quota needed
             Quota exhausted again at ~108ms

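  t=200ms:   Third period. 200ms quota refilled.
             GC resumes with ~10ms of pause remaining
             25 threads × 10ms = 250ms quota needed
             Quota exhausted again at ~208ms

  t=300ms:   Fourth period. GC finishes at ~302ms (25 threads × ~2ms = ~50ms quota).
             Application threads resume.

  Total wall time for a 20ms GC pause: ~284ms
  Request that arrived at t=15ms waited: ~287ms

  This is the mechanism behind the P99 spike. The model keeps all 25 GC threads busy for the whole pause, so it is the worst case; a collection that needs less CPU clears in fewer periods.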
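The timeline can be replayed with a small simulation. This is a sketch under the timeline's own simplification: every GC thread stays busy for the whole pause, and nothing else draws quota while the collection runs. ThrottleTimeline and its parameters are illustrative names, not a kernel or JVM API.

```java
// Hypothetical model of a parallel burst running under a CFS quota.
final class ThrottleTimeline {

    /**
     * Wall-clock time (ms) at which the burst finishes.
     *
     * @param startMs          when the burst starts
     * @param remainingQuotaMs quota left in the period that contains startMs
     * @param periodMs         CFS period length
     * @param quotaMs          quota granted at the start of each period
     * @param threads          threads running in parallel during the burst
     * @param workMsPerThread  unthrottled duration of the burst
     */
    static double finishTimeMs(double startMs, double remainingQuotaMs,
                               double periodMs, double quotaMs,
                               int threads, double workMsPerThread) {
        double now = startMs;
        double budget = remainingQuotaMs;
        double workLeft = workMsPerThread;
        while (true) {
            double periodEnd = (Math.floor(now / periodMs) + 1) * periodMs;
            // Wall time the burst can run before the budget or the period ends
            double runnable = Math.min(budget / threads, periodEnd - now);
            if (workLeft <= runnable) {
                return now + workLeft;
            }
            workLeft -= runnable;
            now = periodEnd;   // frozen until the next refill
            budget = quotaMs;
        }
    }

    public static void main(String[] args) {
        // Inputs from the timeline: GC starts at t=18ms with 56ms of quota left,
        // 100ms period, 200ms quota, 25 GC threads, 20ms pause.
        double finish = finishTimeMs(18, 56, 100, 200, 25, 20);
        System.out.printf("GC finishes at t=%.1fms%n", finish);
    }
}
```

Changing the thread count or the period in the call shows how each input moves the finish time.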

Benchmark: Measuring the Throttle

# Step 1: Record throttling stats before load test
cat /sys/fs/cgroup/cpu.stat > /tmp/before.txt

# Step 2: Run load test (wrk2 at constant 5000 RPS for 60 seconds)
wrk2 -t4 -c200 -d60s -R5000 --latency 'http://localhost:8080/api/search?q=java'

# Step 3: Record throttling stats after
cat /sys/fs/cgroup/cpu.stat > /tmp/after.txt

# Step 4: Compare the counters (the throttling rate is computed from the deltas)
diff /tmp/before.txt /tmp/after.txt

Content platform article service results:

Before load test:
  nr_periods 1000000
  nr_throttled 28500
  throttled_usec 412500000

After load test (60 seconds later):
  nr_periods 1000600    (+600 periods = 60 seconds)
  nr_throttled 28519    (+19 throttled periods)
  throttled_usec 413250000  (+750ms total throttle time)

Throttle rate during test: 19/600 = 3.2% of periods
Average throttle duration: 750ms / 19 = 39.5ms per event
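Share of wall time throttled: 750ms / 60s = 1.25%
Estimated P99 impact: ~39.5ms added to requests in flight during a throttled period; with more than 1% of wall time throttled, those requests land in the P99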
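diff only shows which counters changed; the rates still have to be worked out by hand. A sketch of that arithmetic, assuming the cgroup v2 field names used here (nr_periods, nr_throttled, throttled_usec). ThrottleDelta is a hypothetical helper; it takes file contents as strings so it can be fed from the two snapshot files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: derives throttle rates from two cpu.stat snapshots.
final class ThrottleDelta {

    /** Parses "key value" lines; lines in any other shape are ignored. */
    static Map<String, Long> parse(String cpuStat) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : cpuStat.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    private static long delta(Map<String, Long> before, Map<String, Long> after, String key) {
        return after.getOrDefault(key, 0L) - before.getOrDefault(key, 0L);
    }

    /** Percentage of elapsed periods in which the cgroup was throttled. */
    static double throttledPeriodPercent(Map<String, Long> before, Map<String, Long> after) {
        long periods = delta(before, after, "nr_periods");
        return periods == 0 ? 0.0 : 100.0 * delta(before, after, "nr_throttled") / periods;
    }

    /** Average throttle time per throttled period, in milliseconds. */
    static double avgThrottleMs(Map<String, Long> before, Map<String, Long> after) {
        long throttled = delta(before, after, "nr_throttled");
        return throttled == 0 ? 0.0 : delta(before, after, "throttled_usec") / 1000.0 / throttled;
    }

    public static void main(String[] args) {
        // Counter values from the article service run
        Map<String, Long> before = parse(
            "nr_periods 1000000\nnr_throttled 28500\nthrottled_usec 412500000");
        Map<String, Long> after = parse(
            "nr_periods 1000600\nnr_throttled 28519\nthrottled_usec 413250000");
        System.out.printf("throttled periods: %.1f%%%n", throttledPeriodPercent(before, after));
        System.out.printf("avg throttle: %.1fms%n", avgThrottleMs(before, after));
    }
}
```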

Continuous Monitoring

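// Prometheus metric exporter for CFS throttling
// Reads cgroup v2 cpu.stat every 5 seconds
// Requires the Prometheus Java simpleclient and Spring scheduling (@EnableScheduling)
import io.prometheus.client.Counter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CfsThrottleMetrics {

    // Denominator for the throttle-rate alert
    private final Counter periods = Counter.build()
        .name("container_cpu_cfs_periods_total")
        .help("Number of elapsed CFS enforcement periods")
        .register();

    private final Counter throttledPeriods = Counter.build()
        .name("container_cpu_throttled_periods_total")
        .help("Number of CFS periods where CPU was throttled")
        .register();

    private final Counter throttledTime = Counter.build()
        .name("container_cpu_throttled_seconds_total")
        .help("Total time CPU was throttled in seconds")
        .register();

    // False until the first sample is taken. A flag is needed because
    // zero is a valid reading for a container that has never been throttled.
    private boolean initialized = false;
    private long lastPeriods = 0;
    private long lastThrottledCount = 0;
    private long lastThrottledUsec = 0;

    @Scheduled(fixedRate = 5000)
    public void collectThrottleMetrics() {
        try {
            // cgroup v2 path
            Map<String, Long> stats = parseCpuStat("/sys/fs/cgroup/cpu.stat");

            long currentPeriods = stats.getOrDefault("nr_periods", 0L);
            long currentThrottled = stats.getOrDefault("nr_throttled", 0L);
            long currentThrottledUsec = stats.getOrDefault("throttled_usec", 0L);

            // Export deltas only. Math.max guards against a negative delta,
            // which Counter.inc() rejects with an exception.
            if (initialized) {
                periods.inc(Math.max(0, currentPeriods - lastPeriods));
                throttledPeriods.inc(Math.max(0, currentThrottled - lastThrottledCount));
                throttledTime.inc(
                    Math.max(0, currentThrottledUsec - lastThrottledUsec) / 1_000_000.0
                );
            }

            initialized = true;
            lastPeriods = currentPeriods;
            lastThrottledCount = currentThrottled;
            lastThrottledUsec = currentThrottledUsec;
        } catch (IOException e) {
            // cgroup filesystem not available (not in container)
        }
    }

    // Parses "key value" lines from cpu.stat into a map
    private Map<String, Long> parseCpuStat(String path) throws IOException {
        Map<String, Long> stats = new HashMap<>();
        for (String line : Files.readAllLines(Path.of(path))) {
            String[] parts = line.split(" ");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }
}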

Alert rule:

# Prometheus alert: throttling exceeding 1% of periods
# (the article service breached its P99 SLO at roughly 3% of periods,
#  so a 5% threshold would not have fired)
groups:
  - name: container-cpu
    rules:
      - alert: CpuThrottlingHigh
        expr: |
          rate(container_cpu_throttled_periods_total[5m])
          / rate(container_cpu_cfs_periods_total[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU throttling > 1%"
          description: "{{ $labels.pod }} throttled {{ $value | humanizePercentage }} of periods"

Fix 1: Reduce GC Thread Count

The main chapter showed the fix: set ParallelGCThreads and ConcGCThreads to match the container CPU limit, not the host core count. Here is the detailed benchmark:

# Test matrix: GC thread count vs throttling and latency
# Setup: article-service, 2-core container, 4GB memory, 500MB live heap
# Load: wrk2 at 5000 RPS constant, 60 seconds

# Test 1: Default GC threads (JVM sees 32 host cores)
java -Xmx2g -jar article-service.jar
# ParallelGCThreads=25, ConcGCThreads=6
# Results:
#   P50: 12.1ms   P99: 178ms   P99.9: 312ms
#   Throttled periods: 3.2%   Avg throttle: 39ms
#   GC pause avg: 18ms   GC pause max: 35ms

# Test 2: GC threads = CPU limit (2)
java -Xmx2g -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1 \
     -jar article-service.jar
# Results:
#   P50: 12.3ms   P99: 52ms    P99.9: 78ms
#   Throttled periods: 0.1%   Avg throttle: 8ms
#   GC pause avg: 42ms   GC pause max: 65ms

# Test 3: GC threads = 2× CPU limit (4)
java -Xmx2g -XX:ParallelGCThreads=4 -XX:ConcGCThreads=2 \
     -jar article-service.jar
# Results:
#   P50: 12.2ms   P99: 38ms    P99.9: 62ms
#   Throttled periods: 0.4%   Avg throttle: 12ms
#   GC pause avg: 28ms   GC pause max: 48ms

Results summary:

  GC Threads    GC Pause    Throttle%    P99     P99.9
  ─────────────────────────────────────────────────────
  25 (default)  18ms        3.2%         178ms   312ms
  4 (2× limit)  28ms        0.4%         38ms    62ms
  2 (= limit)   42ms        0.1%         52ms    78ms

Best choice: 4 GC threads (2× CPU limit)
  - GC pauses stay short (28ms vs 42ms)
  - Throttling nearly eliminated (0.4%)
  - P99 is lowest at 38ms
  - P99.9 is lowest at 62ms

The sweet spot is 2x the CPU limit, not 1x. At 1x (2 threads), GC pauses are too long (42ms) and the P99 is dominated by pause time rather than throttling. At 2x (4 threads), GC pauses are shorter (28ms) and the small amount of throttling (0.4%) adds less than the reduced GC pause saves.
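Before and after changing the flags, it helps to confirm what the running JVM actually chose. A minimal sketch, assuming a HotSpot JVM: HotSpotDiagnosticMXBean is a HotSpot-specific management interface and is not available on other VMs. JvmCpuView is an illustrative name.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic: prints the CPU count and GC thread counts the JVM settled on.
final class JvmCpuView {

    /** Current value of a HotSpot VM option, as a string. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors = "
            + Runtime.getRuntime().availableProcessors());
        System.out.println("ParallelGCThreads   = " + vmOption("ParallelGCThreads"));
        System.out.println("ConcGCThreads       = " + vmOption("ConcGCThreads"));
    }
}
```

Run it inside the container with the same flags as the service; if availableProcessors reports the host core count, the GC thread defaults will be sized for the host as well.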

Fix 2: Increase the CPU Limit (Allow Bursting)

If GC thread tuning is insufficient, increase the CPU limit to accommodate bursts:

# Before: tight limit triggers throttling
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"

# After: higher limit allows GC bursts
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "4"

Impact on throttling:

  Limit=2 cores (200ms quota):
    GC burst (4 threads × 28ms) = 112ms quota
    App threads (8 threads × 10ms) = 80ms quota
    Total: 192ms — barely fits, any variance causes throttle

  Limit=4 cores (400ms quota):
    GC burst (4 threads × 28ms) = 112ms quota
    App threads (8 threads × 10ms) = 80ms quota
    Total: 192ms — 208ms headroom, no throttle
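The headroom arithmetic is easy to rerun for other limits and thread counts. A sketch, assuming the GC burst and the application work fall in the same period; QuotaHeadroom is a hypothetical helper.

```java
// Hypothetical helper: quota left in one period after a GC burst plus application work.
final class QuotaHeadroom {

    /** Headroom in ms of CPU time; a negative result means the cgroup is throttled. */
    static long headroomMs(double limitCores, long periodMs,
                           int gcThreads, long gcPauseMs,
                           int appThreads, long appCpuMs) {
        long quota = Math.round(limitCores * periodMs);
        long demand = (long) gcThreads * gcPauseMs + (long) appThreads * appCpuMs;
        return quota - demand;
    }

    public static void main(String[] args) {
        System.out.println(headroomMs(2, 100, 4, 28, 8, 10));  // limit = 2 cores
        System.out.println(headroomMs(4, 100, 4, 28, 8, 10));  // limit = 4 cores
        System.out.println(headroomMs(2, 100, 25, 18, 8, 10)); // default GC threads
    }
}
```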

Trade-off: A higher limit does not change how many pods fit on a node, because the scheduler places pods by requests and requests stay at 1 core. The cost is overcommit: the sum of limits on a node can exceed its cores, so pods that burst at the same time compete for CPU. Bursts happen a few percent of the time, so collisions are infrequent but possible.

Fix 3: Remove CPU Limits

The most aggressive fix. Set no CPU limit and rely on requests for scheduling:

resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    # No cpu limit
    memory: "4Gi"

# Verification: no CFS quota enforcement
cat /sys/fs/cgroup/cpu.max
# max 100000
# "max" means no quota (unlimited)

# Throttle stats will show zero growth:
cat /sys/fs/cgroup/cpu.stat
# nr_throttled 0
# throttled_usec 0
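The same check can be done from code, for example in a startup log line. A sketch that parses the cgroup v2 cpu.max format ("quota period", with "max" meaning no quota); CpuMax is a hypothetical helper and takes the file content as a string.

```java
// Hypothetical helper: turns the content of cpu.max into an effective CPU limit.
final class CpuMax {

    /** Limit in cores, or Double.POSITIVE_INFINITY when no quota is set. */
    static double limitCores(String cpuMaxContent) {
        String[] parts = cpuMaxContent.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unexpected cpu.max content: " + cpuMaxContent);
        }
        if (parts[0].equals("max")) {
            return Double.POSITIVE_INFINITY;
        }
        return Double.parseDouble(parts[0]) / Double.parseDouble(parts[1]);
    }

    public static void main(String[] args) {
        System.out.println(limitCores("max 100000"));    // no CPU limit
        System.out.println(limitCores("200000 100000")); // limits.cpu: "2"
    }
}
```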

Content platform results after removing CPU limits:

  Metric          With 2-core limit    No CPU limit    Change
  ──────────────────────────────────────────────────────────────
  P50             12.1ms               11.8ms          -2.5%
  P99             178ms                27ms            -84.8%
  P99.9           312ms                45ms            -85.6%
  Throttle rate   3.2%                 0%              eliminated
  Avg CPU usage   0.7 cores            0.7 cores       unchanged
  Peak CPU usage  2.0 cores (capped)   3.8 cores       burst allowed
  GC pause avg    18ms                 15ms            -16.7%
  GC pause max    35ms                 22ms            -37.1%

P99 dropped from 178ms to 27ms. Peak CPU bursts to 3.8 cores during GC, but only for 15ms. Average CPU is unchanged. The node is not overloaded because requests-based scheduling prevents overcommit at the average level.

Fix 4: Tune the CFS Period

For workloads where CPU limits are required (multi-tenant clusters, cost allocation), reducing the CFS period reduces the maximum throttle duration:

Default period: 100ms, quota 200ms (2 cores)
  Worst case throttle: up to 100ms (entire remaining period)

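Reduced period: 10ms, quota 20ms (still 2 cores)
  Worst case throttle: up to 10ms (shorter period = shorter max throttle)
  But: a shorter period does not add CPU. The GC burst from Fix 1
       (4 threads × 28ms = 112ms quota) still exceeds the 20ms budget:
       4 threads drain it in 5ms, stall for 5ms, then continue.
       The burst spreads across ~6 periods (112ms / 20ms = 5.6) and
       finishes in roughly 55ms of wall time.
       Throttling happens more often, but each stall is a few milliseconds
       instead of running to the next 100ms boundary.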
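How long the cgroup stalls in each period depends on how fast the running threads drain the quota. A sketch of that relationship, assuming the burst keeps the given number of threads runnable for the whole period and nothing else draws quota. PeriodStall is an illustrative name.

```java
// Hypothetical helper: stall time per CFS period during a sustained parallel burst.
final class PeriodStall {

    /** Milliseconds per period that the cgroup spends throttled. */
    static double stallPerPeriodMs(double periodMs, double limitCores, int threads) {
        double quotaMs = limitCores * periodMs;
        // Wall time the threads can run before the period's budget is gone
        double runMs = Math.min(periodMs, quotaMs / threads);
        return periodMs - runMs;
    }

    public static void main(String[] args) {
        for (double period : new double[] {100, 50, 20, 10, 5}) {
            System.out.printf("period=%.0fms  stall with 4 threads=%.1fms%n",
                period, stallPerPeriodMs(period, 2, 4));
        }
    }
}
```

The stalled fraction of each period is the same at every period length; only the length of each individual stall shrinks.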

# Set CFS period to 10ms (requires host-level access or Kubernetes kubelet config)
# In Kubernetes, set via kubelet --cpu-cfs-quota-period=10ms
# (node-wide: applies to every container with a CPU limit on that node,
#  and depends on the CustomCPUCFSQuotaPeriod feature gate)
# Or per-container via cgroup v2:
echo 20000 10000 > /sys/fs/cgroup/cpu.max
# Format: quota_us period_us
# 20000us quota per 10000us period = 2 cores

Period tuning benchmark (2-core limit, article service):

  Period    Max Throttle    P99      P99.9
  ─────────────────────────────────────────
  100ms     39ms            178ms    312ms
  50ms      22ms            95ms     155ms
  20ms      11ms            52ms     88ms
  10ms      6ms             38ms     65ms
  5ms       3ms             31ms     52ms

Trade-off: Shorter periods increase scheduler overhead. At 5ms periods, the kernel refills the quota and unthrottles each cgroup 200 times per second instead of 10. On hosts with hundreds of containers, this adds measurable kernel CPU overhead (0.5-1% of a core per container). The 10-20ms range provides the best balance for latency-sensitive services.

Requests vs Limits: The Complete Decision Framework

Decision matrix for CPU resource configuration:

  Workload Type              Requests    Limits    Rationale
  ──────────────────────────────────────────────────────────────────────────
  Latency-sensitive Java     avg usage   none      GC/JIT need burst headroom
  Batch processing           avg usage   2× avg    prevent runaway, no latency SLO
  Sidecar (envoy, fluentd)   measured    measured  predictable, no bursts
  CronJob                    peak        peak      runs briefly, needs all resources
  Multi-tenant (billing)     avg usage   allocated CPU limit tied to cost allocation

  Content platform services:
    article-service:   requests=1,  limits=none    (latency-critical)
    search-indexer:    requests=2,  limits=4       (batch, bounded burst)
    nginx-proxy:       requests=200m, limits=500m  (predictable, low burst)
    recommendation:    requests=1,  limits=none    (latency-critical)
    analytics-writer:  requests=500m, limits=1     (background, bounded)

Proof: Before and After

Production measurements for the content platform article service, taken over 7 days before and 7 days after removing CPU limits and tuning GC threads:

Before (2-core limit, default GC threads):
  P50:  12ms (stable)
  P99:  45-210ms (fluctuates with GC frequency)
  P99.9: 180-450ms
  Error rate: 0.02% (OOM-adjacent restarts)
  CFS throttled periods: 2.8-4.1% (varies with traffic)
  Pod restarts/day: 0.3 (occasional OOM kill)

After (no CPU limit, ParallelGCThreads=4, ConcGCThreads=2):
  P50:  11.8ms (stable)
  P99:  22-30ms (stable, only GC pause)
  P99.9: 35-55ms
  Error rate: 0.001%
  CFS throttled periods: 0%
  Pod restarts/day: 0

  P99 improvement: 6-7× reduction
  P99.9 improvement: 5-8× reduction
  Stability: throttle-related variance eliminated entirely

The configuration that achieved this:

# Final JVM flags for article-service container
java \
  -XX:+UseContainerSupport \
  -XX:ActiveProcessorCount=2 \
  -XX:ParallelGCThreads=4 \
  -XX:ConcGCThreads=2 \
  -XX:CICompilerCount=2 \
  -XX:MaxRAMPercentage=50.0 \
  -XX:MaxMetaspaceSize=256m \
  -XX:ReservedCodeCacheSize=128m \
  -XX:MaxDirectMemorySize=256m \
  -Xss512k \
  -Xlog:gc*:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=10m \
  -jar article-service.jar

# Kubernetes resource spec
resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    memory: "4Gi"
    # No CPU limit