fast by design

The JIT Traps That Invalidate Naive Benchmarks

Each JIT optimization has a specific mechanism and a specific countermeasure. This section examines four optimizations that routinely produce incorrect benchmark results, with JMH benchmarks that demonstrate both the broken and correct versions.

Dead Code Elimination: The Compiler Removes Your Work

Dead code elimination (DCE) removes code that computes a value never used by the program. The compiler’s reasoning is sound: if no observable behavior depends on a computation, the computation is unnecessary.

A benchmark is a program that exists solely to observe computation. The compiler does not know this.

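// The benchmark classes in this section all need these imports.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)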
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class DeadCodeTrap {

    private double x = Math.PI;

    // BROKEN: Appears fast because C2 eliminates the computation
    @Benchmark
    public void baseline_broken() {
        // The result of Math.log is never used.
        // C2 may eliminate the entire call.
        Math.log(x);
    }

    // CORRECT: Returning the result keeps the computation alive
    @Benchmark
    public double baseline_correct_return() {
        return Math.log(x);
    }

    // CORRECT: Blackhole consumes the result
    @Benchmark
    public void baseline_correct_blackhole(Blackhole bh) {
        bh.consume(Math.log(x));
    }
}

Run this benchmark and observe the results:

Benchmark                                Mode  Cnt   Score    Error  Units
DeadCodeTrap.baseline_broken             avgt   10   0.347 ±  0.012  ns/op
DeadCodeTrap.baseline_correct_return     avgt   10  23.891 ±  0.542  ns/op
DeadCodeTrap.baseline_correct_blackhole  avgt   10  24.102 ±  0.487  ns/op

The broken version reports 0.3ns, which is roughly one clock cycle. Math.log() cannot execute in that time: a floating-point logarithm costs tens of cycles even when the JIT uses an optimized intrinsic. The 0.3ns is what remains of the benchmark method after DCE has emptied its body.

The correct versions report approximately 24ns, which is the actual cost of Math.log() on the test hardware.

Verifying DCE with JVM Flags

You can confirm the compiler is eliminating your code by examining the JIT compilation log:

java -XX:+UnlockDiagnosticVMOptions \
     -XX:+PrintCompilation \
     -XX:+PrintInlining \
     -jar target/benchmarks.jar DeadCodeTrap.baseline_broken \
     -f 1 -wi 3 -i 1

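These flags show whether the method was compiled and how the call to Math.log was handled. A line like the following (exact formatting varies by JDK version) indicates that C2 treated the call as an intrinsic:

@ 5   java.lang.Math::log (5 bytes)   (intrinsic)

The compilation log does not report which computations were eliminated. To confirm DCE directly, inspect the generated machine code, for example with -XX:+PrintAssembly (which requires the hsdis disassembler plugin) or JMH's perfasm profiler (-prof perfasm, Linux with perf installed), and check that no logarithm computation remains in the compiled benchmark method.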

DCE in the Content Platform

Dead code elimination is not just a benchmark trap. It affects performance measurements in application code:

// BROKEN: Serializes the article but discards the result
// (throws clause assumes mapper is a Jackson ObjectMapper)
public void warmupSerializer() throws JsonProcessingException {
    for (int i = 0; i < 10_000; i++) {
        mapper.writeValueAsBytes(article);
        // Result unused. Once this loop is compiled, the JIT may drop any
        // part of the work it can prove has no observable effect, so the
        // "warmup" may not exercise the serialization path you intended.
    }
}

// CORRECT: Make every iteration's result observable
public void warmupSerializer() throws JsonProcessingException {
    long totalBytes = 0;
    for (int i = 0; i < 10_000; i++) {
        totalBytes += mapper.writeValueAsBytes(article).length;
    }
    // Use the accumulated value so the compiler cannot eliminate the loop
    if (totalBytes == 0) {
        throw new IllegalStateException("serializer produced no output");
    }
}

Constant Folding: The Compiler Precomputes Your Input

Constant folding evaluates expressions with known-constant operands at compile time. If the compiler can prove that the inputs are constants and that the function has no side effects, it replaces the function call with the precomputed result.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class ConstantFoldingTrap {

    // This field is effectively constant after setup
    // But the JIT cannot prove it across @State boundaries
    private int articleCount;
    private double weight;

    @Setup
    public void setup() {
        articleCount = 1000;
        weight = 0.75;
    }

    // BROKEN: Literal constants that the compiler folds
    @Benchmark
    public double score_folded() {
        // C2 sees: all inputs are compile-time constants
        // It may compute the result at compile time
        return Math.log(1000) * 0.75 + Math.sqrt(1000);
    }

    // CORRECT: State fields that the compiler cannot fold
    @Benchmark
    public double score_correct() {
        // C2 cannot prove that articleCount and weight
        // will not change between invocations
        return Math.log(articleCount) * weight + Math.sqrt(articleCount);
    }
}

Expected results:

Benchmark                          Mode  Cnt   Score    Error  Units
ConstantFoldingTrap.score_folded   avgt   10   1.823 ±  0.041  ns/op
ConstantFoldingTrap.score_correct  avgt   10  48.217 ±  1.134  ns/op

The folded version reports 1.8ns because it loads a precomputed constant. The correct version reports 48ns because it actually computes a logarithm and a square root on every call. The difference is about 26x.
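To make the 26x gap concrete, here is a hand-written approximation of what the two benchmark methods amount to once folding has happened. It is a sketch, not compiler output: the class and method names are hypothetical, and the static final field merely stands in for the constant C2 could embed in the compiled code.

```java
class FoldedEquivalent {

    // Stand-in for the value the JIT could precompute: evaluated once at
    // class initialization instead of on every call (approximately 36.80).
    private static final double PRECOMPUTED =
            Math.log(1000) * 0.75 + Math.sqrt(1000);

    // Roughly what score_folded does after folding: load a constant.
    public static double scoreFolded() {
        return PRECOMPUTED;
    }

    // What score_correct must do: the inputs arrive at run time, so the
    // logarithm and the square root execute on every invocation.
    public static double scoreFromInputs(int articleCount, double weight) {
        return Math.log(articleCount) * weight + Math.sqrt(articleCount);
    }

    public static void main(String[] args) {
        System.out.println(scoreFolded());
        System.out.println(scoreFromInputs(1000, 0.75));
    }
}
```

Both methods return the same number. Only the second one does the work the benchmark claims to measure.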

Partial Constant Folding

The compiler can partially fold expressions. If one operand is constant and another is variable, it may precompute only the constant subexpressions:

@Benchmark
public double partial_fold() {
    // Math.log(1000) has a constant argument, so C2 may fold it to 6.907755278982137.
    // Only the multiplication, the square root, and the addition run at runtime.
    return Math.log(1000) * weight + Math.sqrt(articleCount);
}

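@State fields prevent constant folding because they are non-final instance fields of a heap-allocated object, and their values can change between invocations. The JIT cannot prove that no other code modifies the field through another reference, so it cannot treat the value as a constant. This opacity is by design. Declaring such an input static final reintroduces the trap, because the JIT treats static final values as constants.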

Loop Unrolling: The Compiler Eliminates Loop Overhead

Loop unrolling replicates the loop body multiple times to reduce per-iteration overhead (the counter update and the back-edge branch) and to enable vectorization. The C2 compiler unrolls counted loops, those with a simple induction variable and a loop-invariant bound, and can transform the loop body to process multiple elements per iteration.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class LoopUnrollingTrap {

    private double[] embeddings;

    @Setup
    public void setup() {
        embeddings = new double[512];
        var random = new java.util.Random(42);
        for (int i = 0; i < embeddings.length; i++) {
            embeddings[i] = random.nextDouble();
        }
    }

    // The JIT will unroll this loop and may vectorize it.
    // In a naive benchmark, the loop iterations interact in ways
    // that don't represent real usage.
    @Benchmark
    public double sumEmbeddings() {
        double sum = 0;
        for (int i = 0; i < embeddings.length; i++) {
            sum += embeddings[i];
        }
        return sum;
    }

    // Compare against a manual unroll to see if JIT matches it.
    // Assumes embeddings.length is a multiple of 4 (it is 512 here).
    @Benchmark
    public double sumEmbeddingsManualUnroll() {
        double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
        for (int i = 0; i < embeddings.length; i += 4) {
            sum0 += embeddings[i];
            sum1 += embeddings[i + 1];
            sum2 += embeddings[i + 2];
            sum3 += embeddings[i + 3];
        }
        return sum0 + sum1 + sum2 + sum3;
    }
}

Loop unrolling itself is not a problem for JMH benchmarks because you do not write the measurement loop. JMH generates it, and the generated loop is driven by a timer flag rather than a fixed iteration count. The trap occurs in naive benchmarks where the user’s measurement loop gets unrolled and merged with the measured computation, making it impossible to attribute time to individual iterations.
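For contrast, this is the shape of the naive benchmark that paragraph warns about. It is a deliberately flawed sketch with hypothetical names: the measurement loop is an ordinary counted loop, so after inlining the JIT is free to unroll and reschedule it together with the code under test.

```java
class NaiveTimingLoop {

    // The code we would like to measure.
    public static double sum(double[] data) {
        double s = 0;
        for (double v : data) {
            s += v;
        }
        return s;
    }

    // Returns the average nanoseconds per call as seen by the naive loop.
    public static double naiveNanosPerCall(double[] data, int reps) {
        double sink = 0;
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) { // counted loop: a candidate for unrolling
            sink += sum(data);           // likely inlined into the loop above
        }
        long elapsed = System.nanoTime() - start;
        if (Double.isNaN(sink)) {        // keeps sink live so the work is not dead code
            throw new IllegalStateException("unexpected NaN");
        }
        return (double) elapsed / reps;
    }

    public static void main(String[] args) {
        double[] data = new double[512];
        java.util.Arrays.fill(data, 1.0);
        System.out.println(naiveNanosPerCall(data, 100_000) + " ns/call (not trustworthy)");
    }
}
```

The sink avoids dead code elimination, but nothing separates the timing loop from the measured work, and the early repetitions run in the interpreter. The number it prints is an average over an unknown mix of compilation states.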

The content platform implication: when benchmarking the recommendation scorer’s dot product computation, the JIT will vectorize the loop using SIMD instructions (AVX2 on modern x86). This is desirable. You want the benchmark to measure the vectorized version because that is what runs in production. JMH allows this optimization while preventing the measurement infrastructure from being optimized away.

# Verify vectorization with JVM flags (printing assembly requires the hsdis disassembler plugin)
java -XX:+UnlockDiagnosticVMOptions \
     -XX:+PrintAssembly \
     -XX:CompileCommand=print,*LoopUnrollingTrap.sumEmbeddings \
     -jar target/benchmarks.jar LoopUnrollingTrap.sumEmbeddings \
     -f 1 -wi 3 -i 1 2>&1 | grep -i 'vadd\|vmul'

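If you see vaddpd (vector add packed double) instructions, the JIT has vectorized the loop. The grep pattern also matches vaddsd, the scalar form, which does not indicate vectorization. The vectorized version is the one you want to measure. Manual loop unrolling is usually unnecessary when the JIT does it for you, but sumEmbeddingsManualUnroll is not an exact equivalent of sumEmbeddings: its four accumulators change the order of the floating-point additions, which the JIT is not allowed to do on its own, so the two methods can return slightly different sums.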
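If you do keep a manually unrolled variant for comparison, it needs a tail loop to be correct for lengths that are not a multiple of four. The following is a minimal sketch with hypothetical class and method names, not code from the content platform:

```java
class UnrolledSum {

    // Reference implementation: one accumulator, strict left-to-right order.
    public static double sumSimple(double[] a) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Four independent accumulators plus a tail loop for leftover elements.
    public static double sumUnrolled4(double[] a) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int limit = a.length - (a.length % 4); // largest multiple of 4 <= length
        int i = 0;
        for (; i < limit; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double sum = (s0 + s1) + (s2 + s3);
        for (; i < a.length; i++) {            // 0 to 3 remaining elements
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumSimple(a));
        System.out.println(sumUnrolled4(a));
    }
}
```

For arbitrary doubles the two methods can differ in the last bits because the additions happen in a different order, so compare them with a tolerance rather than with ==.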

Escape Analysis: The Compiler Eliminates Your Objects

Escape analysis determines whether an object’s reference escapes the method where it was created. An object escapes if:

  1. It is returned from the method
  2. It is stored in a field of a heap object
  3. It is passed to a method that the JIT cannot inline
  4. Its reference is stored in an array that escapes

If an object does not escape, the JIT applies scalar replacement: the object’s fields become local variables, and the allocation is eliminated entirely.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class EscapeAnalysisTrap {

    private long articleId = 42L;
    private int viewCount = 1000;
    private double recencyWeight = 0.8;

    // Object does not escape, so the allocation may be eliminated.
    @Benchmark
    public double score_eliminated() {
        // ArticleScore is created and consumed within this method.
        // If ArticleScore is small and the constructor is inlined,
        // C2 replaces the object with its fields as local variables.
        ArticleScore score = new ArticleScore(articleId, viewCount, recencyWeight);
        return score.compute();
    }

    // Forces allocation by returning the object
    @Benchmark
    public ArticleScore score_allocated() {
        return new ArticleScore(articleId, viewCount, recencyWeight);
    }

    // Alternative: Force allocation using Blackhole
    @Benchmark
    public void score_blackhole(Blackhole bh) {
        bh.consume(new ArticleScore(articleId, viewCount, recencyWeight));
    }

    record ArticleScore(long articleId, int viewCount, double recencyWeight) {
        double compute() {
            return Math.log1p(viewCount) * recencyWeight;
        }
    }
}

The distinction between score_eliminated and score_allocated matters when you want to benchmark the allocation cost versus the computation cost. If you want to measure computation only (how fast is compute()?), scalar replacement is fine because production code may also benefit from it. If you want to measure allocation cost (how much does creating the object cost?), you need to prevent scalar replacement.
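Conceptually, scalar replacement turns score_eliminated into code that never creates the object. The sketch below writes that transformation out by hand. The class and method names are hypothetical, and the second method only approximates what C2 produces after inlining the constructor and compute():

```java
class ScalarReplacementSketch {

    record ArticleScore(long articleId, int viewCount, double recencyWeight) {
        double compute() {
            return Math.log1p(viewCount) * recencyWeight;
        }
    }

    // As written in the source: allocate, use, discard.
    public static double scoreWithObject(long articleId, int viewCount, double recencyWeight) {
        ArticleScore score = new ArticleScore(articleId, viewCount, recencyWeight);
        return score.compute();
    }

    // Hand-written approximation of the same method after scalar replacement:
    // the record's fields become plain local variables and no object exists.
    public static double scoreScalarReplaced(long articleId, int viewCount, double recencyWeight) {
        int scoreViewCount = viewCount;
        double scoreRecencyWeight = recencyWeight;
        // articleId is never read by compute(), so it disappears entirely.
        return Math.log1p(scoreViewCount) * scoreRecencyWeight;
    }

    public static void main(String[] args) {
        System.out.println(scoreWithObject(42L, 1000, 0.8));
        System.out.println(scoreScalarReplaced(42L, 1000, 0.8));
    }
}
```

Both methods compute the same value. A benchmark of the first one measures the second once escape analysis succeeds, which is why score_eliminated says nothing about allocation cost.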

When Escape Analysis Helps in Production

Escape analysis is not just a benchmark trap. It is a genuine optimization that the JIT applies to production code. In the content platform, temporary objects created within a request handler often do not escape:

// This allocation may be eliminated in production
public double scoreArticle(long articleId, int views, double weight) {
    // The ArticleScore never leaves this method. If the JIT inlines the
    // constructor and compute(), escape analysis can prove that, and
    // scalar replacement eliminates the allocation.
    ArticleScore score = new ArticleScore(articleId, views, weight);
    return score.compute();
}

When profiling allocation with async-profiler, remember that scalar-replaced objects do not appear in the allocation profile because they were never allocated on the heap. If your allocation profile shows lower allocation rates than expected, escape analysis may be eliminating allocations. This is a feature, not a bug. The JIT is doing its job.
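A quick way to cross-check an allocation profile from inside the JVM is to read the per-thread allocation counter around a piece of code. The sketch below is not how async-profiler works. It assumes a HotSpot-based JVM, where the platform ThreadMXBean can be cast to com.sun.management.ThreadMXBean, and its class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;

class AllocationProbe {

    // Written by the example task so the array below provably escapes.
    public static byte[] escaped;

    // Bytes allocated by the current thread while running the task.
    // Assumption: HotSpot, with thread allocation accounting enabled.
    public static long allocatedBytes(Runnable task) {
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = threads.getThreadAllocatedBytes(id);
        task.run();
        return threads.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long bytes = allocatedBytes(() -> escaped = new byte[1_000_000]);
        System.out.println("escaping array: " + bytes + " bytes");
    }
}
```

An object stored in a static field always shows up in this counter. A scalar-replaced object does not, but only once the surrounding code has been compiled by C2, so a single cold call like the one in main will still report the allocation.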

Combining Traps: The Realistic Benchmark

Real benchmarks face multiple traps simultaneously. Here is a benchmark for the content platform’s article scoring function that avoids all four traps:

@BenchmarkMode({Mode.AverageTime, Mode.SampleTime})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class ArticleScoringBenchmark {

    // @State fields prevent constant folding
    private Article article;
    private double[] userEmbedding;
    private double[] articleEmbedding;
    private RecommendationScorer scorer;

    @Setup(Level.Trial)
    public void setup() {
        var random = new java.util.Random(42);

        article = new Article(
            random.nextLong(), "Test Article",
            "A".repeat(5000), "test-article",
            List.of("java"), List.of("perf"),
            1L, Instant.now(), Instant.now()
        );

        userEmbedding = random.doubles(512).toArray();
        articleEmbedding = random.doubles(512).toArray();
        scorer = new RecommendationScorer();
    }

    // Returning the result prevents DCE
    @Benchmark
    public double scoreRecommendation() {
        return scorer.score(article, userEmbedding, articleEmbedding);
    }

    // Use Blackhole for multiple results
    @Benchmark
    public void scoreMultipleCandidates(Blackhole bh) {
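        // Each result is consumed, so none of the ten calls is dead code.
        // The inputs are identical on every iteration; score distinct
        // candidates if you need a realistic workload.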
        for (int i = 0; i < 10; i++) {
            bh.consume(scorer.score(article, userEmbedding, articleEmbedding));
        }
    }
}

This benchmark:

  • Uses @State fields to prevent constant folding on article data and embeddings
  • Returns the double result to prevent dead code elimination
  • Uses Blackhole.consume() for the multi-result variant
  • Uses @Fork(2) to run each benchmark in fresh JVMs, which prevents JIT profile pollution between benchmarks and exposes run-to-run variance
  • Measures both average time and sample time for distribution analysis

The @Setup(Level.Trial) means setup runs once per fork, not per iteration. Use Level.Trial for expensive setup (database connections, large data structures). Use Level.Invocation only when you need fresh state per benchmark call, and be aware that it adds timestamping overhead around every invocation, which distorts measurements of methods that run for less than about a millisecond.

Verifying Your Benchmark Is Not Trapped

Before trusting any JMH result, apply this checklist:

  1. Does the result make physical sense? If a method that computes a logarithm reports 0.3ns, DCE eliminated it.
  2. Does changing the input change the result? If doubling the input size does not change the time, constant folding is in play.
  3. Does the error bar overlap zero? If so, the benchmark may be measuring noise, not computation.
  4. Is the result consistent across forks? Large variance between forks suggests JIT instability.
  5. Does -XX:+PrintCompilation show the method being compiled? If the method is never compiled, you are measuring interpreter speed. If it is compiled and deoptimized repeatedly, the benchmark is unstable.
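
Items 1 and 3 of the checklist are simple arithmetic and can be scripted. The helper below is a sketch with hypothetical names. The 3.0 GHz clock and the 10-cycle floor in main are assumptions chosen for illustration, not properties of the test hardware.

```java
class BenchmarkSanity {

    // Converts a ns/op score into clock cycles for a given clock speed in GHz.
    public static double cycles(double nanosPerOp, double clockGhz) {
        return nanosPerOp * clockGhz;
    }

    // Checklist item 1: is the score at least as large as the fewest cycles
    // the operation could plausibly need?
    public static boolean physicallyPlausible(double nanosPerOp, double clockGhz, double minCycles) {
        return cycles(nanosPerOp, clockGhz) >= minCycles;
    }

    // Checklist item 3: does the interval [score - error, score + error] reach zero?
    public static boolean errorOverlapsZero(double score, double error) {
        return score - error <= 0.0;
    }

    public static void main(String[] args) {
        // Scores from the DeadCodeTrap results above, assumed 3.0 GHz clock.
        System.out.println(physicallyPlausible(0.347, 3.0, 10.0));  // about 1 cycle
        System.out.println(physicallyPlausible(23.891, 3.0, 10.0)); // about 72 cycles
        System.out.println(errorOverlapsZero(0.347, 0.012));
    }
}
```

With these assumptions the broken benchmark fails the plausibility check at about one cycle per operation, while the corrected one passes.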