Setting Up the Measurement Stack

This section sets up the three tools you will use throughout the book. By the end, you will have a working JMH benchmark project, async-profiler attached to a JVM, and a Locust load test producing baseline numbers for the content platform.

JMH Project Setup

JMH (Java Microbenchmark Harness) is developed by the OpenJDK team. It handles JIT warmup, guards against dead-code elimination, aggregates results across forks and iterations, and reports the statistical error. Every microbenchmark in this book uses JMH. No exceptions.

Create a Maven project for the benchmarks:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.contentplatform</groupId>
    <artifactId>benchmarks</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <properties>
        <java.version>21</java.version>
        <jmh.version>1.37</jmh.version>
        <maven.compiler.source>${java.version}</maven.compiler.source>
        <maven.compiler.target>${java.version}</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-core</artifactId>
            <version>${jmh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-generator-annprocess</artifactId>
            <version>${jmh.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.17.0</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.datatype</groupId>
            <artifactId>jackson-datatype-jsr310</artifactId>
            <version>2.17.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.2</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <finalName>benchmarks</finalName>
                            <transformers>
                                <transformer implementation=
                                    "org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>org.openjdk.jmh.Main</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

The shade plugin creates a fat JAR with JMH’s Main class as the entry point. A single JAR is the simplest way to run benchmarks: the JMH annotation processor generates benchmark classes at compile time, and those generated classes, together with JMH and Jackson, must all be on the classpath of every JVM that JMH forks.

Build and verify:

mvn clean package -DskipTests
java -jar target/benchmarks.jar -l

The -l flag lists all discovered benchmarks. If you see your benchmark class names, the setup is correct.

Your First JMH Benchmark

Write a benchmark that measures JSON serialization of an Article object. This benchmark will be the baseline for serialization optimizations in later chapters.

package com.contentplatform.benchmarks;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.time.Instant;
import java.util.List;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Benchmark)
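// Article is not imported, so it must live in this package: a class or record
// that Jackson can (de)serialize and whose constructor takes the nine
// arguments passed in setup().
public class ArticleSerializationBenchmark {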

    private ObjectMapper mapper;
    private Article article;
    private byte[] serialized;

    @Setup(Level.Trial)
    public void setup() throws Exception {
        mapper = new ObjectMapper();
        mapper.registerModule(new JavaTimeModule());
        article = new Article(
            1L,
            "Performance Engineering for Java Systems",
            "A".repeat(10_000),  // 10KB body, typical article size
            "perf-engineering",
            List.of("java", "performance", "jvm"),
            List.of("benchmarking", "profiling", "optimization"),
            42L,
            Instant.now(),
            Instant.now()
        );
        serialized = mapper.writeValueAsBytes(article);
    }

    @Benchmark
    public byte[] serialize() throws Exception {
        return mapper.writeValueAsBytes(article);
    }

    @Benchmark
    public Article deserialize() throws Exception {
        return mapper.readValue(serialized, Article.class);
    }
}

Run it:

java -jar target/benchmarks.jar ArticleSerializationBenchmark \
     -rf json -rff results.json

Expected output (numbers vary by hardware):

Benchmark                                          Mode  Cnt      Score     Error  Units
ArticleSerializationBenchmark.serialize            avgt   10   8432.241 ± 124.556  ns/op
ArticleSerializationBenchmark.deserialize          avgt   10  12847.392 ± 287.113  ns/op

Write these numbers down. They are your baseline. When Chapter 6 introduces serialization optimizations, you will compare against these numbers.

The -rf json -rff results.json flags write results in JSON format. This matters for automation. Later chapters show how to integrate JMH results into CI pipelines to catch performance regressions.
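As a first step toward that automation, the JSON file can be read with a few lines of Python. This is a minimal sketch, assuming the layout JMH writes with -rf json: a top-level array whose entries carry "benchmark", "mode", and a "primaryMetric" object with "score", "scoreError", and "scoreUnit". The function names and the sample entry are illustrative, not part of the book's tooling.

```python
import json


def summarize_jmh(results):
    """Map each benchmark name to (score, error, unit) from parsed JMH JSON."""
    summary = {}
    for entry in results:
        metric = entry["primaryMetric"]
        summary[entry["benchmark"]] = (
            float(metric["score"]),
            float(metric["scoreError"]),
            metric["scoreUnit"],
        )
    return summary


def load_jmh_results(path):
    """Read a results file produced by `-rf json -rff <path>`."""
    with open(path) as f:
        return summarize_jmh(json.load(f))


# Illustrative entry shaped like JMH's JSON output, using the sample numbers
# from the run above. Real input comes from results.json.
sample = [
    {
        "benchmark": "com.contentplatform.benchmarks."
                     "ArticleSerializationBenchmark.serialize",
        "mode": "avgt",
        "primaryMetric": {
            "score": 8432.241,
            "scoreError": 124.556,
            "scoreUnit": "ns/op",
        },
    }
]

for name, (score, error, unit) in summarize_jmh(sample).items():
    short_name = ".".join(name.split(".")[-2:])
    print(f"{short_name}: {score:.1f} ± {error:.1f} {unit}")
```

Once the application is built, load_jmh_results("results.json") returns the same dictionary for the real run. Keep that file with your baseline notes.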

async-profiler Installation

async-profiler runs on Linux and macOS. On Linux it samples through perf_events; on macOS, profiling is limited to user-space code. On Linux 4.6 and later, capturing kernel call stacks from a non-root process requires the following sysctl settings:

# Allow non-root users to use perf_events and resolve kernel symbols
sudo sysctl kernel.perf_event_paranoid=1
sudo sysctl kernel.kptr_restrict=0

# Or, for container environments:
sudo sysctl kernel.perf_event_paranoid=-1

Download and install:

# Download the latest release (check GitHub for current version)
wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
tar xzf async-profiler-3.0-linux-x64.tar.gz
cd async-profiler-3.0-linux-x64/bin  # the asprof launcher lives in bin/

# Verify installation
./asprof --version

Test with a running Java process:

# Start any Java application
java -jar target/benchmarks.jar ArticleSerializationBenchmark &
PID=$!

# Profile for 10 seconds
./asprof -d 10 -f /tmp/test-profile.html $PID

# Open the flame graph in a browser
open /tmp/test-profile.html  # macOS
xdg-open /tmp/test-profile.html  # Linux

If you see a flame graph with Java frames, async-profiler is working. async-profiler walks Java stacks through HotSpot-specific APIs, so it normally resolves Java method names without extra JVM flags. If you also want perf or other frame-pointer-based tools to walk Java stacks on the same JVM, add -XX:+PreserveFramePointer to the JVM flags:

java -XX:+PreserveFramePointer -jar target/benchmarks.jar \
     ArticleSerializationBenchmark &

For production JVMs, PreserveFramePointer has a measurable overhead: typically 1-3% on CPU-bound workloads, though the async-profiler documentation notes it can sometimes be as high as 10%. Where you rely on frame-pointer-based tools in production, accurate stacks usually outweigh this cost. If your application is latency-sensitive at the microsecond level, benchmark the overhead for your specific workload before enabling it permanently.

async-profiler in Containers

If your content platform runs in Docker, you need additional configuration:

# Docker run with required capabilities
docker run --cap-add SYS_PTRACE \
           --security-opt seccomp=unconfined \
           -v /tmp/async-profiler:/profiler \
           your-content-platform:latest

# Or in docker-compose.yml
services:
  content-platform:
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined

Without SYS_PTRACE, async-profiler cannot attach to the JVM process. Without disabling seccomp, perf_events syscalls are blocked. These settings are for profiling environments only. Do not weaken security in production unless you understand the implications.

Locust Installation and Baseline

Install Locust:

pip install locust

Create the baseline test file. The Locust script from the main chapter body defines the content platform’s traffic pattern. Save it as locust_baseline.py and run the first baseline:

# Start your content platform application first
# Then run the baseline test
locust -f locust_baseline.py \
       --host http://localhost:8080 \
       --users 50 \
       --spawn-rate 5 \
       --run-time 3m \
       --headless \
       --csv baseline_50users

Start with 50 concurrent users. Increase gradually to find the saturation point:

# 100 users
locust -f locust_baseline.py \
       --host http://localhost:8080 \
       --users 100 \
       --spawn-rate 10 \
       --run-time 3m \
       --headless \
       --csv baseline_100users

# 200 users
locust -f locust_baseline.py \
       --host http://localhost:8080 \
       --users 200 \
       --spawn-rate 20 \
       --run-time 3m \
       --headless \
       --csv baseline_200users
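The three runs differ only in user count, spawn rate, and CSV prefix, so they can be scripted. This is a minimal sketch, assuming the spawn rate stays at one tenth of the user count as in the commands above. The function names are hypothetical, and the actual run is left commented out.

```python
import subprocess


def locust_command(users, host="http://localhost:8080", run_time="3m"):
    """Build the headless Locust command for one step of the ramp."""
    return [
        "locust", "-f", "locust_baseline.py",
        "--host", host,
        "--users", str(users),
        "--spawn-rate", str(max(1, users // 10)),
        "--run-time", run_time,
        "--headless",
        "--csv", f"baseline_{users}users",
    ]


def run_ramp(user_counts):
    """Run each step in sequence; raises if Locust exits with a non-zero status."""
    for users in user_counts:
        subprocess.run(locust_command(users), check=True)


if __name__ == "__main__":
    # Print the commands first so you can check them before running anything.
    for users in (50, 100, 200):
        print(" ".join(locust_command(users)))
    # run_ramp((50, 100, 200))  # uncomment once the application is running
```

Extend the tuple with higher user counts (400, 800) if none of the endpoints reaches its limit at 200.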

The saturation point is the user count where p99 latency exceeds your target. For the content platform, the targets are:

Endpoint                                  p50 Target   p99 Target
GET /api/articles/[id]                    30ms         100ms
GET /api/articles/search                  100ms        300ms
GET /api/articles/[id]/recommendations    50ms         200ms
POST /api/articles/[id]/view              10ms         50ms

Record the user count where each endpoint exceeds its p99 target. This is the capacity limit of your system. Every optimization in this book pushes this limit higher.
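Finding that user count by hand gets tedious once you have more than a few runs. The sketch below reads the p99 column from each stats CSV and reports the lowest user count at which each endpoint exceeds its target. It assumes that the "Type" and "Name" columns of the Locust CSV combine into the endpoint labels used in the table; that depends on how the Locust script names its requests, so adjust the keys if yours differ.

```python
import csv

# p99 targets in milliseconds, from the table above. The keys are an
# assumption: they must match "<Type> <Name>" as reported by your Locust script.
P99_TARGETS_MS = {
    "GET /api/articles/[id]": 100,
    "GET /api/articles/search": 300,
    "GET /api/articles/[id]/recommendations": 200,
    "POST /api/articles/[id]/view": 50,
}


def read_p99(stats_csv):
    """Return {"<Type> <Name>": p99 in ms} from a Locust *_stats.csv file."""
    with open(stats_csv, newline="") as f:
        return {
            f"{row['Type']} {row['Name']}": float(row["99%"])
            for row in csv.DictReader(f)
            if row["Name"] != "Aggregated"
        }


def saturation_points(p99_by_users, targets=P99_TARGETS_MS):
    """Find, per endpoint, the lowest user count whose p99 exceeds the target.

    p99_by_users maps user count -> {endpoint label: p99 in ms}.
    Endpoints that never exceed their target map to None.
    """
    result = {}
    for name, target in targets.items():
        result[name] = None
        for users in sorted(p99_by_users):
            p99 = p99_by_users[users].get(name)
            if p99 is not None and p99 > target:
                result[name] = users
                break
    return result


# Usage with the three baseline runs:
# runs = {u: read_p99(f"baseline_{u}users_stats.csv") for u in (50, 100, 200)}
# for endpoint, users in saturation_points(runs).items():
#     print(f"{endpoint:<45} {users}")
```

A result of None means the endpoint stayed within its target at every tested load, so the real limit lies above your highest user count.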

Reading Locust Results

After a Locust run with --csv <prefix>, you get four CSV files named after the prefix. For --csv baseline_100users:

  • baseline_100users_stats.csv: Per-endpoint summary with percentile latencies
  • baseline_100users_stats_history.csv: Time-series data showing latency over time
  • baseline_100users_failures.csv: Failed requests and error details
  • baseline_100users_exceptions.csv: Exceptions raised inside the Locust script itself

The stats CSV tells you where you stand. The history CSV tells you whether latency was stable or degrading. If latency climbs steadily for the whole run at a constant user count, suspect a resource leak (memory, connections, file handles). If latency spikes periodically, suspect GC pauses or a background task interfering with request processing.
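A fitted slope turns "stable or degrading" into a number. The sketch below is one way to do it, not a Locust feature. It assumes the history CSV has "Timestamp" (Unix seconds), "Name", and "99%" columns, and that rows without enough data carry "N/A". Both function names are hypothetical.

```python
import csv


def p99_series(history_csv, name="Aggregated"):
    """Return [(seconds since first sample, p99 ms)] for one request name."""
    points = []
    with open(history_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Skip other endpoints and rows where Locust had no percentile yet.
            if row["Name"] != name or row["99%"] in ("", "N/A"):
                continue
            points.append((float(row["Timestamp"]), float(row["99%"])))
    if not points:
        return []
    start = points[0][0]
    return [(t - start, p99) for t, p99 in points]


def slope_ms_per_minute(points):
    """Least-squares slope of p99 over time, in milliseconds per minute."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    return cov / var_t * 60.0


# Usage:
# series = p99_series("baseline_100users_stats_history.csv")
# print(f"p99 drift: {slope_ms_per_minute(series):+.1f} ms/min")
```

Drop the samples from the ramp-up period before fitting, because latency naturally rises while users are still being spawned. A slope near zero over the steady-state portion suggests stable latency. Periodic spikes will not show up in a slope, so still plot or skim the series.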

# Analyze baseline results
import csv

def print_baseline(csv_path):
    """Print the baseline stats in a readable format."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        print(f"{'Endpoint':<45} {'Avg':>8} {'p50':>6} {'p90':>6} "
              f"{'p95':>6} {'p99':>6} {'RPS':>8}")
        print("-" * 91)
        for row in reader:
            # Skip the summary row; only per-endpoint rows are printed
            if row["Name"] != "Aggregated":
                # Locust writes averages and rates with many decimals;
                # round them so the columns stay aligned
                print(f"{row['Name']:<45} "
                      f"{float(row['Average Response Time']):>8.1f} "
                      f"{row['50%']:>6} "
                      f"{row['90%']:>6} "
                      f"{row['95%']:>6} "
                      f"{row['99%']:>6} "
                      f"{float(row['Requests/s']):>8.1f}")

print_baseline("baseline_100users_stats.csv")

Combining the Tools

The measurement stack works together. Here is the workflow you will repeat throughout the book:

Step 1: Establish the baseline with Locust.

locust -f locust_baseline.py --host http://localhost:8080 \
       --users 100 --spawn-rate 10 --run-time 5m \
       --headless --csv before

Step 2: While Locust is running, profile with async-profiler.

# Find the JVM PID
jps | grep ContentPlatformApplication

# Profile CPU for 30 seconds
./asprof -d 30 -f /tmp/before-cpu.html <pid>

# Profile allocations for 30 seconds
./asprof -e alloc -d 30 -f /tmp/before-alloc.html <pid>

Step 3: Read the flame graphs. Identify the bottleneck.

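In the CPU flame graph, look at the leaf frames (the top edge of each stack in async-profiler's default layout): those are the methods actually consuming CPU, and the widest one is your primary target. In the allocation flame graph, the widest frames show which code paths account for the most allocation.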
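Frame widths can also be read as numbers. async-profiler can write collapsed stacks instead of HTML (pass -o collapsed), where each line is a semicolon-separated stack followed by a sample count. The sketch below sums samples per leaf frame, which corresponds to the width of the frames at the tips of the stacks. The function names and the example stacks are made up for illustration.

```python
from collections import Counter


def self_samples(collapsed_lines):
    """Count samples per leaf frame from collapsed-stack lines.

    Each line looks like 'root;caller;leaf 42': frames joined by ';',
    then a space and the sample count.
    """
    totals = Counter()
    for line in collapsed_lines:
        line = line.strip()
        if not line:
            continue
        stack, count = line.rsplit(" ", 1)
        leaf = stack.rsplit(";", 1)[-1]
        totals[leaf] += int(count)
    return totals


def top_leaf_frames(collapsed_lines, limit=5):
    """Return (frame, samples, share of all samples) for the hottest leaves."""
    totals = self_samples(collapsed_lines)
    grand_total = sum(totals.values())
    return [
        (frame, samples, samples / grand_total)
        for frame, samples in totals.most_common(limit)
    ]


# Made-up stacks for illustration; real input comes from the profiler.
example = [
    "Thread.run;ArticleController.get;ObjectMapper.writeValueAsBytes 70",
    "Thread.run;ArticleController.get;ArticleRepository.findById 20",
    "Thread.run;SearchController.search;ObjectMapper.writeValueAsBytes 10",
]
for frame, samples, share in top_leaf_frames(example):
    print(f"{frame:<40} {samples:>6} {share:>6.1%}")
```

With a real profile, pass the open file object to top_leaf_frames. The ranking is a cross-check on what you see in the flame graph, not a replacement for reading it: the graph still shows which callers lead to the hot frame.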

Step 4: Write a JMH benchmark for the bottleneck.

Isolate the slow operation. Benchmark it. Get a number in nanoseconds per operation.

Step 5: Apply the fix.

Change the code. The fix might be an algorithm change, an index, a cache, a batch query, or a data structure swap.

Step 6: Re-benchmark with JMH.

Run the same JMH benchmark. Compare the before and after numbers.
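When comparing, check the error margins as well as the scores. The sketch below computes the percent change between two average-time results and checks whether the score ± error ranges are disjoint. Treating overlapping ranges as "possibly noise" is a conservative rule of thumb chosen here, not something JMH itself enforces, and the function name is hypothetical.

```python
def compare_scores(before, after):
    """Compare two (score, error) pairs from the same average-time benchmark.

    Returns (percent change, separated). A negative change means the new
    code is faster. `separated` is True when the score ± error ranges do
    not overlap; overlapping ranges mean the difference may be noise.
    """
    b_score, b_err = before
    a_score, a_err = after
    change_pct = (a_score - b_score) / b_score * 100.0
    separated = (
        (a_score + a_err) < (b_score - b_err)
        or (a_score - a_err) > (b_score + b_err)
    )
    return change_pct, separated


# The "before" pair is the sample serialize() result from earlier in this
# section; the "after" pair is invented purely to show the call.
change, separated = compare_scores((8432.241, 124.556), (6100.0, 110.0))
print(f"change: {change:+.1f}%  clearly separated: {separated}")
```

If the ranges overlap, increase the fork or iteration count and re-run both versions before drawing a conclusion.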

Step 7: Re-run Locust.

locust -f locust_baseline.py --host http://localhost:8080 \
       --users 100 --spawn-rate 10 --run-time 5m \
       --headless --csv after

Step 8: Compare.

# Compare before and after Locust runs
import csv

def compare_runs(before_csv, after_csv):
    """Compare the p99 latency of two Locust runs, endpoint by endpoint."""
    before = {}
    after = {}

    with open(before_csv, newline="") as f:
        for row in csv.DictReader(f):
            before[row["Name"]] = row

    with open(after_csv, newline="") as f:
        for row in csv.DictReader(f):
            after[row["Name"]] = row

    print(f"{'Endpoint':<40} {'p99 Before':>10} {'p99 After':>10} "
          f"{'Change':>10}")
    print("-" * 80)
    for name in before:
        # Only compare endpoints present in both runs
        if name in after and name != "Aggregated":
            b = int(before[name]["99%"])
            a = int(after[name]["99%"])
            # Avoid dividing by zero when the baseline recorded no latency
            change = f"{(a - b) / b * 100:>+8.1f}%" if b else f"{'n/a':>9}"
            print(f"{name:<40} {b:>8}ms {a:>8}ms {change}")

compare_runs("before_stats.csv", "after_stats.csv")

This workflow produces four artifacts: the before Locust CSV, the before flame graph, the after Locust CSV, and the JMH benchmark results. These artifacts are your evidence. They survive code reviews, architecture discussions, and manager questions. “I profiled the system and the flame graph showed X. I benchmarked the fix and it improved by Y. The Locust test confirmed Z improvement at the system level.” That three-sentence summary ends every performance investigation.

Verifying Your Setup

Run this checklist before proceeding to Chapter 2:

  • java -jar target/benchmarks.jar -l lists your benchmark classes
  • java -jar target/benchmarks.jar ArticleSerializationBenchmark completes without errors and reports scores in ns/op
  • ./asprof --version prints the async-profiler version
  • ./asprof -d 10 -f /tmp/test.html <pid> produces an HTML flame graph with Java frame names
  • locust -f locust_baseline.py --host http://localhost:8080 --users 10 --spawn-rate 2 --run-time 30s --headless --csv setup_check completes and produces the setup_check_*.csv files
  • You have recorded baseline numbers for all four content platform endpoints

If any of these fail, fix them before continuing. The rest of this book assumes these tools work.