Skip to main content
fast by design

Constant Factors and Cache Effects: When Complexity Analysis Lies

9 min read Chapter 38 of 90

Constant Factors and Cache Effects: When Complexity Analysis Lies

The main chapter showed that a LinkedList O(n) scan is 20x slower than an array O(n) scan due to cache effects. This section goes deeper: how to measure cache behavior, how to restructure data for cache efficiency, and how to eliminate the constant factors that dominate real-world performance.

Measuring the Cache Effect

The CPU’s performance counters expose cache hit/miss rates. JMH can report these through the -prof perfnorm profiler on Linux:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2, jvmArgsAppend = {"-XX:+UseCompressedOops"})
@State(Scope.Benchmark)
public class CacheMissBenchmark {

    @Param({"1000000"})
    int size;

    int[] contiguousArray;
    Integer[] boxedArray;

    @Setup(Level.Trial)
    public void setup() {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        contiguousArray = rng.ints(size).toArray();
        boxedArray = new Integer[size];
        // Allocate in random order to scatter on heap
        int[] indices = new int[size];
        for (int i = 0; i < size; i++) indices[i] = i;
        for (int i = size - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = indices[i]; indices[i] = indices[j]; indices[j] = tmp;
        }
        for (int idx : indices) {
            boxedArray[idx] = contiguousArray[idx]; // Scattered allocation
        }
    }

    @Benchmark
    public long sumContiguous() {
        long sum = 0;
        for (int v : contiguousArray) {
            sum += v;                                // FAST: L1 cache hits
        }
        return sum;
    }

    @Benchmark
    public long sumBoxed() {
        long sum = 0;
        for (Integer v : boxedArray) {
            sum += v;                                // SLOW: pointer chasing
        }
        return sum;
    }
}

Running with -prof perfnorm produces per-operation hardware counter metrics:

Metricint[]Integer[]
L1-dcache-load-misses / op0.064.2
LLC-load-misses / op0.0010.85
Time (ns/element)0.284.8

The int[] scan misses L1 cache once per 16 elements (one cache line = 16 ints). The Integer[] scan misses on most elements because each Integer object is a separate heap allocation at a different address. The 17x performance difference maps directly to the 70x difference in L1 cache miss rate.

Struct-of-Arrays vs Array-of-Structs

Object-oriented design says “group related data in objects.” Cache-friendly design says “group same-field data in arrays.” These conflict.

The content platform’s article filter checks three fields: category, publish date, and view count. The natural Java approach stores these in Article objects (array of structs). The cache-friendly approach stores each field in its own array (struct of arrays):

// Array of Structs (AoS): natural OOP design
public class ArticleFilterAoS {

    record Article(String category, long publishTime, int viewCount) {}

    public List<Integer> filterHot(Article[] articles,
                                    String targetCategory,
                                    long minTime,
                                    int minViews) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < articles.length; i++) {
            Article a = articles[i];                         // SLOW: loads entire object
            if (a.category().equals(targetCategory)
                && a.publishTime() >= minTime
                && a.viewCount() >= minViews) {
                result.add(i);
            }
        }
        return result;
    }
}
// Struct of Arrays (SoA): cache-friendly design
public class ArticleFilterSoA {

    private final String[] categories;
    private final long[] publishTimes;
    private final int[] viewCounts;

    public ArticleFilterSoA(int size) {
        this.categories = new String[size];
        this.publishTimes = new long[size];
        this.viewCounts = new int[size];
    }

    public List<Integer> filterHot(String targetCategory,
                                    long minTime,
                                    int minViews) {
        List<Integer> result = new ArrayList<>();
        int n = categories.length;

        // Phase 1: Filter by category (String comparison)
        // Only loads category array into cache
        for (int i = 0; i < n; i++) {
            if (!categories[i].equals(targetCategory)) continue;  // FAST

            // Phase 2: Check time (only for matching categories)
            if (publishTimes[i] < minTime) continue;

            // Phase 3: Check views (only for matching category + time)
            if (viewCounts[i] >= minViews) {
                result.add(i);
            }
        }
        return result;
    }
}
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class SoAvsAoSBenchmark {

    @Param({"500000"})
    int size;

    ArticleFilterAoS.Article[] aosArticles;
    ArticleFilterSoA soaArticles;
    String targetCategory;
    long minTime;
    int minViews;

    @Setup(Level.Trial)
    public void setup() {
        String[] cats = {"java", "python", "rust", "go", "typescript",
                         "kotlin", "scala", "c++", "c#", "ruby"};
        ThreadLocalRandom rng = ThreadLocalRandom.current();

        aosArticles = new ArticleFilterAoS.Article[size];
        soaArticles = new ArticleFilterSoA(size);

        for (int i = 0; i < size; i++) {
            String cat = cats[rng.nextInt(cats.length)];
            long time = 1_700_000_000_000L + rng.nextLong(86_400_000L * 365);
            int views = rng.nextInt(100_000);

            aosArticles[i] = new ArticleFilterAoS.Article(cat, time, views);
            soaArticles.categories()[i] = cat;
            soaArticles.publishTimes()[i] = time;
            soaArticles.viewCounts()[i] = views;
        }

        targetCategory = "java";
        minTime = 1_700_000_000_000L + 86_400_000L * 300;
        minViews = 1000;
    }

    @Benchmark
    public List<Integer> filterAoS() {
        return new ArticleFilterAoS().filterHot(
            aosArticles, targetCategory, minTime, minViews);
    }

    @Benchmark
    public List<Integer> filterSoA() {
        return soaArticles.filterHot(targetCategory, minTime, minViews);
    }
}
Approach500K articles (us)Speedup
Array of Structs (AoS)3,200baseline
Struct of Arrays (SoA)9803.3x faster

The SoA approach is 3.3x faster because:

  1. Category filter: Scans only the categories array. Each cache line loads 4-8 String references (vs loading entire Article objects that waste cache space on unused fields).
  2. Short-circuiting: Articles that fail the category check (90% of them, since 10 categories and we want one) never touch the publishTimes or viewCounts arrays. Those arrays stay out of cache.
  3. Sequential access: Each array is accessed sequentially, enabling hardware prefetching.

Trade-off: SoA code is less readable and harder to maintain. Use SoA only for hot-path filters on large datasets (100K+ elements). For small datasets or cold paths, AoS is cleaner and fast enough.

Branch Prediction: Sorted vs Unsorted Filtering

CPUs predict branch outcomes (if/else) based on recent history. Predictable branches (always true, always false, or repeating patterns) are essentially free. Unpredictable branches (50/50 random) cost ~15-20 cycles per misprediction.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class BranchPredictionBenchmark {

    @Param({"1000000"})
    int size;

    int[] sortedData;
    int[] unsortedData;
    int threshold;

    @Setup(Level.Trial)
    public void setup() {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        unsortedData = rng.ints(size, 0, 100).toArray();
        sortedData = unsortedData.clone();
        Arrays.sort(sortedData);
        threshold = 50; // ~50% of elements pass
    }

    @Benchmark
    public long filterSorted() {
        long sum = 0;
        for (int v : sortedData) {
            if (v >= threshold) {              // FAST: predictable branch
                sum += v;                      // All false, then all true
            }
        }
        return sum;
    }

    @Benchmark
    public long filterUnsorted() {
        long sum = 0;
        for (int v : unsortedData) {
            if (v >= threshold) {              // SLOW: unpredictable branch
                sum += v;                      // Random true/false
            }
        }
        return sum;
    }

    @Benchmark
    public long filterBranchless() {
        long sum = 0;
        for (int v : unsortedData) {
            // Branchless: arithmetic instead of branch
            int mask = (threshold - v - 1) >> 31;  // 0xFFFFFFFF if v >= threshold
            sum += v & mask;                        // FAST: no branch prediction
        }
        return sum;
    }
}
ApproachTime (us)
Filter sorted data620
Filter unsorted data2,400
Filter branchless (unsorted)710

Filtering sorted data is 3.9x faster than unsorted because the branch predictor sees a long run of “not taken” followed by a long run of “taken.” The branchless version eliminates the branch entirely using arithmetic, achieving near-sorted performance on unsorted data.

Applicability: The content platform’s view count filter (articles with > 1000 views) operates on unsorted data. The branchless technique is worth applying when:

  1. The filter selectivity is near 50% (worst case for branch prediction)
  2. The dataset is large (100K+ elements)
  3. The filter expression is simple (integer comparison)

For filters with 90% selectivity (most elements pass or most fail), the branch predictor handles it well and branchless code provides no benefit.

Allocation-Free Algorithms

Object allocation in Java is fast (~10ns). Garbage collection of short-lived objects is also fast (young generation collection). But at scale, allocation pressure slows everything: more GC pauses, more cache pollution from dead objects, more memory bandwidth consumed by copying.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class AllocationFreeBenchmark {

    @Param({"100000"})
    int size;

    int[] data;

    @Setup(Level.Trial)
    public void setup() {
        data = ThreadLocalRandom.current().ints(size, 0, 1000).toArray();
    }

    @Benchmark
    public int[] filterWithAllocation() {
        // SLOW: Stream creates Iterator, Spliterator, boxed Integers, list nodes
        return IntStream.of(data)
            .filter(v -> v > 500)
            .toArray();
    }

    @Benchmark
    public int[] filterAllocationFree() {
        // FAST: No allocation during filter, single array allocation for result
        int count = 0;
        for (int v : data) {
            if (v > 500) count++;
        }
        int[] result = new int[count];
        int idx = 0;
        for (int v : data) {
            if (v > 500) result[idx++] = v;
        }
        return result;
    }
}
Approach100K elements (us)Allocations
IntStream.filter.toArray185Spliterator + internal buffer resizes
Two-pass allocation-free68One array

The two-pass approach is 2.7x faster. The first pass counts matches (no allocation). The second pass fills a pre-sized array (one allocation of exact size). IntStream is not terrible here because IntStream avoids boxing, but it still allocates internal buffers and uses spliterator machinery.

For Stream<Integer> (boxed), the gap widens to 5-8x due to autoboxing each element.

The Constant Factor Checklist

Before optimizing an algorithm’s complexity class, check these constant factors:

FactorSymptomFixTypical Impact
Pointer chasingHigh L1 miss rateUse arrays instead of linked structures5-20x
Scattered objectsHigh LLC miss rateSoA layout, parallel arrays2-4x
Branch mispredictionHigh branch-miss countSort data, branchless code2-4x
Allocation pressureFrequent young GCPre-allocate, two-pass algorithms2-3x
AutoboxingExcessive Integer/Long/Double allocationPrimitive arrays and streams3-5x
Comparator indirectionPolymorphic call sitesInline comparators1.2-1.5x

These factors multiply. An algorithm suffering from pointer chasing (5x), branch misprediction (3x), and allocation pressure (2x) has a combined constant factor of 30x. An O(n) algorithm with a 30x constant factor loses to an O(n log n) algorithm with a 1x constant factor for all datasets under 2^30 elements.

The content platform’s recommendation engine had exactly this profile before optimization. The filter step used LinkedList iteration (pointer chasing) with Stream.filter (allocation) on unsorted candidates (branch misprediction). Switching to pre-sorted int[] arrays with manual iteration reduced the filter step from 580us to 28us: a 20x improvement without changing the O(n) complexity class.