Constant Factors and Cache Effects: When Complexity Analysis Lies
Constant Factors and Cache Effects: When Complexity Analysis Lies
The main chapter showed that a LinkedList O(n) scan is 20x slower than an array O(n) scan due to cache effects. This section goes deeper: how to measure cache behavior, how to restructure data for cache efficiency, and how to eliminate the constant factors that dominate real-world performance.
Measuring the Cache Effect
The CPU’s performance counters expose cache hit/miss rates. JMH can report these through the -prof perfnorm profiler on Linux:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2, jvmArgsAppend = {"-XX:+UseCompressedOops"})
@State(Scope.Benchmark)
public class CacheMissBenchmark {
@Param({"1000000"})
int size;
int[] contiguousArray;
Integer[] boxedArray;
@Setup(Level.Trial)
public void setup() {
ThreadLocalRandom rng = ThreadLocalRandom.current();
contiguousArray = rng.ints(size).toArray();
boxedArray = new Integer[size];
// Allocate in random order to scatter on heap
int[] indices = new int[size];
for (int i = 0; i < size; i++) indices[i] = i;
for (int i = size - 1; i > 0; i--) {
int j = rng.nextInt(i + 1);
int tmp = indices[i]; indices[i] = indices[j]; indices[j] = tmp;
}
for (int idx : indices) {
boxedArray[idx] = contiguousArray[idx]; // Scattered allocation
}
}
@Benchmark
public long sumContiguous() {
long sum = 0;
for (int v : contiguousArray) {
sum += v; // FAST: L1 cache hits
}
return sum;
}
@Benchmark
public long sumBoxed() {
long sum = 0;
for (Integer v : boxedArray) {
sum += v; // SLOW: pointer chasing
}
return sum;
}
}
Running with -prof perfnorm produces per-operation hardware counter metrics:
| Metric | int[] | Integer[] |
|---|---|---|
| L1-dcache-load-misses / op | 0.06 | 4.2 |
| LLC-load-misses / op | 0.001 | 0.85 |
| Time (ns/element) | 0.28 | 4.8 |
The int[] scan misses L1 cache once per 16 elements (one cache line = 16 ints). The Integer[] scan misses on most elements because each Integer object is a separate heap allocation at a different address. The 17x performance difference maps directly to the 70x difference in L1 cache miss rate.
Struct-of-Arrays vs Array-of-Structs
Object-oriented design says “group related data in objects.” Cache-friendly design says “group same-field data in arrays.” These conflict.
The content platform’s article filter checks three fields: category, publish date, and view count. The natural Java approach stores these in Article objects (array of structs). The cache-friendly approach stores each field in its own array (struct of arrays):
// Array of Structs (AoS): natural OOP design
public class ArticleFilterAoS {
record Article(String category, long publishTime, int viewCount) {}
public List<Integer> filterHot(Article[] articles,
String targetCategory,
long minTime,
int minViews) {
List<Integer> result = new ArrayList<>();
for (int i = 0; i < articles.length; i++) {
Article a = articles[i]; // SLOW: loads entire object
if (a.category().equals(targetCategory)
&& a.publishTime() >= minTime
&& a.viewCount() >= minViews) {
result.add(i);
}
}
return result;
}
}
// Struct of Arrays (SoA): cache-friendly design
public class ArticleFilterSoA {
private final String[] categories;
private final long[] publishTimes;
private final int[] viewCounts;
public ArticleFilterSoA(int size) {
this.categories = new String[size];
this.publishTimes = new long[size];
this.viewCounts = new int[size];
}
public List<Integer> filterHot(String targetCategory,
long minTime,
int minViews) {
List<Integer> result = new ArrayList<>();
int n = categories.length;
// Phase 1: Filter by category (String comparison)
// Only loads category array into cache
for (int i = 0; i < n; i++) {
if (!categories[i].equals(targetCategory)) continue; // FAST
// Phase 2: Check time (only for matching categories)
if (publishTimes[i] < minTime) continue;
// Phase 3: Check views (only for matching category + time)
if (viewCounts[i] >= minViews) {
result.add(i);
}
}
return result;
}
}
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class SoAvsAoSBenchmark {
@Param({"500000"})
int size;
ArticleFilterAoS.Article[] aosArticles;
ArticleFilterSoA soaArticles;
String targetCategory;
long minTime;
int minViews;
@Setup(Level.Trial)
public void setup() {
String[] cats = {"java", "python", "rust", "go", "typescript",
"kotlin", "scala", "c++", "c#", "ruby"};
ThreadLocalRandom rng = ThreadLocalRandom.current();
aosArticles = new ArticleFilterAoS.Article[size];
soaArticles = new ArticleFilterSoA(size);
for (int i = 0; i < size; i++) {
String cat = cats[rng.nextInt(cats.length)];
long time = 1_700_000_000_000L + rng.nextLong(86_400_000L * 365);
int views = rng.nextInt(100_000);
aosArticles[i] = new ArticleFilterAoS.Article(cat, time, views);
soaArticles.categories()[i] = cat;
soaArticles.publishTimes()[i] = time;
soaArticles.viewCounts()[i] = views;
}
targetCategory = "java";
minTime = 1_700_000_000_000L + 86_400_000L * 300;
minViews = 1000;
}
@Benchmark
public List<Integer> filterAoS() {
return new ArticleFilterAoS().filterHot(
aosArticles, targetCategory, minTime, minViews);
}
@Benchmark
public List<Integer> filterSoA() {
return soaArticles.filterHot(targetCategory, minTime, minViews);
}
}
| Approach | 500K articles (us) | Speedup |
|---|---|---|
| Array of Structs (AoS) | 3,200 | baseline |
| Struct of Arrays (SoA) | 980 | 3.3x faster |
The SoA approach is 3.3x faster because:
- Category filter: Scans only the
categoriesarray. Each cache line loads 4-8 String references (vs loading entire Article objects that waste cache space on unused fields). - Short-circuiting: Articles that fail the category check (90% of them, since 10 categories and we want one) never touch the
publishTimesorviewCountsarrays. Those arrays stay out of cache. - Sequential access: Each array is accessed sequentially, enabling hardware prefetching.
Trade-off: SoA code is less readable and harder to maintain. Use SoA only for hot-path filters on large datasets (100K+ elements). For small datasets or cold paths, AoS is cleaner and fast enough.
Branch Prediction: Sorted vs Unsorted Filtering
CPUs predict branch outcomes (if/else) based on recent history. Predictable branches (always true, always false, or repeating patterns) are essentially free. Unpredictable branches (50/50 random) cost ~15-20 cycles per misprediction.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class BranchPredictionBenchmark {
@Param({"1000000"})
int size;
int[] sortedData;
int[] unsortedData;
int threshold;
@Setup(Level.Trial)
public void setup() {
ThreadLocalRandom rng = ThreadLocalRandom.current();
unsortedData = rng.ints(size, 0, 100).toArray();
sortedData = unsortedData.clone();
Arrays.sort(sortedData);
threshold = 50; // ~50% of elements pass
}
@Benchmark
public long filterSorted() {
long sum = 0;
for (int v : sortedData) {
if (v >= threshold) { // FAST: predictable branch
sum += v; // All false, then all true
}
}
return sum;
}
@Benchmark
public long filterUnsorted() {
long sum = 0;
for (int v : unsortedData) {
if (v >= threshold) { // SLOW: unpredictable branch
sum += v; // Random true/false
}
}
return sum;
}
@Benchmark
public long filterBranchless() {
long sum = 0;
for (int v : unsortedData) {
// Branchless: arithmetic instead of branch
int mask = (threshold - v - 1) >> 31; // 0xFFFFFFFF if v >= threshold
sum += v & mask; // FAST: no branch prediction
}
return sum;
}
}
| Approach | Time (us) |
|---|---|
| Filter sorted data | 620 |
| Filter unsorted data | 2,400 |
| Filter branchless (unsorted) | 710 |
Filtering sorted data is 3.9x faster than unsorted because the branch predictor sees a long run of “not taken” followed by a long run of “taken.” The branchless version eliminates the branch entirely using arithmetic, achieving near-sorted performance on unsorted data.
Applicability: The content platform’s view count filter (articles with > 1000 views) operates on unsorted data. The branchless technique is worth applying when:
- The filter selectivity is near 50% (worst case for branch prediction)
- The dataset is large (100K+ elements)
- The filter expression is simple (integer comparison)
For filters with 90% selectivity (most elements pass or most fail), the branch predictor handles it well and branchless code provides no benefit.
Allocation-Free Algorithms
Object allocation in Java is fast (~10ns). Garbage collection of short-lived objects is also fast (young generation collection). But at scale, allocation pressure slows everything: more GC pauses, more cache pollution from dead objects, more memory bandwidth consumed by copying.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class AllocationFreeBenchmark {
@Param({"100000"})
int size;
int[] data;
@Setup(Level.Trial)
public void setup() {
data = ThreadLocalRandom.current().ints(size, 0, 1000).toArray();
}
@Benchmark
public int[] filterWithAllocation() {
// SLOW: Stream creates Iterator, Spliterator, boxed Integers, list nodes
return IntStream.of(data)
.filter(v -> v > 500)
.toArray();
}
@Benchmark
public int[] filterAllocationFree() {
// FAST: No allocation during filter, single array allocation for result
int count = 0;
for (int v : data) {
if (v > 500) count++;
}
int[] result = new int[count];
int idx = 0;
for (int v : data) {
if (v > 500) result[idx++] = v;
}
return result;
}
}
| Approach | 100K elements (us) | Allocations |
|---|---|---|
| IntStream.filter.toArray | 185 | Spliterator + internal buffer resizes |
| Two-pass allocation-free | 68 | One array |
The two-pass approach is 2.7x faster. The first pass counts matches (no allocation). The second pass fills a pre-sized array (one allocation of exact size). IntStream is not terrible here because IntStream avoids boxing, but it still allocates internal buffers and uses spliterator machinery.
For Stream<Integer> (boxed), the gap widens to 5-8x due to autoboxing each element.
The Constant Factor Checklist
Before optimizing an algorithm’s complexity class, check these constant factors:
| Factor | Symptom | Fix | Typical Impact |
|---|---|---|---|
| Pointer chasing | High L1 miss rate | Use arrays instead of linked structures | 5-20x |
| Scattered objects | High LLC miss rate | SoA layout, parallel arrays | 2-4x |
| Branch misprediction | High branch-miss count | Sort data, branchless code | 2-4x |
| Allocation pressure | Frequent young GC | Pre-allocate, two-pass algorithms | 2-3x |
| Autoboxing | Excessive Integer/Long/Double allocation | Primitive arrays and streams | 3-5x |
| Comparator indirection | Polymorphic call sites | Inline comparators | 1.2-1.5x |
These factors multiply. An algorithm suffering from pointer chasing (5x), branch misprediction (3x), and allocation pressure (2x) has a combined constant factor of 30x. An O(n) algorithm with a 30x constant factor loses to an O(n log n) algorithm with a 1x constant factor for all datasets under 2^30 elements.
The content platform’s recommendation engine had exactly this profile before optimization. The filter step used LinkedList iteration (pointer chasing) with Stream.filter (allocation) on unsorted candidates (branch misprediction). Switching to pre-sorted int[] arrays with manual iteration reduced the filter step from 580us to 28us: a 20x improvement without changing the O(n) complexity class.