Object Allocation and Memory Layout: Cache Lines, False Sharing, and the Cost of Pointer Chasing
Object Allocation and Memory Layout: Cache Lines, False Sharing, and the Cost of Pointer Chasing
The CPU does not access memory. The CPU accesses cache. When the data it needs is in L1 cache, it gets the data in 1-2 nanoseconds. When the data is in main memory, it waits 50-100 nanoseconds. That 50x penalty determines which data structures are fast and which are slow, regardless of their algorithmic complexity.
Java’s object model creates a tension. Objects have headers. Fields have alignment requirements. References add indirection. A LinkedList is O(1) for insertion, but each node access is a cache miss, making iteration 14x slower than ArrayList for the same data.
The content platform iterates over article lists, view count arrays, and recommendation scores millions of times per second. The difference between a cache-friendly layout and a cache-hostile layout is measured in response time.
Java Object Headers
Every Java object starts with a header. On a 64-bit JVM with compressed oops (the default for heaps under 32GB), the header is 12 bytes:
- Mark word (8 bytes): Hash code, GC age, lock state, biased locking info
- Klass pointer (4 bytes, compressed): Pointer to the class metadata
The JVM pads the total object size to a multiple of 8 bytes. A Point object with two int fields:
class Point { int x; int y; }
Layout:
Offset Size Field
0 8 Mark word
8 4 Klass pointer (compressed)
12 4 int x
16 4 int y
20 4 Padding (to align to 8 bytes)
Total: 24 bytes
24 bytes to store 8 bytes of data. The overhead ratio is 3:1. For small objects, the header dominates the memory footprint.
An array of 1000 Point objects occupies: 1000 * 24 bytes (objects) + 16 bytes (array header) + 1000 * 4 bytes (references) = 28,016 bytes. Only 8,000 bytes are actual data. The rest is headers, padding, and references.
Measuring Object Layout with JOL
The JOL (Java Object Layout) tool shows the exact memory layout of any object:
// Add JOL dependency: org.openjdk.jol:jol-core
import org.openjdk.jol.info.ClassLayout;
public class LayoutInspector {
public static void main(String[] args) {
System.out.println(ClassLayout.parseInstance(new Article(
"Test Title",
"Test content for the article",
List.of("java"),
Instant.now(),
42
)).toPrintable());
}
record Article(String title, String content, List<String> tags,
Instant publishedAt, int viewCount) {}
}
Output (with compressed oops):
Article object internals:
OFF SZ TYPE DESCRIPTION VALUE
0 8 (object header: mark)
8 4 (object header: klass)
12 4 int Article.viewCount
16 4 String Article.title (object ref)
20 4 String Article.content (object ref)
24 4 List Article.tags (object ref)
28 4 Instant Article.publishedAt (object ref)
Instance size: 32 bytes
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
32 bytes for the Article object itself. But this object contains four references to other objects. The String title points to a separate 24-byte String object, which points to a separate byte[] array. Following all the references to compute the deep size:
- Article: 32 bytes
- title String: 24 bytes + byte[] (title.length + 16 header)
- content String: 24 bytes + byte[] (content.length + 16 header)
- tags List: 24 bytes + Object[] (tags.length * 4 + 16 header) + tag Strings
- publishedAt Instant: 24 bytes
A single Article with a 1KB content body occupies approximately 1.2KB spread across 8+ separate heap objects. Iterating over the Article and accessing all its fields requires following 5+ pointers, each potentially a cache miss.
Cache Lines and Access Patterns
Modern x86 CPUs use 64-byte cache lines. When the CPU accesses a byte of memory, it loads the entire 64-byte cache line containing that byte into L1 cache. Subsequent accesses to other bytes in the same cache line are free.
This diagram shows three key memory layout concepts. First, an ArrayList’s backing array stores compressed object references contiguously: 12 references fit in a single 64-byte cache line, making sequential scans prefetcher-friendly with only 1 cache miss per 12 references. Second, a LinkedList’s nodes are scattered across the heap, each 48-byte node in a different cache line, requiring a pointer chase (dependent load) per element. Third, the false sharing problem where two threads writing to different fields in the same cache line cause constant MESI invalidation traffic, and the fix using @Contended annotation to add 128 bytes of padding between the fields.
ArrayList vs LinkedList: The Cache Reality
The algorithmic complexity argument (ArrayList O(n) remove vs LinkedList O(1) remove) is irrelevant for iteration-dominated workloads. The cache behavior dominates.
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ListIterationBenchmark {
@Param({"1000", "10000", "100000"})
private int size;
private List<Long> arrayList;
private List<Long> linkedList;
@Setup
public void setup() {
arrayList = new ArrayList<>(size);
linkedList = new LinkedList<>();
Random r = new Random(42);
for (int i = 0; i < size; i++) {
long val = r.nextLong();
arrayList.add(val);
linkedList.add(val);
}
}
@Benchmark
public long sumArrayList() {
long sum = 0;
for (Long val : arrayList) {
sum += val;
}
return sum;
}
@Benchmark
public long sumLinkedList() {
long sum = 0;
for (Long val : linkedList) {
sum += val;
}
return sum;
}
}
Results:
Benchmark (size) Mode Cnt Score Error Units
sumArrayList 1000 avgt 30 330.2 ± 8.1 ns/op
sumLinkedList 1000 avgt 30 4,287.5 ± 112.4 ns/op
sumArrayList 10000 avgt 30 3,124.8 ± 45.2 ns/op
sumLinkedList 10000 avgt 30 58,342.1 ± 1,234.5 ns/op
sumArrayList 100000 avgt 30 31,456.2 ± 567.8 ns/op
sumLinkedList 100000 avgt 30 824,125.3 ± 15,432.1 ns/op
LinkedList iteration is 13-26x slower than ArrayList iteration. The ratio increases with size because larger data sets exceed L2 and L3 cache, making each pointer chase more expensive.
For the content platform, this means: never use LinkedList for any collection that is iterated. Article lists, search results, recommendation lists, tag collections: all ArrayList or arrays.
Struct of Arrays vs Array of Structs
The Java object model encourages the “array of structs” pattern: List<Article> where each Article is a separate heap object. Iterating over a single field (e.g., summing view counts) loads entire Article objects into cache, wasting cache line space on unused fields.
The “struct of arrays” pattern stores each field in its own array. Iterating over one field loads only that field’s data, maximizing cache utilization.
// SLOW: Array of structs (standard Java pattern)
// Iterating view counts loads title, content, tags, publishedAt into cache
public long totalViewCounts(List<Article> articles) {
long total = 0;
for (Article a : articles) {
total += a.viewCount(); // Loads entire Article object (32+ bytes) for 4 bytes of data
}
return total;
}
// FAST: Struct of arrays (cache-friendly for single-field access)
public class ArticleStore {
private String[] titles;
private String[] contents;
private int[] viewCounts; // Contiguous int array
private long[] publishedAts;
private int size;
public long totalViewCounts() {
long total = 0;
for (int i = 0; i < size; i++) {
total += viewCounts[i]; // 16 ints per cache line, sequential access
}
return total;
}
}
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class StructOfArraysBenchmark {
private static final int SIZE = 100_000;
private Article[] articles;
private int[] viewCounts;
@Setup
public void setup() {
Random r = new Random(42);
articles = new Article[SIZE];
viewCounts = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
int vc = r.nextInt(100_000);
articles[i] = new Article("title", "content", List.of(), Instant.now(), vc);
viewCounts[i] = vc;
}
}
@Benchmark
public long sumArrayOfStructs() {
long total = 0;
for (int i = 0; i < SIZE; i++) {
total += articles[i].viewCount();
}
return total;
}
@Benchmark
public long sumStructOfArrays() {
long total = 0;
for (int i = 0; i < SIZE; i++) {
total += viewCounts[i];
}
return total;
}
record Article(String title, String content, List<String> tags,
Instant publishedAt, int viewCount) {}
}
Results:
Benchmark Mode Cnt Score Error Units
sumArrayOfStructs avgt 30 68,234.1 ± 1,234 ns/op
sumStructOfArrays avgt 30 12,456.2 ± 234 ns/op
The struct-of-arrays version is 5.5x faster. The int[] array stores 16 integers per cache line. Sequential iteration hits the hardware prefetcher perfectly: the CPU predicts the access pattern and pre-loads the next cache line before the code requests it.
The array-of-structs version chases a reference for each element. Each articles[i] is a pointer load, and the Article object may be anywhere in the heap. Even if the Article objects were allocated sequentially (TLAB allocation makes this likely for new objects), they are 32 bytes each, so only 2 fit per cache line. And the viewCount field is at offset 12, wasting the first 12 bytes of each loaded cache line.
Pointer Chasing: The Real Cost
Pointer chasing is a dependent load: the address of the next load depends on the result of the current load. The CPU cannot pipeline these loads. It must wait for each load to complete before issuing the next.
With an L1 cache latency of 1-2ns and an L3 latency of 10-15ns, a chain of N pointer dereferences costs N * (1 to 15)ns, depending on cache hit rate.
In the content platform, the recommendation engine walks a graph of related articles. Each article has a List<String> relatedIds, and each related article is fetched from a Map<String, Article>:
// SLOW: Pointer chasing through Map and List
public Set<Article> findRelated(Article root, Map<String, Article> index, int depth) {
Set<Article> visited = new HashSet<>();
Queue<Article> queue = new LinkedList<>(); // LinkedList: cache-hostile
queue.add(root);
for (int d = 0; d < depth; d++) {
int levelSize = queue.size();
for (int i = 0; i < levelSize; i++) {
Article current = queue.poll(); // Pointer chase: Node.item
if (visited.add(current)) {
for (String relId : current.relatedIds()) { // Pointer chase: List element
Article related = index.get(relId); // Pointer chase: HashMap node
if (related != null) {
queue.add(related);
}
}
}
}
}
return visited;
}
// FAST: Minimize pointer chasing with contiguous data
public Set<Article> findRelated(Article root, ArticleGraph graph, int depth) {
// ArticleGraph stores adjacency as int[][] (article index -> related indices)
// All indices are into a contiguous Article[] array
BitSet visited = new BitSet(graph.size());
int[] currentLevel = new int[]{graph.indexOf(root)};
int currentSize = 1;
Set<Article> result = new HashSet<>();
for (int d = 0; d < depth; d++) {
int[] nextLevel = new int[currentSize * 10]; // Overestimate
int nextSize = 0;
for (int i = 0; i < currentSize; i++) {
int idx = currentLevel[i];
if (!visited.get(idx)) {
visited.set(idx);
result.add(graph.articleAt(idx)); // Direct array access
int[] neighbors = graph.neighbors(idx); // Contiguous int array
for (int n : neighbors) {
if (!visited.get(n)) {
nextLevel[nextSize++] = n;
}
}
}
}
currentLevel = nextLevel;
currentSize = nextSize;
}
return result;
}
The fast version replaces HashMap lookups (3+ pointer chases per lookup) with integer-indexed array access (1 pointer dereference). The adjacency list uses int[] arrays instead of List<String>, eliminating String comparison and hash computation.
When Contiguous Memory Matters
The rule: if you iterate over a collection more than you insert into it, contiguous memory wins. The content platform’s workload is 99% reads, 1% writes. Every collection should be backed by arrays.
- Search results:
Article[], notList<Article> - View counts:
int[], notMap<String, AtomicInteger> - Feature vectors:
double[], notList<Double> - Adjacency lists:
int[][], notMap<Integer, List<Integer>>
The overhead of array-based storage is more code. You manage sizing, resizing, and null handling yourself. Java’s collections framework handles this for you with ArrayList, which is backed by an array. For most use cases, ArrayList provides 90% of the contiguous memory benefit with 100% of the convenience.
LinkedList has no valid use case in the content platform. Every LinkedList in the codebase should be replaced with ArrayDeque (for queue/deque operations) or ArrayList (for indexed access and iteration).