Object Allocation and Memory Layout: Cache Lines, False Sharing, and the Cost of Pointer Chasing

The CPU does not access memory. The CPU accesses cache. When the data it needs is in L1 cache, it gets the data in 1-2 nanoseconds. When the data is in main memory, it waits 50-100 nanoseconds. That 50x penalty determines which data structures are fast and which are slow, regardless of their algorithmic complexity.

Java’s object model creates a tension. Objects have headers. Fields have alignment requirements. References add indirection. A LinkedList is O(1) for insertion, but each node access is a cache miss, making iteration 14x slower than ArrayList for the same data.

The content platform iterates over article lists, view count arrays, and recommendation scores millions of times per second. The difference between a cache-friendly layout and a cache-hostile layout is measured in response time.

Java Object Headers

Every Java object starts with a header. On a 64-bit JVM with compressed oops (the default for heaps under 32GB), the header is 12 bytes:

Mark word (8 bytes): Hash code, GC age, lock state, biased locking info
Klass pointer (4 bytes, compressed): Pointer to the class metadata

The JVM pads the total object size to a multiple of 8 bytes. A Point object with two int fields:

class Point { int x; int y; }

Layout:

Offset  Size  Field
0       8     Mark word
8       4     Klass pointer (compressed)
12      4     int x
16      4     int y
20      4     Padding (to align to 8 bytes)
Total: 24 bytes

24 bytes to store 8 bytes of data. The overhead ratio is 3:1. For small objects, the header dominates the memory footprint.

An array of 1000 Point objects occupies: 1000 * 24 bytes (objects) + 16 bytes (array header) + 1000 * 4 bytes (references) = 28,016 bytes. Only 8,000 bytes are actual data. The rest is headers, padding, and references.

Measuring Object Layout with JOL

The JOL (Java Object Layout) tool shows the exact memory layout of any object:

// Add JOL dependency: org.openjdk.jol:jol-core
import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {
    public static void main(String[] args) {
        System.out.println(ClassLayout.parseInstance(new Article(
            "Test Title",
            "Test content for the article",
            List.of("java"),
            Instant.now(),
            42
        )).toPrintable());
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}
}

Output (with compressed oops):

Article object internals:
OFF  SZ                TYPE DESCRIPTION               VALUE
  0   8                     (object header: mark)
  8   4                     (object header: klass)
 12   4    int               Article.viewCount
 16   4    String            Article.title             (object ref)
 20   4    String            Article.content           (object ref)
 24   4    List              Article.tags              (object ref)
 28   4    Instant           Article.publishedAt       (object ref)
Instance size: 32 bytes
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total

32 bytes for the Article object itself. But this object contains four references to other objects. The String title points to a separate 24-byte String object, which points to a separate byte[] array. Following all the references to compute the deep size:

Article: 32 bytes
title String: 24 bytes + byte[] (title.length + 16 header)
content String: 24 bytes + byte[] (content.length + 16 header)
tags List: 24 bytes + Object[] (tags.length * 4 + 16 header) + tag Strings
publishedAt Instant: 24 bytes

A single Article with a 1KB content body occupies approximately 1.2KB spread across 8+ separate heap objects. Iterating over the Article and accessing all its fields requires following 5+ pointers, each potentially a cache miss.

Cache Lines and Access Patterns

Modern x86 CPUs use 64-byte cache lines. When the CPU accesses a byte of memory, it loads the entire 64-byte cache line containing that byte into L1 cache. Subsequent accesses to other bytes in the same cache line are free.

Memory Layout: ArrayList vs LinkedList with Cache Line Boundaries

This diagram shows three key memory layout concepts. First, an ArrayList’s backing array stores compressed object references contiguously: 12 references fit in a single 64-byte cache line, making sequential scans prefetcher-friendly with only 1 cache miss per 12 references. Second, a LinkedList’s nodes are scattered across the heap, each 48-byte node in a different cache line, requiring a pointer chase (dependent load) per element. Third, the false sharing problem where two threads writing to different fields in the same cache line cause constant MESI invalidation traffic, and the fix using @Contended annotation to add 128 bytes of padding between the fields.

ArrayList vs LinkedList: The Cache Reality

The algorithmic complexity argument (ArrayList O(n) remove vs LinkedList O(1) remove) is irrelevant for iteration-dominated workloads. The cache behavior dominates.

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ListIterationBenchmark {

    @Param({"1000", "10000", "100000"})
    private int size;

    private List<Long> arrayList;
    private List<Long> linkedList;

    @Setup
    public void setup() {
        arrayList = new ArrayList<>(size);
        linkedList = new LinkedList<>();
        Random r = new Random(42);
        for (int i = 0; i < size; i++) {
            long val = r.nextLong();
            arrayList.add(val);
            linkedList.add(val);
        }
    }

    @Benchmark
    public long sumArrayList() {
        long sum = 0;
        for (Long val : arrayList) {
            sum += val;
        }
        return sum;
    }

    @Benchmark
    public long sumLinkedList() {
        long sum = 0;
        for (Long val : linkedList) {
            sum += val;
        }
        return sum;
    }
}

Results:

Benchmark                     (size)  Mode  Cnt        Score      Error  Units
sumArrayList                    1000  avgt   30      330.2 ±      8.1  ns/op
sumLinkedList                   1000  avgt   30     4,287.5 ±    112.4  ns/op
sumArrayList                   10000  avgt   30    3,124.8 ±     45.2  ns/op
sumLinkedList                  10000  avgt   30   58,342.1 ±  1,234.5  ns/op
sumArrayList                  100000  avgt   30   31,456.2 ±    567.8  ns/op
sumLinkedList                 100000  avgt   30  824,125.3 ± 15,432.1  ns/op

LinkedList iteration is 13-26x slower than ArrayList iteration. The ratio increases with size because larger data sets exceed L2 and L3 cache, making each pointer chase more expensive.

For the content platform, this means: never use LinkedList for any collection that is iterated. Article lists, search results, recommendation lists, tag collections: all ArrayList or arrays.

Struct of Arrays vs Array of Structs

The Java object model encourages the “array of structs” pattern: List<Article> where each Article is a separate heap object. Iterating over a single field (e.g., summing view counts) loads entire Article objects into cache, wasting cache line space on unused fields.

The “struct of arrays” pattern stores each field in its own array. Iterating over one field loads only that field’s data, maximizing cache utilization.

// SLOW: Array of structs (standard Java pattern)
// Iterating view counts loads title, content, tags, publishedAt into cache
public long totalViewCounts(List<Article> articles) {
    long total = 0;
    for (Article a : articles) {
        total += a.viewCount();  // Loads entire Article object (32+ bytes) for 4 bytes of data
    }
    return total;
}

// FAST: Struct of arrays (cache-friendly for single-field access)
public class ArticleStore {
    private String[] titles;
    private String[] contents;
    private int[] viewCounts;      // Contiguous int array
    private long[] publishedAts;
    private int size;

    public long totalViewCounts() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += viewCounts[i];  // 16 ints per cache line, sequential access
        }
        return total;
    }
}

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class StructOfArraysBenchmark {

    private static final int SIZE = 100_000;
    private Article[] articles;
    private int[] viewCounts;

    @Setup
    public void setup() {
        Random r = new Random(42);
        articles = new Article[SIZE];
        viewCounts = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            int vc = r.nextInt(100_000);
            articles[i] = new Article("title", "content", List.of(), Instant.now(), vc);
            viewCounts[i] = vc;
        }
    }

    @Benchmark
    public long sumArrayOfStructs() {
        long total = 0;
        for (int i = 0; i < SIZE; i++) {
            total += articles[i].viewCount();
        }
        return total;
    }

    @Benchmark
    public long sumStructOfArrays() {
        long total = 0;
        for (int i = 0; i < SIZE; i++) {
            total += viewCounts[i];
        }
        return total;
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}
}

Results:

Benchmark                                 Mode  Cnt      Score     Error  Units
sumArrayOfStructs                         avgt   30  68,234.1 ±  1,234  ns/op
sumStructOfArrays                         avgt   30  12,456.2 ±    234  ns/op

The struct-of-arrays version is 5.5x faster. The int[] array stores 16 integers per cache line. Sequential iteration hits the hardware prefetcher perfectly: the CPU predicts the access pattern and pre-loads the next cache line before the code requests it.

The array-of-structs version chases a reference for each element. Each articles[i] is a pointer load, and the Article object may be anywhere in the heap. Even if the Article objects were allocated sequentially (TLAB allocation makes this likely for new objects), they are 32 bytes each, so only 2 fit per cache line. And the viewCount field is at offset 12, wasting the first 12 bytes of each loaded cache line.

Pointer Chasing: The Real Cost

Pointer chasing is a dependent load: the address of the next load depends on the result of the current load. The CPU cannot pipeline these loads. It must wait for each load to complete before issuing the next.

With an L1 cache latency of 1-2ns and an L3 latency of 10-15ns, a chain of N pointer dereferences costs N * (1 to 15)ns, depending on cache hit rate.

In the content platform, the recommendation engine walks a graph of related articles. Each article has a List<String> relatedIds, and each related article is fetched from a Map<String, Article>:

// SLOW: Pointer chasing through Map and List
public Set<Article> findRelated(Article root, Map<String, Article> index, int depth) {
    Set<Article> visited = new HashSet<>();
    Queue<Article> queue = new LinkedList<>();  // LinkedList: cache-hostile
    queue.add(root);

    for (int d = 0; d < depth; d++) {
        int levelSize = queue.size();
        for (int i = 0; i < levelSize; i++) {
            Article current = queue.poll();        // Pointer chase: Node.item
            if (visited.add(current)) {
                for (String relId : current.relatedIds()) {  // Pointer chase: List element
                    Article related = index.get(relId);      // Pointer chase: HashMap node
                    if (related != null) {
                        queue.add(related);
                    }
                }
            }
        }
    }
    return visited;
}

// FAST: Minimize pointer chasing with contiguous data
public Set<Article> findRelated(Article root, ArticleGraph graph, int depth) {
    // ArticleGraph stores adjacency as int[][] (article index -> related indices)
    // All indices are into a contiguous Article[] array
    BitSet visited = new BitSet(graph.size());
    int[] currentLevel = new int[]{graph.indexOf(root)};
    int currentSize = 1;
    Set<Article> result = new HashSet<>();

    for (int d = 0; d < depth; d++) {
        int[] nextLevel = new int[currentSize * 10];  // Overestimate
        int nextSize = 0;
        for (int i = 0; i < currentSize; i++) {
            int idx = currentLevel[i];
            if (!visited.get(idx)) {
                visited.set(idx);
                result.add(graph.articleAt(idx));  // Direct array access
                int[] neighbors = graph.neighbors(idx);  // Contiguous int array
                for (int n : neighbors) {
                    if (!visited.get(n)) {
                        nextLevel[nextSize++] = n;
                    }
                }
            }
        }
        currentLevel = nextLevel;
        currentSize = nextSize;
    }
    return result;
}

The fast version replaces HashMap lookups (3+ pointer chases per lookup) with integer-indexed array access (1 pointer dereference). The adjacency list uses int[] arrays instead of List<String>, eliminating String comparison and hash computation.

When Contiguous Memory Matters

The rule: if you iterate over a collection more than you insert into it, contiguous memory wins. The content platform’s workload is 99% reads, 1% writes. Every collection should be backed by arrays.

Search results: Article[], not List<Article>
View counts: int[], not Map<String, AtomicInteger>
Feature vectors: double[], not List<Double>
Adjacency lists: int[][], not Map<Integer, List<Integer>>

The overhead of array-based storage is more code. You manage sizing, resizing, and null handling yourself. Java’s collections framework handles this for you with ArrayList, which is backed by an array. For most use cases, ArrayList provides 90% of the contiguous memory benefit with 100% of the convenience.

LinkedList has no valid use case in the content platform. Every LinkedList in the codebase should be replaced with ArrayDeque (for queue/deque operations) or ArrayList (for indexed access and iteration).