fast by design

Java Object Layout and Cache Line Effects

Chapter 17 of 90

The JVM controls how objects are laid out in memory. You do not get to choose field ordering, padding, or alignment. The JVM reorders fields to minimize padding waste, packing smaller fields into gaps left by alignment requirements. Understanding this layout is necessary for writing cache-efficient code.

JOL: Your Layout Inspector

JOL (Java Object Layout) is an OpenJDK tool that reports the exact memory layout of any object instance. Add the dependency:

<dependency>
    <groupId>org.openjdk.jol</groupId>
    <artifactId>jol-core</artifactId>
    <version>0.17</version>
</dependency>

Inspect the layout of the content platform’s Article class:

import java.time.Instant;
import java.util.List;

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class ArticleLayoutAnalysis {
    public static void main(String[] args) {
        Article article = new Article(
            1001L,
            "Building Cache-Friendly Java Applications",
            "Full article content goes here...".repeat(100),
            List.of("java", "performance", "cache"),
            Instant.now(),
            42_000,
            true
        );

        // Shallow layout: just the object itself
        System.out.println(ClassLayout.parseInstance(article).toPrintable());

        // Deep layout: object and everything it references
        System.out.println(GraphLayout.parseInstance(article).toFootprint());
    }

    record Article(long id, String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount, boolean featured) {}
}

Shallow layout output:

Article object internals:
OFF  SZ                TYPE DESCRIPTION               VALUE
  0   8                     (object header: mark)
  8   4                     (object header: klass)
 12   4    int               Article.viewCount
 16   8    long              Article.id
 24   4    String            Article.title
 28   4    String            Article.content
 32   4    List              Article.tags
 36   4    Instant           Article.publishedAt
 40   1    boolean           Article.featured
 41   7                     (alignment/padding gap)
Instance size: 48 bytes

The JVM reordered the fields. This output assumes a 64-bit JVM with compressed references, so the header is 12 bytes and each reference takes 4 bytes. int viewCount (4 bytes) fills the gap between the 12-byte header and offset 16, where long id (8 bytes, needs 8-byte alignment) is placed. boolean featured (1 byte) is placed last, with 7 bytes of padding to reach the 8-byte object alignment boundary.
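The offsets in the JOL output can be reproduced with a few lines of arithmetic. This is a minimal sketch, not how the JVM computes layouts: it assumes a 12-byte header, 4-byte references, 8-byte object alignment, and that each field is aligned to its own size. The class and method names are made up for illustration.

```java
import java.util.Arrays;

public class LayoutArithmetic {
    // Round n up to the next multiple of alignment (alignment must be a power of two).
    static int alignUp(int n, int alignment) {
        return (n + alignment - 1) & -alignment;
    }

    // Place fields in the given order, each aligned to its own size.
    // Returns one offset per field, followed by the total instance size.
    static int[] place(int headerBytes, int objectAlignment, int... fieldSizes) {
        int[] out = new int[fieldSizes.length + 1];
        int offset = headerBytes;
        for (int i = 0; i < fieldSizes.length; i++) {
            offset = alignUp(offset, fieldSizes[i]);
            out[i] = offset;
            offset += fieldSizes[i];
        }
        out[fieldSizes.length] = alignUp(offset, objectAlignment);
        return out;
    }

    public static void main(String[] args) {
        // Field order as printed by JOL:
        // viewCount(4), id(8), title(4), content(4), tags(4), publishedAt(4), featured(1)
        int[] layout = place(12, 8, 4, 8, 4, 4, 4, 4, 1);
        System.out.println(Arrays.toString(layout));
    }
}
```

The last element of the returned array is the instance size; the others are the field offsets in the order given.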

48 bytes for the shallow object. But the deep footprint tells the real story:

Article@5ca881b5d footprint:
     COUNT       AVG       SUM   DESCRIPTION
         1        48        48   Article
         1        80        80   byte[] (title backing)
         1      3280      3280   byte[] (content backing)
         3        56       168   byte[] (tag backings)
         1        24        24   Instant
         1        32        32   ImmutableCollections$ListN
         3        24        72   String (tags)
         1        24        24   String (title)
         1        24        24   String (content)
        13                3752   (total)

The Article and all its referenced objects occupy 3,752 bytes across 13 separate heap objects. Accessing all fields of this article means visiting at least 13 different memory locations, each potentially in a different cache line, and the 3,280-byte content array alone spans more than 50 cache lines.
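To put the footprint in cache-line terms, the sizes from the table can be fed through a small calculation. This sketch assumes 64-byte cache lines and gives two rough bounds: the line count if all 13 objects were packed back to back starting on a line boundary, and the count if each object started on its own line. Real placement depends on the allocator and GC, so treat both numbers as estimates.

```java
public class FootprintCacheLines {
    static final int LINE_BYTES = 64;

    // Number of cache lines needed for a block of the given size
    // that starts exactly on a line boundary.
    static int linesFor(int bytes) {
        return (bytes + LINE_BYTES - 1) / LINE_BYTES;
    }

    static int totalBytes(int[] sizes) {
        int sum = 0;
        for (int s : sizes) sum += s;
        return sum;
    }

    // All objects packed contiguously, starting on a line boundary.
    static int linesIfContiguous(int[] sizes) {
        return linesFor(totalBytes(sizes));
    }

    // Every object starts on its own line boundary.
    static int linesIfEachAligned(int[] sizes) {
        int lines = 0;
        for (int s : sizes) lines += linesFor(s);
        return lines;
    }

    public static void main(String[] args) {
        // Object sizes copied from the footprint table above (13 objects).
        int[] sizes = {48, 80, 3280, 56, 56, 56, 24, 32, 24, 24, 24, 24, 24};
        System.out.println("bytes: " + totalBytes(sizes));
        System.out.println("lines if contiguous: " + linesIfContiguous(sizes));
        System.out.println("lines if each aligned: " + linesIfEachAligned(sizes));
    }
}
```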

Cache Line Utilization Analysis

A 64-byte cache line can hold one complete Article shallow object (48 bytes), although objects are only 8-byte aligned, so a given Article may also straddle two lines. How much of each loaded line is useful depends on which fields the code accesses.

If the hot path only reads id and viewCount for ranking:

// Hot path: rank articles by view count, highest first
// (needs java.util.stream.IntStream)
public void rankByViews(Article[] articles, int[] resultIndices) {
    // Each Article access pulls at least one 64-byte cache line into cache.
    // The comparator reads only viewCount (offset 12, 4 bytes): 4/64 = 6.25% utilization.
    // Even if id (8 bytes) were read as well: 12/64 = 18.75%.
    int[] sorted = IntStream.range(0, articles.length)
        .boxed()
        .sorted((a, b) -> Integer.compare(
            articles[b].viewCount(),  // Cache line loaded but mostly unused
            articles[a].viewCount()))
        .mapToInt(Integer::intValue)
        .toArray();

    // Write the ranked indices into the caller's array
    System.arraycopy(sorted, 0, resultIndices, 0, sorted.length);
}

Each Article access loads a 64-byte cache line, but the comparator reads only the 4-byte viewCount. Even if id were read as well, only 12 of the 64 bytes would be useful. The rest of the line (title reference, content reference, tags reference, etc.) is wasted cache space.

The struct-of-arrays alternative:

// FAST: Only view counts in cache, 100% utilization for this operation
public class ArticleRankingStore {
    private long[] ids;           // Separate array: only loaded when needed
    private int[] viewCounts;     // 16 ints per 64-byte cache line
    private int size;

    public int[] topByViewCount(int k) {
        int[] indices = new int[size];
        for (int i = 0; i < size; i++) indices[i] = i;

        // Partial sort on viewCounts array
        // Each cache line load provides 16 view counts
        // Cache utilization: 100%
        partialSort(viewCounts, indices, k);
        return Arrays.copyOf(indices, k);
    }
}

With the struct-of-arrays layout, the viewCounts array stores 16 integers per cache line. The hardware prefetcher detects the sequential access pattern and pre-loads cache lines ahead of the iteration. Cache utilization is 100% for this specific operation.
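The partialSort helper is not shown above. As a stand-in, the sketch below swaps in a simpler technique: it packs each view count and its index into one long, sorts the primitive long[] with Arrays.sort, and reads the top k from the end. This is a full O(n log n) sort rather than a true partial sort, and it assumes view counts are non-negative. The class name is hypothetical.

```java
import java.util.Arrays;

public class ViewCountTopK {
    // Returns the indices of the k largest view counts, highest first.
    // Assumption: view counts are non-negative, so the packed keys sort correctly.
    static int[] topByViewCount(int[] viewCounts, int size, int k) {
        int n = Math.max(0, Math.min(k, size));

        // High 32 bits: view count. Low 32 bits: original index.
        long[] keys = new long[size];
        for (int i = 0; i < size; i++) {
            keys[i] = ((long) viewCounts[i] << 32) | i;
        }

        // Sorting a primitive long[] walks contiguous memory and avoids boxing.
        Arrays.sort(keys);

        int[] top = new int[n];
        for (int j = 0; j < n; j++) {
            top[j] = (int) keys[size - 1 - j];  // low 32 bits are the index
        }
        return top;
    }

    public static void main(String[] args) {
        int[] viewCounts = {5, 42, 7, 42, 1};
        System.out.println(Arrays.toString(topByViewCount(viewCounts, viewCounts.length, 3)));
    }
}
```

Among equal view counts, the higher index comes first, because the index occupies the low bits of the sort key. For small k relative to size, a bounded heap or a selection algorithm would do less work.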

Measuring Cache Misses with JMH

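JMH can report hardware performance counters, including cache misses, through its perf-based profilers: -prof perf for raw perf stat output, or -prof perfnorm for counters normalized per benchmark operation. Both are Linux only and require perf to be installed and the kernel.perf_event_paranoid setting to permit counter access: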

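import java.time.Instant;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)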
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CacheMissBenchmark {

    private static final int SIZE = 100_000;

    private Article[] articles;
    private int[] viewCounts;
    private int[] randomOrder;

    @Setup
    public void setup() {
        Random r = new Random(42);
        articles = new Article[SIZE];
        viewCounts = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            int vc = r.nextInt(100_000);
            articles[i] = new Article("title " + i, "content", List.of(), Instant.now(), vc);
            viewCounts[i] = vc;
        }

        // Random access order to defeat prefetcher
        randomOrder = new int[SIZE];
        for (int i = 0; i < SIZE; i++) randomOrder[i] = i;
        for (int i = SIZE - 1; i > 0; i--) {
            int j = r.nextInt(i + 1);
            int tmp = randomOrder[i];
            randomOrder[i] = randomOrder[j];
            randomOrder[j] = tmp;
        }
    }

    @Benchmark
    public long sequentialObjectAccess() {
        long sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += articles[i].viewCount();
        }
        return sum;
    }

    @Benchmark
    public long randomObjectAccess() {
        long sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += articles[randomOrder[i]].viewCount();
        }
        return sum;
    }

    @Benchmark
    public long sequentialArrayAccess() {
        long sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += viewCounts[i];
        }
        return sum;
    }

    @Benchmark
    public long randomArrayAccess() {
        long sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += viewCounts[randomOrder[i]];
        }
        return sum;
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}
}

Results:

Benchmark                        Mode  Cnt       Score      Error  Units
sequentialArrayAccess            avgt   20    12,456 ±      234  ns/op
sequentialObjectAccess           avgt   20    68,234 ±    1,234  ns/op
randomArrayAccess                avgt   20    89,123 ±    2,345  ns/op
randomObjectAccess               avgt   20   412,567 ±    8,901  ns/op

Sequential array access is the fastest (12us): contiguous memory, prefetcher-friendly, maximum cache utilization. Sequential object access is 5.5x slower (68us): each Article is a separate object requiring a pointer dereference.

Random access is where the cache effect dominates. Random array access (89us) is 7x slower than sequential because the prefetcher cannot predict the pattern, but each line fetched holds 16 integers, so the 400KB array is compact enough to have a good chance of staying cache-resident. Random object access (413us) is 33x slower than sequential array access: each access first loads a reference from the Article[] array and then dereferences it, and the cache line fetched for the Article contributes only 4 useful bytes.
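Dividing each score by SIZE turns the results into per-element costs, which are easier to compare against cache latencies. The sketch below only redoes the arithmetic on the numbers in the table; the class and method names are illustrative.

```java
import java.util.Locale;

public class PerElementCost {
    // Average cost of one loop iteration, in nanoseconds.
    static double nsPerElement(double nsPerOp, int elements) {
        return nsPerOp / elements;
    }

    // How many times slower a benchmark is than the baseline.
    static double slowdown(double nsPerOp, double baselineNsPerOp) {
        return nsPerOp / baselineNsPerOp;
    }

    public static void main(String[] args) {
        int size = 100_000;           // SIZE in the benchmark
        double baseline = 12_456;     // sequentialArrayAccess, ns/op
        String[] names = {"sequentialArrayAccess", "sequentialObjectAccess",
                          "randomArrayAccess", "randomObjectAccess"};
        double[] scores = {12_456, 68_234, 89_123, 412_567};

        for (int i = 0; i < names.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                "%-24s %.2f ns/element, %.1fx baseline",
                names[i], nsPerElement(scores[i], size), slowdown(scores[i], baseline)));
        }
    }
}
```

Read this way, the random object case costs about 4 ns per element against roughly 0.1 ns for the sequential array.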

Compact Representations for Hot Data

When a field is accessed in a hot loop but the full object is rarely needed, extract the hot fields into compact parallel arrays:

// SLOW: Full Article objects in the hot path
public class ArticleIndex {
    private final List<Article> articles;

    public List<Article> search(String query, int limit) {
        // BM25 scoring needs: id, title length, content length, viewCount
        // But loads entire Article objects including content strings
        return articles.stream()
            .map(a -> new ScoredArticle(a, scoreBM25(query, a)))
            .sorted(Comparator.comparingDouble(ScoredArticle::score).reversed())
            .limit(limit)
            .map(ScoredArticle::article)
            .toList();
    }
}

// FAST: Compact hot-path data, full objects only for results
public class ArticleIndex {
    // Hot data: packed for cache efficiency
    private long[] ids;
    private int[] titleLengths;
    private int[] contentLengths;
    private int[] viewCounts;

    // Cold data: accessed only for final results
    private Article[] articles;

    private int size;

    public List<Article> search(String query, int limit) {
        double[] scores = new double[size];

        // Hot loop: only touches compact arrays
        // 3 int arrays * 4 bytes/element = 12 bytes per article (ids is not read here)
        // vs 48+ bytes per Article object
        for (int i = 0; i < size; i++) {
            scores[i] = scoreBM25Compact(query, titleLengths[i],
                                          contentLengths[i], viewCounts[i]);
        }

        // Cold path: only for top-k results
        int[] topK = findTopK(scores, limit);
        List<Article> results = new ArrayList<>(limit);
        for (int idx : topK) {
            results.add(articles[idx]);  // Load full Article only for results
        }
        return results;
    }
}

The compact version reads 12 bytes per article in the hot loop (three int arrays) instead of 48+ bytes. For 100,000 articles, the hot input data is 1.2MB, which fits in L2 on CPUs with 1.25MB or more of L2 per core, while the full objects require 4.8MB+ (spilling into L3). The scores array adds another 800KB of sequential writes.
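The findTopK helper is not shown above. One possible shape for it is a bounded min-heap of indices, which keeps only the best `limit` candidates while scanning the scores array once. This is a sketch under that assumption, not the platform's actual implementation. It uses a boxed PriorityQueue for brevity, which is acceptable here because the heap holds at most `limit` entries.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TopKSelection {
    // Returns the indices of the `limit` highest scores, best first.
    static int[] findTopK(double[] scores, int limit) {
        int k = Math.min(limit, scores.length);
        if (k <= 0) return new int[0];

        // Min-heap ordered by score: the weakest retained candidate sits on top.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(k, (a, b) -> Double.compare(scores[a], scores[b]));

        for (int i = 0; i < scores.length; i++) {
            if (heap.size() < k) {
                heap.add(i);
            } else if (scores[i] > scores[heap.peek()]) {
                heap.poll();   // evict the weakest candidate
                heap.add(i);
            }
        }

        // Polling yields ascending scores, so fill the result from the back.
        int[] result = new int[k];
        for (int j = k - 1; j >= 0; j--) {
            result[j] = heap.poll();
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.9, 0.5, 0.7};
        System.out.println(Arrays.toString(findTopK(scores, 2)));
    }
}
```

The scan over scores is sequential, so it keeps the prefetcher-friendly access pattern of the hot loop. Only the final lookups into the Article[] array touch scattered objects.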

Field Access Patterns in Records

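Java records fix component order at the source level: the canonical constructor and accessors follow declaration order. This is a source-level contract, not a layout contract. The JVM still reorders fields for alignment, so declaration order is not a reliable way to control padding. The two records below declare the same fields in different orders; the comments show what each would cost if fields were placed strictly as declared:

// Declared order: int, long, boolean, reference
record ArticleMixed(int viewCount, long id, boolean featured, String title) {}
// As declared: header(12) + viewCount(4) + id(8) + featured(1) + padding(3) + title(4) = 32 bytes

// Declared order: descending size
record ArticleDescending(long id, String title, int viewCount, boolean featured) {}
// As declared: header(12) + padding(4) + id(8) + title(4) + viewCount(4) + featured(1) + padding(7) = 40 bytes

Placed strictly as declared, the descending-size order is the larger of the two, because the 12-byte header leaves a 4-byte gap in front of the 8-byte-aligned long. The JVM avoids this by moving a 4-byte field into that gap, so with compressed references both records should come out at the same size (32 bytes is the expected figure). Verifying with JOL ensures there are no surprises. In practice, the padding difference is 0-8 bytes per object, which matters only for objects allocated in the millions.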
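The arithmetic in the comments above can be checked mechanically. The sketch below compares two numbers for each record: the size if fields were placed strictly in declaration order, and a lower bound that assumes perfect packing. It assumes a 12-byte header, 4-byte references, and 8-byte object alignment. It does not model the JVM's actual layout algorithm, so JOL remains the authority.

```java
public class RecordPaddingCheck {
    // Round n up to the next multiple of alignment (a power of two).
    static int alignUp(int n, int alignment) {
        return (n + alignment - 1) & -alignment;
    }

    // Size if fields are placed in declaration order, each aligned to its own size.
    static int declaredOrderSize(int headerBytes, int... fieldSizes) {
        int offset = headerBytes;
        for (int size : fieldSizes) {
            offset = alignUp(offset, size) + size;
        }
        return alignUp(offset, 8);
    }

    // Lower bound: header plus raw field bytes, rounded up to object alignment.
    static int packedLowerBound(int headerBytes, int... fieldSizes) {
        int total = headerBytes;
        for (int size : fieldSizes) {
            total += size;
        }
        return alignUp(total, 8);
    }

    public static void main(String[] args) {
        // First record: int, long, boolean, reference
        System.out.println("first, as declared:  " + declaredOrderSize(12, 4, 8, 1, 4));
        // Second record: long, reference, int, boolean
        System.out.println("second, as declared: " + declaredOrderSize(12, 8, 4, 4, 1));
        // Same field set in both cases, so the bound is shared
        System.out.println("packed lower bound:  " + packedLowerBound(12, 8, 4, 4, 1));
    }
}
```

If JOL reports a size above the lower bound for a class allocated in the millions, that is the signal to look at its layout more closely.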

Content Platform Application

The content platform’s recommendation engine scores 500 candidate articles per request, 200 requests per second. That is 100,000 scoring operations per second.

Before optimization: each scoring operation loads a full Article object (48 bytes shallow, scattered across multiple cache lines including String and List references). Total cache footprint for scoring: 500 * 48 = 24KB of Article objects, plus pointer chases for fields.

After optimization: scoring uses compact int[] arrays for the three fields needed (titleLength, contentLength, viewCount). Total cache footprint: 500 * 12 = 6KB, fitting entirely in L1 cache (typically 32-48KB per core).

Before: scoring loop = 340us/request  (cache misses: ~2,100)
After:  scoring loop =  62us/request  (cache misses: ~45)
Improvement: 5.5x

The cache miss count dropped from 2,100 to 45. Each eliminated cache miss saves 10-50ns (L2/L3 latency). At 2,100 misses averaging 15ns each, the pointer-chasing version wastes 31.5us per request on cache misses alone.

The compact array version has 45 cache misses (compulsory misses for loading the three arrays), averaging 10ns each (mostly served from L2; once loaded, the 6KB working set stays in L1 for subsequent accesses). That is 450ns of cache miss latency, a 70x reduction.

Object layout is not an abstraction you can ignore. It is not something the JVM “handles for you.” The JVM handles field alignment. It does not handle data organization. Choosing between array-of-objects and struct-of-arrays, between LinkedList and ArrayList, between scattered heap objects and compact arrays: these are your decisions, and they determine whether your code runs at L1 speed or main memory speed.