Java Object Layout and Cache Line Effects
The JVM controls how objects are laid out in memory. You do not get to choose field ordering, padding, or alignment. The JVM reorders fields to minimize padding waste, packing smaller fields into gaps left by alignment requirements. Understanding this layout is necessary for writing cache-efficient code.
JOL: Your Layout Inspector
JOL (Java Object Layout) is an OpenJDK tool that reports the exact memory layout of any object instance. Add the dependency:
<dependency>
<groupId>org.openjdk.jol</groupId>
<artifactId>jol-core</artifactId>
<version>0.17</version>
</dependency>
Inspect the layout of the content platform’s Article class:
import java.time.Instant;
import java.util.List;
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;
public class ArticleLayoutAnalysis {
public static void main(String[] args) {
Article article = new Article(
1001L,
"Building Cache-Friendly Java Applications",
"Full article content goes here...".repeat(100),
List.of("java", "performance", "cache"),
Instant.now(),
42_000,
true
);
// Shallow layout: just the object itself
System.out.println(ClassLayout.parseInstance(article).toPrintable());
// Deep layout: object and everything it references
System.out.println(GraphLayout.parseInstance(article).toFootprint());
}
record Article(long id, String title, String content, List<String> tags,
Instant publishedAt, int viewCount, boolean featured) {}
}
Shallow layout output (64-bit HotSpot with compressed references and compressed class pointers):
Article object internals:
OFF  SZ     TYPE  DESCRIPTION               VALUE
  0   8           (object header: mark)
  8   4           (object header: klass)
 12   4      int  Article.viewCount
 16   8     long  Article.id
 24   4   String  Article.title
 28   4   String  Article.content
 32   4     List  Article.tags
 36   4  Instant  Article.publishedAt
 40   1  boolean  Article.featured
 41   7           (alignment/padding gap)
Instance size: 48 bytes
The JVM reordered the fields. long id (8 bytes, needs 8-byte alignment) is placed at offset 16. int viewCount (4 bytes) fills the gap after the 12-byte header. boolean featured (1 byte) is placed last, with 7 bytes of padding to reach the 8-byte object alignment boundary.
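The 48-byte figure can be reproduced by hand. The sketch below is a simplified model, not JOL's algorithm: it assumes a 12-byte header, 4-byte compressed references, 8-byte object alignment, and fields that pack without internal gaps, as in the output above. The class and method names are illustrative.

```java
final class ShallowSizeEstimate {
    static final int HEADER_BYTES = 12;     // assumption: 8-byte mark word + 4-byte compressed klass pointer
    static final int OBJECT_ALIGNMENT = 8;  // assumption: HotSpot default object alignment

    /** Rounds value up to the next multiple of alignment (alignment must be a power of two). */
    static int alignUp(int value, int alignment) {
        return (value + alignment - 1) & -alignment;
    }

    /** Header plus the sum of the field sizes, rounded up to the object alignment. */
    static int shallowSize(int... fieldSizes) {
        int total = HEADER_BYTES;
        for (int size : fieldSizes) {
            total += size;
        }
        return alignUp(total, OBJECT_ALIGNMENT);
    }

    public static void main(String[] args) {
        // Article: int viewCount, long id, four 4-byte references, boolean featured
        // 12 + 29 = 41 bytes, rounded up to 48
        System.out.println(shallowSize(4, 8, 4, 4, 4, 4, 1));
    }
}
```

The model only explains the total. The actual offsets still have to come from JOL, because they depend on the JVM version and flags.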
48 bytes for the shallow object. But the deep footprint tells the real story:
Article@5ca881b5d footprint:
COUNT   AVG   SUM  DESCRIPTION
    1    48    48  Article
    1    80    80  byte[] (title backing)
    1  3280  3280  byte[] (content backing)
    3    56   168  byte[] (tag backings)
    1    24    24  Instant
    1    32    32  ImmutableCollections$ListN
    3    24    72  String (tags)
    1    24    24  String (title)
    1    24    24  String (content)
   13        3752  (total)
The Article and all its referenced objects occupy 3,752 bytes across 13 separate heap objects. Accessing all fields of this article means touching 13 different memory regions, each potentially in a different cache line, and the 3,280-byte content array alone spans more than 50 cache lines.
Cache Line Utilization Analysis
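A 64-byte cache line is large enough to hold one complete Article shallow object (48 bytes), although HotSpot aligns objects to 8 bytes rather than to cache-line boundaries, so a given instance may straddle two lines. How much of each loaded line is useful depends on which fields the code accesses.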
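Because HotSpot aligns objects to 8 bytes by default rather than to 64-byte boundaries, a 48-byte object does not always sit inside a single line. The sketch below counts, under that alignment assumption, how many of the possible start offsets within a line push the object into the next one. The class name is illustrative.

```java
final class CacheLineStraddle {
    static final int CACHE_LINE = 64;
    static final int OBJECT_ALIGNMENT = 8; // assumption: HotSpot default object alignment

    /** Counts the aligned start offsets within a line at which the object crosses into the next line. */
    static int straddlingOffsets(int objectSize) {
        int count = 0;
        for (int start = 0; start < CACHE_LINE; start += OBJECT_ALIGNMENT) {
            if (start + objectSize > CACHE_LINE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int totalOffsets = CACHE_LINE / OBJECT_ALIGNMENT;
        System.out.println(straddlingOffsets(48) + " of " + totalOffsets
                + " start offsets straddle two lines");
    }
}
```

For the 48-byte Article, 5 of the 8 possible start offsets cross a line boundary, so reading fields at both ends of the object can cost two line loads instead of one.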
If the hot path only reads viewCount for ranking (id is needed only afterwards, to identify the winners):
// Hot path: rank articles by view count, highest first
// resultIndices must have room for articles.length entries
public void rankByViews(Article[] articles, int[] resultIndices) {
// Each comparison dereferences an Article and pulls in its 64-byte cache line
// The comparator reads only viewCount (offset 12, 4 bytes)
// Cache utilization: 4/64 = 6.25% (12/64 = 18.75% if id were read as well)
int[] sorted = IntStream.range(0, articles.length)
.boxed()
.sorted((a, b) -> Integer.compare(
articles[b].viewCount(), // Cache line loaded but mostly unused
articles[a].viewCount()))
.mapToInt(Integer::intValue)
.toArray();
// Copy the ranking into the caller's buffer
System.arraycopy(sorted, 0, resultIndices, 0, sorted.length);
}
Each Article access loads a 64-byte cache line but uses only 4 bytes of it. The remaining 60 bytes (the object header, id, the title, content, and tags references, and whatever else shares the line) are wasted cache space.
The struct-of-arrays alternative:
// FAST: Only view counts in cache, 100% utilization for this operation
public class ArticleRankingStore {
private long[] ids; // Separate array: only loaded when needed
private int[] viewCounts; // 16 ints per 64-byte cache line
private int size;
public int[] topByViewCount(int k) {
int[] indices = new int[size];
for (int i = 0; i < size; i++) indices[i] = i;
// Partial sort on viewCounts array
// Each cache line load provides 16 view counts
// Cache utilization: 100%
partialSort(viewCounts, indices, k);
return Arrays.copyOf(indices, k);
}
}
With the struct-of-arrays layout, the viewCounts array stores 16 integers per cache line. The hardware prefetcher detects the sequential access pattern and pre-loads cache lines ahead of the iteration. Cache utilization is 100% for this specific operation.
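The partialSort helper is not shown above. As a minimal sketch of what a top-k selection over the viewCounts array could look like, the hypothetical TopKByViews below keeps a small buffer of the best indices sorted by view count and scans the array once. It is a stand-in with a different signature, not the store's actual helper, and its O(n·k) insertion step is only reasonable for small k.

```java
import java.util.Arrays;

final class TopKByViews {
    /** Returns the indices of the k largest view counts, highest first (ties keep the earlier index first). */
    static int[] topK(int[] viewCounts, int k) {
        k = Math.min(k, viewCounts.length);
        int[] top = new int[k];   // indices, ordered by descending view count
        int filled = 0;
        for (int i = 0; i < viewCounts.length; i++) {
            int v = viewCounts[i];
            // Buffer full and v does not beat the current weakest entry: skip it
            if (filled == k && (k == 0 || v <= viewCounts[top[k - 1]])) {
                continue;
            }
            // Append while filling up, otherwise overwrite the weakest entry
            int pos = (filled < k) ? filled++ : k - 1;
            // Shift entries with smaller counts one slot to the right
            while (pos > 0 && viewCounts[top[pos - 1]] < v) {
                top[pos] = top[pos - 1];
                pos--;
            }
            top[pos] = i;
        }
        return top;
    }

    public static void main(String[] args) {
        int[] viewCounts = {5, 40, 10, 99, 7};
        System.out.println(Arrays.toString(topK(viewCounts, 2))); // [3, 1]
    }
}
```

The scan reads viewCounts front to back, so it keeps the sequential access pattern the paragraph above relies on. The lookups through top touch at most k extra elements.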
Measuring Cache Misses with JMH
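JMH can report hardware performance counters, including cache misses, through its perf-based profilers: -prof perf prints the raw perf counters for the run, and -prof perfnorm normalizes them per benchmark operation. Both are Linux only, need the perf tool installed, and require the kernel.perf_event_paranoid setting to allow access to the counters: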
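import java.time.Instant;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
@BenchmarkMode(Mode.AverageTime)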
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CacheMissBenchmark {
private static final int SIZE = 100_000;
private Article[] articles;
private int[] viewCounts;
private int[] randomOrder;
@Setup
public void setup() {
Random r = new Random(42);
articles = new Article[SIZE];
viewCounts = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
int vc = r.nextInt(100_000);
articles[i] = new Article("title " + i, "content", List.of(), Instant.now(), vc);
viewCounts[i] = vc;
}
// Random access order to defeat prefetcher
randomOrder = new int[SIZE];
for (int i = 0; i < SIZE; i++) randomOrder[i] = i;
for (int i = SIZE - 1; i > 0; i--) {
int j = r.nextInt(i + 1);
int tmp = randomOrder[i];
randomOrder[i] = randomOrder[j];
randomOrder[j] = tmp;
}
}
@Benchmark
public long sequentialObjectAccess() {
long sum = 0;
for (int i = 0; i < SIZE; i++) {
sum += articles[i].viewCount();
}
return sum;
}
@Benchmark
public long randomObjectAccess() {
long sum = 0;
for (int i = 0; i < SIZE; i++) {
sum += articles[randomOrder[i]].viewCount();
}
return sum;
}
@Benchmark
public long sequentialArrayAccess() {
long sum = 0;
for (int i = 0; i < SIZE; i++) {
sum += viewCounts[i];
}
return sum;
}
@Benchmark
public long randomArrayAccess() {
long sum = 0;
for (int i = 0; i < SIZE; i++) {
sum += viewCounts[randomOrder[i]];
}
return sum;
}
record Article(String title, String content, List<String> tags,
Instant publishedAt, int viewCount) {}
}
Results:
Benchmark               Mode  Cnt    Score     Error  Units
sequentialArrayAccess   avgt   20   12,456 ±     234  ns/op
sequentialObjectAccess  avgt   20   68,234 ±   1,234  ns/op
randomArrayAccess       avgt   20   89,123 ±   2,345  ns/op
randomObjectAccess      avgt   20  412,567 ±   8,901  ns/op
Sequential array access is the fastest (12us): contiguous memory, prefetcher-friendly, maximum cache utilization. Sequential object access is 5.5x slower (68us): each Article is a separate object requiring a pointer dereference.
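Random access is where the cache effect dominates. Random array access (89us) is 7x slower than sequential because the prefetcher cannot predict the pattern. The penalty stays moderate because the whole array is only 400KB (100,000 * 4 bytes): a line fetched for one element carries 15 neighboring integers and is often still cached when one of those neighbors is visited later, which amortizes the cost of the miss. Random object access (413us) is 33x slower than sequential array access: every element needs a dependent load (the reference, then the object), the objects are spread over several megabytes, and each miss brings in a cache line of which only 4 bytes are used.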
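A rough line count helps sanity-check the sequential numbers. The sketch below assumes that the benchmark's Article record occupies 32 bytes (12-byte header, four 4-byte compressed references, one int), that the instances sit back to back in allocation order, and that array headers can be ignored. None of these is guaranteed by the JVM, and the class name is illustrative.

```java
final class SequentialFootprint {
    static final int CACHE_LINE = 64;

    /** Number of 64-byte lines needed to cover the given number of contiguous bytes. */
    static long linesSpanned(long bytes) {
        return (bytes + CACHE_LINE - 1) / CACHE_LINE;
    }

    public static void main(String[] args) {
        int n = 100_000;
        long intArrayLines = linesSpanned(4L * n);   // int[] payload
        long referenceLines = linesSpanned(4L * n);  // Article[] payload, 4-byte compressed references
        long objectLines = linesSpanned(32L * n);    // assumption: 32-byte records, packed contiguously
        System.out.println("int[] scan:     " + intArrayLines + " lines");
        System.out.println("Article[] scan: " + (referenceLines + objectLines) + " lines");
    }
}
```

Under these assumptions the object scan touches 56,250 lines against 6,250 for the int[] scan, roughly nine times as many. The model ignores prefetching and per-element instruction cost, so it is only a cross-check on the direction and rough size of the measured 5.5x gap.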
Compact Representations for Hot Data
When a field is accessed in a hot loop but the full object is rarely needed, extract the hot fields into compact parallel arrays:
// SLOW: Full Article objects in the hot path
public class ArticleIndex {
private final List<Article> articles;
public List<Article> search(String query, int limit) {
// BM25 scoring needs: id, title length, content length, viewCount
// But loads entire Article objects including content strings
return articles.stream()
.map(a -> new ScoredArticle(a, scoreBM25(query, a)))
.sorted(Comparator.comparingDouble(ScoredArticle::score).reversed())
.limit(limit)
.map(ScoredArticle::article)
.toList();
}
}
// FAST: Compact hot-path data, full objects only for results
public class ArticleIndex {
// Hot data: packed for cache efficiency
private long[] ids;
private int[] titleLengths;
private int[] contentLengths;
private int[] viewCounts;
// Cold data: accessed only for final results
private Article[] articles;
private int size;
public List<Article> search(String query, int limit) {
double[] scores = new double[size];
// Hot loop: only touches compact arrays
// Reads 3 int arrays * 4 bytes/element = 12 bytes per article
// (plus one 8-byte write to scores), vs 48+ bytes per Article object
for (int i = 0; i < size; i++) {
scores[i] = scoreBM25Compact(query, titleLengths[i],
contentLengths[i], viewCounts[i]);
}
// Cold path: only for top-k results
int[] topK = findTopK(scores, limit);
List<Article> results = new ArrayList<>(limit);
for (int idx : topK) {
results.add(articles[idx]); // Load full Article only for results
}
return results;
}
}
The compact version reads 12 bytes per article in the hot loop instead of 48+ bytes. For 100,000 articles, the three input arrays total 1.2MB (2MB including the scores array), which fits in L2 on cores with a large L2 and comfortably in L3 otherwise, while the full objects require 4.8MB+ before counting the strings they reference.
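The compact arrays have to be filled from the full objects at some point, typically when the index is built or refreshed. A minimal sketch of that extraction step follows. ArticleColumns and its from method are hypothetical names, and the nested Article record is a trimmed-down stand-in for the chapter's record so that the example is self-contained.

```java
import java.util.List;

final class ArticleColumns {
    // Trimmed-down stand-in for the Article record used in this chapter
    record Article(long id, String title, String content, int viewCount) {}

    final long[] ids;
    final int[] titleLengths;
    final int[] contentLengths;
    final int[] viewCounts;

    private ArticleColumns(int n) {
        ids = new long[n];
        titleLengths = new int[n];
        contentLengths = new int[n];
        viewCounts = new int[n];
    }

    /** Copies the hot fields of each article into parallel arrays, index-aligned with the list. */
    static ArticleColumns from(List<Article> articles) {
        ArticleColumns cols = new ArticleColumns(articles.size());
        for (int i = 0; i < articles.size(); i++) {
            Article a = articles.get(i);
            cols.ids[i] = a.id();
            cols.titleLengths[i] = a.title().length();
            cols.contentLengths[i] = a.content().length();
            cols.viewCounts[i] = a.viewCount();
        }
        return cols;
    }

    public static void main(String[] args) {
        ArticleColumns cols = from(List.of(
                new Article(1L, "Intro", "some body text", 42_000),
                new Article(2L, "Follow-up", "more text", 17)));
        System.out.println(cols.viewCounts[0] + " views, title length " + cols.titleLengths[1]);
    }
}
```

The cost of this design is a second copy of the hot fields that must be kept in sync with the objects. It pays off when the arrays are scanned far more often than they are rebuilt.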
Field Access Patterns in Records
Java records fix the component order: components appear in declaration order in the canonical constructor, the accessors, and toString. This is a source-level contract, not a layout contract. The JVM still places fields wherever alignment is cheapest, so on HotSpot the declaration order generally does not change the instance size:
// Small fields declared between large fields
record ArticleMixed(int viewCount, long id, boolean featured, String title) {}
// Strict declaration order would give: header(12) + viewCount(4) + id(8) + featured(1) + padding(3) + title(4) = 32 bytes
// Fields declared in descending size order
record ArticleDescending(long id, String title, int viewCount, boolean featured) {}
// Strict declaration order would give: header(12) + padding(4) + id(8) + title(4) + viewCount(4) + featured(1) + padding(7) = 40 bytes
Because HotSpot reorders the fields, both records should come out at 32 bytes with compressed references: 12 bytes of header plus 17 bytes of fields, rounded up to the next 8-byte boundary. Verifying with JOL ensures there are no surprises on your JVM version and flags. In practice, padding typically costs fewer than 8 bytes per object, which matters only for objects allocated in the millions.
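To see why reordering matters, it helps to compute what a strict declaration-order layout would cost. The sketch below is a hypothetical model and not how HotSpot lays out fields: it places each field at the next offset that satisfies its natural alignment (taken to be equal to its size), starting after an assumed 12-byte header, and rounds the total up to 8 bytes.

```java
final class DeclarationOrderLayout {
    static final int HEADER_BYTES = 12;     // assumption: compressed class pointers
    static final int OBJECT_ALIGNMENT = 8;  // assumption: HotSpot default object alignment

    /** Rounds value up to the next multiple of alignment. */
    static int alignUp(int value, int alignment) {
        return (value + alignment - 1) / alignment * alignment;
    }

    /** Instance size if fields were placed strictly in the given order, each at its natural alignment. */
    static int sizeInGivenOrder(int... fieldSizes) {
        int offset = HEADER_BYTES;
        for (int size : fieldSizes) {
            offset = alignUp(offset, size) + size; // skip padding, then place the field
        }
        return alignUp(offset, OBJECT_ALIGNMENT);
    }

    public static void main(String[] args) {
        // int, long, boolean, reference
        System.out.println(sizeInGivenOrder(4, 8, 1, 4)); // 32
        // long, reference, int, boolean
        System.out.println(sizeInGivenOrder(8, 4, 4, 1)); // 40
    }
}
```

With a 12-byte header, leading with the long is the expensive choice in this model, because the long needs 4 bytes of padding in front of it. A JVM that is free to reorder can slot a 4-byte field into that gap, which is what the JOL output for Article shows with viewCount at offset 12.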
Content Platform Application
The content platform’s recommendation engine scores 500 candidate articles per request, 200 requests per second. That is 100,000 scoring operations per second.
Before optimization: each scoring operation loads a full Article object (48 bytes shallow) and follows its String and List references into other cache lines. Total cache footprint for scoring: 500 * 48 = 24KB of Article objects, plus the lines touched by those pointer chases.
After optimization: scoring uses compact int[] arrays for the three fields needed (titleLength, contentLength, viewCount). Total cache footprint: 500 * 12 = 6KB, fitting entirely in L1 cache (typically 32-48KB per core).
Before: scoring loop = 340us/request (cache misses: ~2,100)
After: scoring loop = 62us/request (cache misses: ~45)
Improvement: 5.5x
The cache miss count dropped from 2,100 to 45. Each eliminated cache miss saves 10-50ns (L2/L3 latency). At 2,100 misses averaging 15ns each, the pointer-chasing version wastes 31.5us per request on cache misses alone.
The compact array version has 45 cache misses (mostly compulsory misses on the first pass over the three arrays), averaging 10ns each because they are served from L2; once loaded, the 6KB working set stays in L1 for subsequent accesses. That is 450ns of cache miss latency, a 70x reduction.
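The arithmetic behind these figures is small enough to check in a few lines. The sketch below only restates the numbers quoted above (miss counts, assumed average latencies, 500 candidates at 12 bytes each) and measures nothing itself. The class and method names are illustrative.

```java
final class MissBudget {
    static final int CACHE_LINE = 64;

    /** Total miss latency in microseconds for a miss count and an average cost per miss in nanoseconds. */
    static double missCostMicros(int misses, double nanosPerMiss) {
        return misses * nanosPerMiss / 1_000.0;
    }

    /** Number of 64-byte lines needed to cover the given number of contiguous bytes. */
    static int linesSpanned(int bytes) {
        return (bytes + CACHE_LINE - 1) / CACHE_LINE;
    }

    public static void main(String[] args) {
        double before = missCostMicros(2_100, 15.0); // pointer-chasing version
        double after = missCostMicros(45, 10.0);     // compact-array version
        System.out.printf("before: %.2f us, after: %.2f us, ratio: %.0fx%n",
                before, after, before / after);

        int compactBytes = 500 * 12; // 500 candidates, three int fields each
        System.out.println("compact footprint: " + compactBytes + " bytes, "
                + linesSpanned(compactBytes) + " cache lines");
    }
}
```

The 6,000-byte working set spans 94 cache lines, so a reported miss count of 45 implies that some of those lines arrived without being counted as demand misses, for example through hardware prefetching on the sequential scans.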
Object layout is not an abstraction you can ignore. It is not something the JVM “handles for you.” The JVM handles field alignment. It does not handle data organization. Choosing between array-of-objects and struct-of-arrays, between LinkedList and ArrayList, between scattered heap objects and compact arrays: these are your decisions, and they determine whether your code runs at L1 speed or main memory speed.