Inlining, Devirtualization, and the Megamorphic Trap
Inlining, Devirtualization, and the Megamorphic Trap
A method call in Java costs between 2 and 25 nanoseconds depending on whether the JIT can inline it. The difference is not the call overhead itself (which is 3-5ns for a virtual call). The difference is what inlining enables: escape analysis, constant folding, loop-invariant code motion, and dead code elimination. When a call site goes megamorphic, all of these downstream optimizations are lost.
The Dispatch Cost Hierarchy
The JVM uses four dispatch mechanisms, each with different costs:
- Static dispatch (static, private, constructor calls): 0ns overhead. Direct call, always inlined.
- Monomorphic dispatch (one observed type at the call site): ~1ns overhead. Type guard + direct call, inlined.
- Bimorphic dispatch (two observed types): ~3ns overhead. Type guard + conditional branch, both paths inlined.
- Megamorphic dispatch (three or more types): ~10-15ns overhead. vtable/itable lookup, NOT inlined.
The cost of megamorphic dispatch is not just the vtable lookup. It is the loss of inlining. Without inlining, the JIT treats the callee as a black box. Objects passed to a megamorphic call site are assumed to escape. Allocations cannot be eliminated. Constants cannot be folded across the call boundary.
Benchmarking Call Site Polymorphism
This benchmark isolates the cost of call site polymorphism. Three implementations of a ContentScorer interface compute a score differently, but all do the same amount of work:
public interface ContentScorer {
long score(String content);
}
public final class LengthScorer implements ContentScorer {
@Override
public long score(String content) {
return content.length();
}
}
public final class HashScorer implements ContentScorer {
@Override
public long score(String content) {
return content.hashCode();
}
}
public final class WordCountScorer implements ContentScorer {
@Override
public long score(String content) {
long count = 0;
for (int i = 0; i < content.length(); i++) {
if (content.charAt(i) == ' ') count++;
}
return count + 1;
}
}
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class DispatchCostBenchmark {
private String content;
private ContentScorer monoScorer;
private ContentScorer[] biScorerArray;
private ContentScorer[] megaScorerArray;
private int index;
@Setup
public void setup() {
content = "Performance engineering requires measurement not opinion ".repeat(20);
monoScorer = new LengthScorer();
biScorerArray = new ContentScorer[]{
new LengthScorer(),
new HashScorer()
};
megaScorerArray = new ContentScorer[]{
new LengthScorer(),
new HashScorer(),
new WordCountScorer()
};
index = 0;
}
@Benchmark
public long monomorphic() {
return monoScorer.score(content);
}
@Benchmark
public long bimorphic() {
ContentScorer scorer = biScorerArray[index++ & 1];
return scorer.score(content);
}
@Benchmark
public long megamorphic() {
ContentScorer scorer = megaScorerArray[index++ % 3];
return scorer.score(content);
}
}
Results:
Benchmark Mode Cnt Score Error Units
DispatchCostBenchmark.monomorphic avgt 30 2.14 ± 0.08 ns/op
DispatchCostBenchmark.bimorphic avgt 30 7.89 ± 0.23 ns/op
DispatchCostBenchmark.megamorphic avgt 30 24.67 ± 0.45 ns/op
Monomorphic: 2.14ns. The call is devirtualized and inlined. LengthScorer.score() becomes a direct field access on the String object. The JIT sees through the interface call completely.
Bimorphic: 7.89ns. The JIT inserts a type guard: if (scorer instanceof LengthScorer) then inline LengthScorer path, else inline HashScorer path. Both paths are inlined, but the branch costs ~3ns and the guard check costs ~2ns.
Megamorphic: 24.67ns. The JIT gives up on inlining. Each call goes through the itable (interface method table). The itable lookup requires: load object header -> load klass pointer -> scan itable for interface -> load method pointer -> indirect call. This chain of dependent loads is 15-20ns on modern hardware.
The 12x gap between monomorphic and megamorphic is not the call overhead. It is the lost optimization. LengthScorer.score() compiles to a single instruction when inlined. Through the itable, it requires a full method call with stack frame setup and teardown.
Type Profile Pollution
Type profiles are per-call-site, not per-method. A call site that sees three types is megamorphic even if each type appears in different code paths that never execute together.
// SLOW: Initialization pollutes the type profile
public class ArticleService {
private final ContentScorer scorer;
public ArticleService(ContentScorer scorer) {
this.scorer = scorer;
}
public void initialize() {
// Warm up with test data using different scorer types
List<ContentScorer> testScorers = List.of(
new LengthScorer(),
new HashScorer(),
new WordCountScorer()
);
for (ContentScorer s : testScorers) {
s.score("test"); // This call site records 3 types
}
}
public long scoreArticle(String content) {
return scorer.score(content); // Different call site, unaffected
}
}
In this example, the initialize() method pollutes its own call site, but scoreArticle() has a separate call site that remains monomorphic. Type profile pollution happens when the same call site in the same method sees multiple types.
The dangerous pattern:
// SLOW: Single call site sees all processor types -> megamorphic
public void processAll(List<ContentProcessor> processors, Article article) {
for (ContentProcessor p : processors) {
p.process(article); // One call site, N types -> megamorphic if N >= 3
}
}
// FAST: Separate the hot path from the cold path
public void processAll(List<ContentProcessor> processors, Article article) {
for (ContentProcessor p : processors) {
if (p instanceof HtmlProcessor hp) {
hp.process(article); // Monomorphic call site
} else if (p instanceof JsonProcessor jp) {
jp.process(article); // Monomorphic call site
} else {
p.process(article); // Megamorphic, but cold path
}
}
}
The fast version uses instanceof pattern matching to create separate monomorphic call sites for the common types. The JIT inlines each concrete call. The fallback p.process() is megamorphic but rarely executed.
This pattern is ugly. It breaks the open-closed principle. It couples the caller to concrete types. But in a hot loop processing millions of articles per second, the 12x performance difference justifies the trade-off.
Inlining Depth Limits
The JVM limits inline nesting to -XX:MaxInlineLevel=15. This sounds generous, but framework code consumes inline depth quickly.
Consider a Spring-based content serving path:
Level 1: DispatcherServlet.doDispatch()
Level 2: HandlerAdapter.handle()
Level 3: RequestMappingHandlerAdapter.invokeHandlerMethod()
Level 4: ServletInvocableHandlerMethod.invokeAndHandle()
Level 5: InvocableHandlerMethod.invokeForRequest()
Level 6: InvocableHandlerMethod.doInvoke()
Level 7: ArticleController.getArticle() <- YOUR CODE
Level 8: ArticleService.findById()
Level 9: ArticleCache.get()
Level 10: ConcurrentHashMap.get()
Level 11: ConcurrentHashMap.tabAt()
Your business logic starts at level 7. By level 11, the inline chain reaches into ConcurrentHashMap internals. With the default limit of 15, there are only 4 levels remaining for any deeper calls.
If ArticleService.findById() calls a recommendation engine that calls a scoring function that calls a utility method, the deepest methods in the chain are not inlined. Those methods cannot benefit from escape analysis.
// SLOW: Deep call chain, inlining stops before the inner loop
public List<Article> getRecommendations(Article article) {
return recommendationEngine // Level 8
.rank( // Level 9
candidateSelector // Level 10
.select( // Level 11
featureExtractor // Level 12
.extract( // Level 13
article // Level 14
)
)
);
}
// FAST: Flattened call chain, all code inlinable
public List<Article> getRecommendations(Article article) {
double[] features = extractFeatures(article); // Level 8
List<Article> candidates = selectCandidates(features); // Level 9
return rankCandidates(candidates, features); // Level 10
}
The flattened version reaches level 10. Each method is small enough to inline. The JIT sees the entire recommendation pipeline in a single compilation unit.
Interface vs Abstract Class Dispatch
Interface dispatch (itable) is slower than class dispatch (vtable) because the itable requires a linear scan to find the correct interface method entry:
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class InterfaceVsClassDispatch {
interface Scorer { long score(String s); }
abstract static class AbstractScorer { abstract long score(String s); }
static final class InterfaceImpl implements Scorer {
@Override public long score(String s) { return s.length(); }
}
static final class AbstractImpl extends AbstractScorer {
@Override long score(String s) { return s.length(); }
}
private Scorer interfaceRef;
private AbstractScorer abstractRef;
private String content;
@Setup
public void setup() {
interfaceRef = new InterfaceImpl();
abstractRef = new AbstractImpl();
content = "Performance test content for dispatch benchmark";
}
@Benchmark
public long interfaceCall() {
return interfaceRef.score(content);
}
@Benchmark
public long abstractCall() {
return abstractRef.score(content);
}
}
With monomorphic call sites, both are inlined and the difference vanishes (both ~2ns). The difference appears only at the megamorphic level:
Benchmark Mode Cnt Score Error Units
InterfaceVsClassDispatch.interfaceCall avgt 30 2.12 ± 0.06 ns/op (mono)
InterfaceVsClassDispatch.abstractCall avgt 30 2.08 ± 0.05 ns/op (mono)
# With megamorphic pollution (3+ implementations):
InterfaceVsClassDispatch.interfaceCall avgt 30 26.34 ± 0.89 ns/op (mega)
InterfaceVsClassDispatch.abstractCall avgt 30 21.78 ± 0.72 ns/op (mega)
At the megamorphic level, vtable dispatch (abstract class) is ~18% faster than itable dispatch (interface) because vtable entries are at fixed offsets, while itable entries require a search.
This does not mean you should use abstract classes instead of interfaces. Use interfaces for design clarity. But if a hot-path interface call goes megamorphic, understand that the itable scan adds to the cost.
Diagnosing Inlining in the Content Platform
The content platform’s article serving hot path processes 200 requests per second. Each request touches the article cache, the recommendation engine, and the response serializer.
To check inlining behavior:
java -XX:+UnlockDiagnosticVMOptions \
-XX:+PrintInlining \
-XX:+PrintCompilation \
-jar content-platform.jar 2>&1 | grep "ArticleService"
Output:
@ 12 ArticleCache::get (18 bytes) inline (hot)
@ 23 Article::getTitle (5 bytes) accessor
@ 28 Article::getContent (5 bytes) accessor
@ 45 RecommendationEngine::rank (412 bytes) hot method too big
@ 67 ResponseBuilder::build (234 bytes) inline (hot)
RecommendationEngine::rank is 412 bytecodes, exceeding FreqInlineSize=325. It is not inlined. Every object allocated inside rank() is heap-allocated because escape analysis cannot see across the call boundary.
The fix: break rank() into smaller methods. Extract the scoring loop, the sorting step, and the truncation step into separate methods, each under 325 bytecodes. The JIT inlines each piece, and escape analysis can eliminate the intermediate Score objects.
// SLOW: Single large method, not inlinable
public List<Article> rank(List<Article> candidates, double[] features) {
// 412 bytecodes: scoring + sorting + truncation all in one method
List<Score> scores = new ArrayList<>();
for (Article a : candidates) {
double score = 0.0;
for (int i = 0; i < features.length; i++) {
score += features[i] * getFeature(a, i);
}
scores.add(new Score(a, score)); // Cannot be scalar-replaced
}
scores.sort(Comparator.comparingDouble(Score::value).reversed());
return scores.stream().limit(10).map(Score::article).toList();
}
// FAST: Decomposed into inlinable methods
public List<Article> rank(List<Article> candidates, double[] features) {
double[] scores = computeScores(candidates, features); // Inlined
int[] topIndices = findTopK(scores, 10); // Inlined
return gatherResults(candidates, topIndices); // Inlined
}
private double[] computeScores(List<Article> candidates, double[] features) {
double[] scores = new double[candidates.size()];
for (int i = 0; i < candidates.size(); i++) {
scores[i] = dotProduct(features, candidates.get(i));
}
return scores;
}
The decomposed version uses primitive arrays instead of boxed Score objects, eliminating allocation entirely. Each method is under 100 bytecodes, trivially inlinable. The JIT compiles the entire rank() call chain into a single native method with no allocations.
The 12x gap between monomorphic and megamorphic dispatch is the single largest JIT optimization effect you can control. Keep your hot call sites monomorphic. Keep your hot methods small. The JIT will do the rest.