the invisible-layer how abstraction is making software engineers dumber

The Flamegraph: Your Performance X-Ray

9 min read Chapter 33 of 56
Summary

Explains what flamegraphs are, how to generate them across Python, Java, Go, and Linux-native tooling, walks through interpreting a realistic flamegraph to find a JSON serialization bottleneck, and covers memory profiling and the observer effect in production profiling.

Brendan Gregg created the flamegraph in 2011 while investigating a MySQL performance problem at Joyent. He had profiler output (collected with DTrace) containing thousands of stack samples, but the text output was unreadable. He wrote a Perl script to convert stack traces into an SVG visualization where bar width represents the share of samples and the y-axis represents call stack depth, and accidentally created what is arguably the most important performance visualization in the history of software.

Before flamegraphs, understanding where CPU time went required reading profiler output that looked like accounting spreadsheets — flat lists of function names with percentages. You could see that processRequest() consumed 45% of CPU, but you couldn’t see why. Was it the function itself, or something it called, or something called three layers deep? Flamegraphs made the answer visual and immediate.

What a Flamegraph Actually Shows

A flamegraph is a visualization of sampled call stacks. The profiler interrupts the running program at fixed intervals (typically 99 times per second, or once every ~10ms) and records the current call stack — which function is running, which function called it, which function called that, all the way up to the entry point.
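
To make the sampling loop concrete, here is a minimal in-process sketch in Python. It is not how py-spy or perf work internally (those sample from outside the process or from the kernel); it only illustrates the idea of waking up at a fixed rate and walking the call stack. The functions sample_stacks, busy_work, and profile_busy_work are hypothetical names invented for this example.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval_s, stop_event, samples):
    """Record the target thread's call stack every interval_s seconds."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back  # walk from the running function to its caller
        if stack:
            # Store root-first so the entry point is the first element.
            samples[tuple(reversed(stack))] += 1
        time.sleep(interval_s)


def busy_work(duration_s):
    """Hypothetical CPU-bound workload used as the profiling target."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += sum(range(100))
    return total


def profile_busy_work(duration_s=0.3, hz=99):
    """Sample the calling thread at roughly `hz` while busy_work runs."""
    samples = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), 1.0 / hz, stop_event, samples),
    )
    sampler.start()
    try:
        busy_work(duration_s)
    finally:
        stop_event.set()
        sampler.join()
    return samples


if __name__ == "__main__":
    for stack, count in profile_busy_work().most_common(3):
        print(count, ";".join(stack))
```

Each key in the resulting counter is one distinct stack and each value is how many times it was seen, which is exactly the raw material a flamegraph is built from.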

After collecting thousands of these samples, the tool merges identical stacks and renders them as a stacked bar chart:

  • Each bar is a function. Its width is proportional to the number of samples where that function appeared in the stack — either running directly or as an ancestor of the running function.
  • The y-axis is stack depth. The bottom bar is the entry point (usually main() or the thread’s start function). Each bar above it is a function called by the bar below.
  • Width equals time. A function that appears in 60% of samples takes a bar that’s 60% of the total width. This is the critical insight: wide bars are where the time goes.
  • The x-axis is not time. This confuses everyone at first. The x-axis is alphabetically sorted, not chronologically ordered. Adjacent bars at the same level are siblings or different code paths, not sequential operations. The only dimension that matters is width.
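  • Color is typically arbitrary: random warm tones that distinguish adjacent bars. Some tools use color to encode frame type (Brendan Gregg’s Java palette, for example, draws Java frames green, C++ yellow, kernel orange, and other native user code red), and off-CPU flamegraphs are conventionally blue, but the default is just visual differentiation.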
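
The merge-and-measure step described in the list above can be sketched in a few lines of Python. The sample stacks are made up for illustration, and fold and frame_widths are hypothetical helper names; real tools such as stackcollapse-perf.pl produce the same "a;b;c count" folded format from actual profiler output.

```python
from collections import Counter

# Hypothetical samples: each tuple is one captured stack, root first.
samples = [
    ("main", "handle_request", "parse_request"),
    ("main", "handle_request", "serialize_response", "json_dumps"),
    ("main", "handle_request", "serialize_response", "json_dumps"),
    ("main", "handle_request", "serialize_response", "format_products"),
    ("main", "handle_request", "run_ml_inference"),
]


def fold(stacks):
    """Merge identical stacks into 'a;b;c' -> count (the folded format)."""
    return Counter(";".join(stack) for stack in stacks)


def frame_widths(folded):
    """Return each frame's share of all samples, keyed by its path from the root."""
    total = sum(folded.values())
    inclusive = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        # A sample counts toward every ancestor on its stack, not just the top frame.
        for depth in range(1, len(frames) + 1):
            inclusive[";".join(frames[:depth])] += count
    return {path: count / total for path, count in inclusive.items()}


folded = fold(samples)
widths = frame_widths(folded)

# Siblings are ordered alphabetically, not in the order they ran.
for path in sorted(widths, key=lambda p: p.split(";")):
    depth = path.count(";")
    name = path.rsplit(";", 1)[-1]
    print(f"{'  ' * depth}{name}: {widths[path]:.0%}")
```

Note that nothing in the data records when a sample was taken, which is why the x-axis cannot be read as a timeline.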

A plateau — a wide bar that doesn’t narrow as you go up the stack — means a single function is directly consuming that CPU time. It’s doing computation, not delegating. This is a hot function.

A tower — a narrow column many bars deep — means a code path with many layers of indirection but little total time. Deep stacks aren’t inherently bad; they’re only problematic if they’re also wide.

A wide bar that narrows into many children means a function that delegates to many different code paths. The parent aggregates time from many callees. To optimize, you’d investigate each child independently.
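
One way to tell these three shapes apart numerically is to compare a function’s self time (samples where it is on top of the stack) with its total time (samples where it appears anywhere). The sketch below uses invented folded stacks and a hypothetical helper, self_and_total, purely to illustrate the distinction.

```python
from collections import Counter

# Hypothetical folded stacks (root first) with sample counts.
folded = {
    "main;handle;compress": 55,          # wide, and compress itself is on top
    "main;handle;route;auth;check": 3,   # deep but narrow
    "main;handle;render;escape": 20,
    "main;handle;render;join": 22,
}


def self_and_total(folded_stacks):
    """Count samples where a function is on top (self) vs anywhere in the stack (total)."""
    self_samples, total_samples = Counter(), Counter()
    for stack, count in folded_stacks.items():
        frames = stack.split(";")
        self_samples[frames[-1]] += count
        for name in set(frames):  # set() avoids double-counting recursive frames
            total_samples[name] += count
    return self_samples, total_samples


self_samples, total_samples = self_and_total(folded)
all_samples = sum(folded.values())

for name, total in total_samples.most_common():
    own = self_samples[name]
    print(f"{name:10s} total={total / all_samples:4.0%} self={own / all_samples:4.0%}")
```

In this made-up data, compress has equal self and total time (a plateau), render has a large total but zero self time (a delegating parent), and check sits at the top of a deep stack that accounts for only a sliver of samples (a tower).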

Generating Flamegraphs by Language

Python: py-spy

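py-spy is a sampling profiler for Python that attaches to a running process without any code changes. It runs as a separate process (it is written in Rust) and reads the target’s memory from outside rather than executing inside the interpreter, so its overhead is low: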

# Profile a running process
py-spy record -o profile.svg --pid 12345

# Profile a command from start
py-spy record -o profile.svg -- python my_service.py

# Top-like real-time view
py-spy top --pid 12345

py-spy is GIL-aware. By default it records only threads that are actively running and skips idle ones; pass --idle to also include threads that are blocked on I/O or sleeping, or --gil to count only threads that hold the GIL. Comparing these views helps separate CPU-bound time in Python functions from time spent waiting in calls such as select() or recv().

For CPU profiling specifically, use --native to include C extension call stacks, which reveals whether your bottleneck is in Python code or in a C library:

py-spy record --native -o profile.svg --pid 12345

Java: async-profiler

async-profiler is the gold standard for JVM profiling. Many traditional JVM sampling profilers suffer from safepoint bias: they can only capture stacks at JVM safepoints, so work between safepoints is misattributed or missed. async-profiler avoids this by combining the HotSpot AsyncGetCallTrace API with the kernel’s perf_events mechanism to sample actual CPU usage:

# Attach to a running JVM for 30 seconds and write an HTML flamegraph
./asprof -d 30 -f profile.html <pid>

# CPU profiling, walking native stacks via frame pointers (sees native code too)
./asprof -d 30 -e cpu -f cpu_profile.html --cstack fp <pid>

# Allocation profiling (what's creating objects?)
./asprof -d 30 -e alloc -f alloc_profile.html <pid>

The allocation profiling mode is particularly valuable. Instead of sampling CPU time, it samples object allocations. The flamegraph then shows which code paths are creating the most objects — directly revealing code that’s likely to cause GC pressure.

Go: pprof

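Go has profiling built into the standard library. For an HTTP service, a single blank import registers the /debug/pprof/ handlers on http.DefaultServeMux (the commands below assume that mux is being served on localhost:6060):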

import _ "net/http/pprof"

Then collect and visualize profiles:

# CPU profile for 30 seconds
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

# Heap profile (current allocations)
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

# Goroutine profile (what are all goroutines doing?)
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine

The -http=:8080 flag opens an interactive web UI with flamegraph, graph, and source views. Go’s pprof is genuinely the most ergonomic profiling experience in any language.

Linux: perf + FlameGraph

For any language, or when you need to include kernel-level stacks, use Linux’s perf:

# Record CPU samples at 99 Hz for 30 seconds
perf record -F 99 -g -p <pid> -- sleep 30

# Convert to flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg

The stackcollapse-perf.pl and flamegraph.pl scripts are from Brendan Gregg’s FlameGraph repository. This pipeline gives you the most complete view: user-space stacks, kernel stacks, and the transitions between them. If you suspect the bottleneck is in a system call or kernel path, this is the most direct way to see it.

Reading a Flamegraph: A Walkthrough

Here’s a realistic scenario. You profile a web API endpoint that returns product recommendations. The average response time is 280ms. The team assumes the machine learning inference step is the bottleneck. The flamegraph tells a different story:

[100%] handle_request()
├── [8%]  parse_request()
├── [12%] load_user_profile()
│   ├── [3%]  db_query()
│   └── [9%]  deserialize_profile()
├── [14%] run_ml_inference()
│   ├── [11%] model.predict()
│   └── [3%]  feature_extraction()
├── [61%] serialize_response()
│   ├── [42%] json.dumps()
│   │   └── [38%] _encoder.encode()
│   │       └── [35%] _make_iterencode.<locals>._iterencode()
│   └── [19%] format_products()
│       ├── [12%] _resolve_image_urls()
│       └── [7%]  _compute_display_price()
└── [5%]  send_response()

The ML inference, the part the team worried about, is 14% of CPU time. serialize_response() is 61%. Roughly three-fifths of every request’s CPU budget is spent formatting the product list and converting Python objects to a JSON string.

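Drilling deeper: within serialization, 42% is json.dumps() itself. The standard library encoder walks the object graph and dispatches on each value’s type at runtime, and the _iterencode frames suggest much of that work is running in the pure-Python encoder rather than the C accelerator. The other 19% is format_products(), which includes resolving image URLs (12%), involving string concatenation and path manipulation for each of the 50 product images in the response.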

The optimization targets, ranked by expected payoff (a bar’s width weighted by how much room there is to shrink it):

  1. json.dumps(): 42%. Switch to orjson, which commonly benchmarks at around 5-10x the speed of the standard library because it serializes in compiled Rust code rather than walking objects in the interpreter. Or use Pydantic v2’s compiled serialization.

  2. _resolve_image_urls(): 12%. The image URL resolution is doing string formatting inside a loop. Pre-compute the URL template with the CDN prefix once, then apply per-image IDs. Or better: move URL resolution to the client and return image IDs only, reducing both computation and response size.

  3. deserialize_profile(): 9%. User profiles are deserialized from JSON on every request. If a user makes multiple requests in a session, cache the deserialized object.

  4. model.predict(): 11%. The ML inference is already well-optimized. Leave it alone unless all the above are fixed and it becomes the new dominant cost.
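
Before committing to a serializer swap, it is worth measuring it on a payload shaped like your own. The following is a rough benchmarking sketch, not a definitive comparison: the payload is invented to resemble the 50-product response, seconds_per_call is a hypothetical helper, and orjson is assumed to be installed separately (the script falls back to timing only the standard library if it is not).

```python
import json
import timeit

try:
    import orjson  # third-party; assumed installed via `pip install orjson`
except ImportError:
    orjson = None

# Hypothetical payload shaped like the walkthrough's response: 50 products.
payload = {
    "products": [
        {"id": i, "name": f"Product {i}", "price": 9.99 + i, "image_id": f"img-{i}"}
        for i in range(50)
    ]
}


def seconds_per_call(serialize, data, repeats=2000):
    """Average wall-clock seconds for one serialize(data) call."""
    return timeit.timeit(lambda: serialize(data), number=repeats) / repeats


results = {"json.dumps": seconds_per_call(json.dumps, payload)}
if orjson is not None:
    # Note: orjson.dumps returns bytes, not str.
    results["orjson.dumps"] = seconds_per_call(orjson.dumps, payload)

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
```

A micro-benchmark like this only tells you whether the swap is promising; the real confirmation is a second flamegraph taken under the same load.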

After implementing fixes 1 and 2, the new flamegraph shows serialize_response() dropping from 61% to 18% of samples, and overall response time drops from 280ms to 94ms. The same instances can now handle roughly 3x the traffic.

Memory Profiling

CPU flamegraphs show where time goes. Memory profilers show where bytes go. Different question, equally important.

Python: tracemalloc

import tracemalloc
tracemalloc.start()

# ... do work ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)

This shows which lines of code allocated the most memory that’s still alive. For leak detection, take two snapshots and compare:

snapshot1 = tracemalloc.take_snapshot()
# ... do more work ...
snapshot2 = tracemalloc.take_snapshot()
for stat in snapshot2.compare_to(snapshot1, 'lineno')[:10]:
    print(stat)
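
To see the two-snapshot comparison catch something, here is a self-contained sketch with a deliberately leaky handler. The names _leaky_cache, handle_request, and top_growth are hypothetical and exist only for this illustration.

```python
import tracemalloc

_leaky_cache = []  # hypothetical module-level list that is never cleared


def handle_request(request_id):
    """Hypothetical handler that accidentally retains about 10 KB per request."""
    _leaky_cache.append(bytearray(10_000))
    return request_id


def top_growth(limit=3):
    """Run a workload between two snapshots and return the biggest growers."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for request_id in range(1_000):
            handle_request(request_id)
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth():
        print(stat)
```

The top entry should point at the append line inside handle_request, with a size difference of roughly 10 MB, which is the kind of signal to look for when hunting a real leak.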

Java: jmap + Eclipse MAT

# Dump the heap
jmap -dump:live,format=b,file=heap.hprof <pid>

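Open heap.hprof in Eclipse Memory Analyzer (MAT). The “Leak Suspects” report automatically identifies objects that dominate the heap. The “Dominator Tree” shows the hierarchy of object retention: which object is keeping which other objects alive. Be aware that jmap pauses the JVM while it writes the dump, and the live option forces a full GC first, so take care when running it against a production instance.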

C/C++: heaptrack

heaptrack ./my_program
heaptrack_gui heaptrack.my_program.<pid>.gz

Heaptrack produces flamegraphs of allocation sites, showing both total allocations and peak memory. It’s the modern replacement for Valgrind’s massif tool, with dramatically lower overhead.

Development vs. Production Profiling

Profiling in development is safe but often misleading. Your development machine has different hardware, different load patterns, different data sizes, and different concurrency levels. A function that’s fast with 10 items might be quadratic and catastrophic with 10,000 items. Development profiling catches gross inefficiencies, but production is where the real bottlenecks emerge.

Production profiling introduces the observer effect: the act of measuring changes what you’re measuring. A CPU profiler that interrupts the process 99 times per second adds roughly 2-5% overhead. A memory profiler that hooks every allocation can add 10-50% overhead. A tracing profiler that records every function entry and exit can add 100%+ overhead.

The solution is sampling profilers with low overhead. py-spy, async-profiler, and Go’s pprof are all designed for production use. They sample rather than trace: they capture a statistical snapshot of behavior instead of recording every event. Accuracy improves with the number of samples (the error shrinks roughly with the square root of the sample count): 30 seconds of sampling at 99Hz gives you 2,970 samples per busy thread, which is sufficient to identify any function consuming more than ~1% of CPU time.
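
The sample-count arithmetic can be checked with a short calculation. This sketch treats each sample as an independent coin flip (a binomial model), which is a simplifying assumption rather than a property of any particular profiler; sample_count and relative_error are hypothetical helper names.

```python
import math


def sample_count(duration_s, hz):
    """Samples collected from one continuously running thread."""
    return int(duration_s * hz)


def relative_error(fraction, n_samples):
    """Approximate one-sigma relative error of a measured share (binomial model)."""
    std_error = math.sqrt(fraction * (1 - fraction) / n_samples)
    return std_error / fraction


n = sample_count(30, 99)  # 30 s at 99 Hz
for share in (0.01, 0.05, 0.42):
    hits = share * n
    print(
        f"true share {share:.0%}: ~{hits:.0f} samples, "
        f"about +/-{relative_error(share, n):.0%} relative error"
    )
```

Under this model a function at 1% of CPU lands in about 30 samples, enough to see it but with noticeable noise, while the wide bars you actually care about are measured far more precisely. Quadrupling the sampling duration only halves the error.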

Run production profiling for 30-60 seconds during representative load, download the flamegraph, analyze offline. The overhead is small, the insight is invaluable, and you’ll discover performance truths that no amount of code reading can reveal.

The flamegraph is not just a visualization. It’s a worldview. It says: “Don’t guess where the time goes. Measure.” And once you’ve measured, the path to optimization is clear — not always easy, but clear. The widest bar is where you start. Everything else is distraction.