Skip to main content
fast by design

Reading Execution Plans: Every Node Type Explained

12 min read Chapter 50 of 90

Reading Execution Plans: Every Node Type Explained

The main chapter covered the five most common node types. This section covers every node type you will encounter in production PostgreSQL execution plans, with content platform examples for each. When a query is slow, the execution plan tells you exactly why. But only if you can read it.

Scan Nodes

Scan nodes are the leaves of the execution plan tree. They produce rows from a single data source.

Seq Scan

A sequential scan reads every page of a table from first to last. PostgreSQL chooses Seq Scan when the query retrieves a significant fraction of the table or when no usable index exists.

Content platform example: count all published articles by month for an analytics dashboard.

EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('month', published_at) AS month, COUNT(*)
FROM articles
WHERE status = 'published'
GROUP BY date_trunc('month', published_at);
HashAggregate  (cost=20174.00..20176.50 rows=200 width=16)
  (actual time=112.441..112.512 rows=48 loops=1)
  Group Key: date_trunc('month', published_at)
  Batches: 1  Memory Usage: 40kB
  Buffers: shared hit=8924
  ->  Seq Scan on articles  (cost=0.00..19924.00 rows=487000 width=8)
        (actual time=0.012..52.112 rows=487000 loops=1)
        Filter: (status = 'published')
        Rows Removed by Filter: 13000
        Buffers: shared hit=8924
Planning Time: 0.121 ms
Execution Time: 112.581 ms

The Seq Scan is correct here. The query processes 487,000 of 500,000 rows (97.4%). An index scan would perform 487,000 random page reads, which is slower than 8,924 sequential page reads. The ratio that tips the balance depends on random_page_cost (default 4.0) and seq_page_cost (default 1.0). At the default 4:1 ratio, sequential scan wins when the query reads more than approximately 25% of the table.

Index Scan

An Index Scan traverses the B-tree index to find matching entries, then fetches the corresponding row from the heap (table) for each entry. Two I/O operations per row: one for the index leaf page and one for the heap page.

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title, slug, published_at
FROM articles
WHERE slug = 'performance-engineering-java';
Index Scan using idx_articles_slug on articles
  (cost=0.42..8.44 rows=1 width=82)
  (actual time=0.028..0.029 rows=1 loops=1)
  Index Cond: (slug = 'performance-engineering-java')
  Buffers: shared hit=4

Four buffer hits: B-tree root (1), internal node (1), leaf node (1), heap page (1). The B-tree for 500,000 rows with a slug column averaging 40 bytes has 3 levels, so traversal touches 3 index pages plus 1 heap page.

Index Only Scan

An Index Only Scan reads data exclusively from the index without touching the heap, but only when two conditions are met:

  1. All columns in the SELECT list and WHERE clause are in the index.
  2. The visibility map confirms the page is all-visible (no recent uncommitted changes).
-- Covering index: CREATE INDEX idx_articles_status_count
--   ON articles (status, view_count);

EXPLAIN (ANALYZE, BUFFERS)
SELECT view_count
FROM articles
WHERE status = 'published'
  AND view_count > 10000;
Index Only Scan using idx_articles_status_count on articles
  (cost=0.42..842.33 rows=4200 width=4)
  (actual time=0.031..1.842 rows=4187 loops=1)
  Index Cond: ((status = 'published') AND (view_count > 10000))
  Heap Fetches: 0
  Buffers: shared hit=38

Heap Fetches: 0 confirms no heap access. Only 38 buffer hits to return 4,187 rows. Compare this with an Index Scan, which would need approximately 4,187 additional heap page reads. The visibility map is maintained by VACUUM. If VACUUM has not run recently, Heap Fetches will be non-zero because PostgreSQL must check the heap to determine row visibility.

Bitmap Index Scan + Bitmap Heap Scan

This two-phase scan builds a bitmap of pages from the index, then reads those pages in physical order. It combines the benefit of index selectivity with sequential I/O.

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title, published_at
FROM articles
WHERE category_id = 7
  AND published_at > '2025-01-01';
Bitmap Heap Scan on articles  (cost=124.56..4842.33 rows=3200 width=82)
  (actual time=1.221..6.842 rows=3187 loops=1)
  Recheck Cond: (category_id = 7)
  Filter: (published_at > '2025-01-01')
  Rows Removed by Filter: 813
  Heap Blocks: exact=2841
  Buffers: shared hit=2867
  ->  Bitmap Index Scan on idx_articles_category_id
        (cost=0.00..123.76 rows=4000 width=0)
        (actual time=0.842..0.842 rows=4000 loops=1)
        Buffers: shared hit=26

Heap Blocks: exact=2841 means the bitmap tracked individual pages. When the bitmap exceeds work_mem, it becomes “lossy” and tracks blocks instead of tuples, requiring rechecking more rows:

Heap Blocks: exact=1024 lossy=1817

Lossy blocks cause additional heap reads because PostgreSQL must recheck every tuple in those blocks.

BitmapAnd and BitmapOr

When a query has multiple conditions, each with its own index, PostgreSQL can combine bitmaps:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title
FROM articles
WHERE category_id = 7
  AND author_id = 42;
Bitmap Heap Scan on articles  (cost=52.18..842.33 rows=120 width=82)
  (actual time=0.412..0.621 rows=118 loops=1)
  Recheck Cond: ((category_id = 7) AND (author_id = 42))
  Heap Blocks: exact=112
  Buffers: shared hit=142
  ->  BitmapAnd  (cost=52.18..52.18 rows=120 width=0)
        (actual time=0.312..0.312 rows=0 loops=1)
        Buffers: shared hit=30
        ->  Bitmap Index Scan on idx_articles_category_id
              (cost=0.00..23.42 rows=4000 width=0)
              (actual time=0.142..0.142 rows=4000 loops=1)
              Buffers: shared hit=14
        ->  Bitmap Index Scan on idx_articles_author_id
              (cost=0.00..28.51 rows=1500 width=0)
              (actual time=0.151..0.151 rows=1500 loops=1)
              Buffers: shared hit=16

The BitmapAnd intersects two bitmaps (4,000 entries AND 1,500 entries) to produce 120 matches. This is more efficient than a single index scan with a filter when both conditions are moderately selective but neither is highly selective on its own.

Join Nodes

Join nodes combine rows from two input sets. The join strategy has the largest impact on query performance for multi-table queries.

Nested Loop

Nested Loop iterates through the outer input and, for each row, scans the inner input. It is optimal when the outer side is small and the inner side has an index.

EXPLAIN (ANALYZE, BUFFERS)
SELECT a.title, c.name
FROM articles a
JOIN categories c ON a.category_id = c.id
WHERE a.slug = 'performance-engineering-java';
Nested Loop  (cost=0.71..16.75 rows=1 width=118)
  (actual time=0.042..0.044 rows=1 loops=1)
  Buffers: shared hit=7
  ->  Index Scan using idx_articles_slug on articles a
        (cost=0.42..8.44 rows=1 width=86)
        (actual time=0.028..0.029 rows=1 loops=1)
        Index Cond: (slug = 'performance-engineering-java')
        Buffers: shared hit=4
  ->  Index Scan using categories_pkey on categories c
        (cost=0.29..0.31 rows=1 width=36)
        (actual time=0.011..0.011 rows=1 loops=1)
        Index Cond: (id = a.category_id)
        Buffers: shared hit=3

One outer row, one inner lookup. Total: 7 buffer hits. Nested Loop becomes dangerous when the outer side returns many rows:

-- Nested Loop with large outer: articles join view_events
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.title, COUNT(ve.id)
FROM articles a
JOIN view_events ve ON ve.article_id = a.id
WHERE a.category_id = 7
GROUP BY a.title;
GroupAggregate  (cost=0.87..124847.33 rows=4000 width=90)
  (actual time=0.112..842.112 rows=4000 loops=1)
  Buffers: shared hit=412842
  ->  Nested Loop  (cost=0.87..118847.33 rows=1200000 width=86)
        (actual time=0.089..621.112 rows=1200000 loops=1)
        Buffers: shared hit=412842
        ->  Index Scan using idx_articles_category_id on articles a
              (cost=0.42..248.33 rows=4000 width=82)
              (actual time=0.031..2.112 rows=4000 loops=1)
              Buffers: shared hit=842
        ->  Index Scan using idx_view_events_article_id on view_events ve
              (cost=0.44..28.51 rows=300 width=12)
              (actual time=0.012..0.112 rows=300 loops=4000)
              Index Cond: (article_id = a.id)
              Buffers: shared hit=412000

loops=4000 on the inner scan means 4,000 index lookups, each reading approximately 103 buffer pages. The 412,000 buffer hits for the inner scan dominate. A Hash Join would be more efficient here because it reads the view_events table once rather than 4,000 times.

Hash Join

Hash Join builds a hash table from the smaller (inner) relation, then probes it with each row from the larger (outer) relation. Each relation is scanned exactly once.

EXPLAIN (ANALYZE, BUFFERS)
SELECT a.title, c.name
FROM articles a
JOIN categories c ON a.category_id = c.id
WHERE a.status = 'published';
Hash Join  (cost=295.00..20219.00 rows=487000 width=118)
  (actual time=2.891..98.221 rows=487000 loops=1)
  Hash Cond: (a.category_id = c.id)
  Buffers: shared hit=9042
  ->  Seq Scan on articles a  (cost=0.00..19924.00 rows=487000 width=86)
        (actual time=0.012..42.112 rows=487000 loops=1)
        Filter: (status = 'published')
        Rows Removed by Filter: 13000
        Buffers: shared hit=8924
  ->  Hash  (cost=170.00..170.00 rows=10000 width=36)
        (actual time=2.841..2.841 rows=10000 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 597kB
        Buffers: shared hit=118
        ->  Seq Scan on categories c  (cost=0.00..170.00 rows=10000 width=36)
              (actual time=0.005..0.921 rows=10000 loops=1)
              Buffers: shared hit=118

Batches: 1 means the entire hash table fits in memory. Memory Usage: 597kB is well within the default work_mem of 4 MB. When the hash table exceeds work_mem:

Buckets: 16384  Batches: 4  Memory Usage: 4097kB

Batches: 4 means the data is split into 4 partitions and joined in 4 passes, reading each relation multiple times. This causes significant I/O. Increasing work_mem to fit the hash table in one batch eliminates the multi-pass overhead.

Merge Join

Merge Join requires both inputs sorted on the join key. It merges the two sorted streams in a single pass, advancing cursors simultaneously.

-- Force merge join for demonstration
SET enable_hashjoin = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT a.title, c.name
FROM articles a
JOIN categories c ON a.category_id = c.id
WHERE a.status = 'published';
Merge Join  (cost=42348.33..55842.33 rows=487000 width=118)
  (actual time=142.112..198.442 rows=487000 loops=1)
  Merge Cond: (a.category_id = c.id)
  Buffers: shared hit=9042, temp read=4821 temp written=4821
  ->  Sort  (cost=42124.33..43342.83 rows=487000 width=86)
        (actual time=138.112..152.442 rows=487000 loops=1)
        Sort Key: a.category_id
        Sort Method: external merge  Disk: 38568kB
        Buffers: shared hit=8924, temp read=4821 temp written=4821
        ->  Seq Scan on articles a  (cost=0.00..19924.00 rows=487000 width=86)
              (actual time=0.012..42.112 rows=487000 loops=1)
              ...
  ->  Sort  (cost=819.39..844.39 rows=10000 width=36)
        (actual time=2.112..2.842 rows=10000 loops=1)
        Sort Key: c.id
        Sort Method: quicksort  Memory: 1024kB
        Buffers: shared hit=118
        ->  Seq Scan on categories c  ...

The articles sort spills to disk (external merge, Disk: 38568kB) because 487,000 rows exceed work_mem. The Hash Join (98 ms) beats the Merge Join (198 ms) because it avoids sorting. Merge Join wins when both inputs are already sorted (for example, from an Index Scan on the join column).

Processing Nodes

Sort

The Sort node orders rows. It appears in ORDER BY, MERGE JOIN, and some DISTINCT/GROUP BY plans.

Sort  (cost=42124.33..43342.83 rows=487000 width=86)
  (actual time=92.112..108.442 rows=487000 loops=1)
  Sort Key: published_at DESC
  Sort Method: quicksort  Memory: 48210kB
  Buffers: shared hit=8924

Sort Method: quicksort Memory: 48210kB means the sort completed in memory using 48 MB. If this exceeds work_mem, it switches to external merge Disk: which is dramatically slower due to disk I/O for the sort runs.

Aggregate and GroupAggregate

Aggregate computes a single value from all rows (e.g., COUNT(*), SUM()). GroupAggregate computes values per group.

EXPLAIN (ANALYZE, BUFFERS)
SELECT category_id, COUNT(*), AVG(view_count)
FROM articles
WHERE status = 'published'
GROUP BY category_id;
HashAggregate  (cost=20674.00..20774.00 rows=100 width=20)
  (actual time=98.112..98.142 rows=100 loops=1)
  Group Key: category_id
  Batches: 1  Memory Usage: 40kB
  Buffers: shared hit=8924
  ->  Seq Scan on articles  (cost=0.00..19924.00 rows=487000 width=12)
        (actual time=0.012..42.112 rows=487000 loops=1)
        Filter: (status = 'published')
        Rows Removed by Filter: 13000
        Buffers: shared hit=8924

HashAggregate builds a hash table with one entry per group. With 100 categories, the hash table is tiny (40 KB). When the number of groups is large (millions), PostgreSQL may choose GroupAggregate with a prior Sort, or HashAggregate may spill to disk with Batches > 1.

Materialize

The Materialize node stores the output of a subplan in memory (or on disk) to allow re-reading. It appears in Nested Loop joins when the inner side would otherwise be re-executed:

Nested Loop  (cost=0.00..124.33 rows=1000 width=48)
  (actual time=0.031..1.221 rows=1000 loops=1)
  ->  Seq Scan on small_table  (cost=0.00..1.10 rows=10 width=12)
        (actual time=0.008..0.012 rows=10 loops=1)
  ->  Materialize  (cost=0.00..2.50 rows=100 width=36)
        (actual time=0.001..0.012 rows=100 loops=10)
        ->  Seq Scan on lookup_table  (cost=0.00..2.00 rows=100 width=36)
              (actual time=0.004..0.028 rows=100 loops=1)

Without Materialize, the lookup_table Seq Scan would execute 10 times (loops=10). With Materialize, it executes once (loops=1 on the inner Seq Scan) and the cached result is replayed 10 times.

Limit

The Limit node stops execution after producing the required number of rows. It can dramatically reduce work when combined with sorted input:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title FROM articles
WHERE status = 'published'
ORDER BY published_at DESC LIMIT 20;
Limit  (cost=0.43..4.87 rows=20 width=82)
  (actual time=0.041..0.089 rows=20 loops=1)
  Buffers: shared hit=4
  ->  Index Scan Backward using idx_articles_published_at on articles a
        (cost=0.43..35124.00 rows=487000 width=82)
        (actual time=0.039..0.084 rows=20 loops=1)
        Filter: (status = 'published')
        Buffers: shared hit=4

The Index Scan has an estimated cost of 35,124 for all 487,000 rows, but Limit stops it after 20 rows. The actual cost is trivial: 4 buffer hits. Without the index on published_at, the database would need to sort all 487,000 rows before returning the top 20.

Reading Plans: The Diagnostic Process

When diagnosing a slow query:

  1. Find the highest-cost node: The node with the largest gap between estimated and actual time is the bottleneck.
  2. Check row estimates: If rows=1 estimated but rows=100000 actual, the planner made a bad choice. Run ANALYZE on the table to update statistics.
  3. Check for Seq Scan on large tables with small result sets: This usually means a missing index.
  4. Check for disk spills: Sort Method: external merge Disk: or Batches: N (N > 1) mean work_mem is insufficient for this query.
  5. Check Buffers: shared read (disk) vs shared hit (cache). High shared read means the working set exceeds shared_buffers.
  6. Check loops: loops=10000 on an inner scan means that scan executes 10,000 times. Multiply the per-loop time by loops to get the real cost.
  7. Check Rows Removed by Filter: A large number means the scan reads data the query does not need. A more selective index would reduce this.

These node types and their interactions form the vocabulary for every query performance conversation. The next section covers the ORM layer that generates the queries producing these plans.