Skip to main content
fast by design

MVCC Mechanics and Visibility

11 min read Chapter 62 of 90

MVCC Mechanics and Visibility

The main chapter introduced MVCC at a high level: each UPDATE creates a new tuple, the old one becomes dead. This section goes deeper into how PostgreSQL determines which tuples are visible to which transactions, the cost of those visibility checks, and the data structures that accelerate them.

Tuple Header Layout

Every heap tuple (row stored on disk) starts with a 23-byte header:

Offset  Size  Field           Description
0       4     t_xmin          Transaction ID that inserted this tuple
4       4     t_xmax          Transaction ID that deleted/updated this tuple (0 if live)
8       4     t_cid           Command ID within the inserting transaction
12      4     t_ctid.ip_blkid Block number of next tuple version
16      2     t_ctid.ip_posid Offset within block of next tuple version
18      2     t_infomask2     Number of attributes + flags
20      2     t_infomask      Visibility flags
22      1     t_hoff          Offset to user data (header size including padding)

The t_infomask field contains critical performance flags:

Bit     Name                        Meaning
0x0001  HEAP_HASNULL               Has null attributes
0x0002  HEAP_HASVARWIDTH           Has variable-width attributes
0x0010  HEAP_XMIN_COMMITTED        xmin transaction is known committed
0x0020  HEAP_XMIN_INVALID          xmin transaction is known aborted
0x0040  HEAP_XMAX_COMMITTED        xmax transaction is known committed
0x0080  HEAP_XMAX_INVALID          xmax transaction is known aborted/unused
0x2000  HEAP_HOT_UPDATED           Tuple was HOT-updated
0x4000  HEAP_ONLY_TUPLE            Tuple is a HOT chain member (no index entry)
0x8000  HEAP_UPDATED               Tuple was created by UPDATE (not INSERT)

Hint Bits: Avoiding Repeated CLOG Lookups

When a tuple is first accessed after a transaction commits, PostgreSQL must determine if t_xmin and t_xmax are committed. This requires a lookup in pg_xact (the commit log, stored in pg_xact/ directory):

First access to tuple after xmin commits:
1. Read t_xmin from tuple header
2. Check pg_xact for transaction status: COMMITTED, ABORTED, or IN_PROGRESS
3. If COMMITTED: set HEAP_XMIN_COMMITTED hint bit in t_infomask
4. Write the updated page (dirty)

Second access to same tuple:
1. Read t_xmin from tuple header
2. Check HEAP_XMIN_COMMITTED bit in t_infomask: set!
3. Skip pg_xact lookup (fast path)

The first access after a commit is expensive (CLOG lookup). Subsequent accesses are fast (bit check). This is why the first sequential scan after a bulk load is slower than subsequent scans: it sets hint bits for every tuple.

Measuring the hint bit effect:

-- Load 1 million rows, then scan twice
COPY articles FROM '/tmp/articles.csv' WITH (FORMAT csv);

-- First scan: sets hint bits
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM articles;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=47847.00..47847.01 rows=1 width=8)
            (actual time=847.2..847.2 rows=1 loops=1)
   ->  Seq Scan on articles  (cost=0.00..45347.00 rows=1000000 width=0)
                             (actual time=0.02..612.8 rows=1000000 loops=1)
         Buffers: shared hit=18472 read=24891 dirtied=24891
 Planning Time: 0.08 ms
 Execution Time: 847.3 ms

Note dirtied=24891: the scan dirtied 24,891 pages by setting hint bits. These dirty pages must eventually be written to disk by the background writer.

-- Second scan: hint bits already set
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM articles;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=47847.00..47847.01 rows=1 width=8)
            (actual time=312.4..312.4 rows=1 loops=1)
   ->  Seq Scan on articles  (cost=0.00..45347.00 rows=1000000 width=0)
                             (actual time=0.01..218.7 rows=1000000 loops=1)
         Buffers: shared hit=43363
 Planning Time: 0.06 ms
 Execution Time: 312.5 ms

No reads (all in shared_buffers from first scan), no dirty pages. Execution time dropped from 847ms to 312ms. The 2.7x improvement comes from skipping CLOG lookups and reading warm buffers.

Snapshot Isolation: How Transactions See Data

A snapshot in PostgreSQL is:

typedef struct SnapshotData {
    TransactionId xmin;       // lowest still-running XID at snapshot time
    TransactionId xmax;       // first unassigned XID at snapshot time
    TransactionId *xip;       // array of in-progress XIDs between xmin and xmax
    uint32        xcnt;       // count of in-progress XIDs
} SnapshotData;

A tuple with t_xmin = X is visible if:

  1. X < snapshot.xmin (committed before any active transaction)
  2. OR X is between snapshot.xmin and snapshot.xmax AND X is NOT in snapshot.xip (committed, not in-progress)
  3. AND t_xmax is either 0 or not-yet-committed from this snapshot’s perspective

The cost of visibility checking scales with xcnt (number of concurrent transactions). With 200 concurrent transactions, each visibility check must search a 200-element array. PostgreSQL uses a sorted array with binary search, so the cost is O(log N).

Measuring visibility check overhead under concurrency:

-- 1 concurrent transaction: scan 1M rows
-- 200 concurrent transactions: scan same 1M rows

-- Simulated via pgbench with idle transactions holding snapshots
Concurrent Transactions | Seq Scan Time (1M rows) | Overhead
1                       |          312 ms         |  baseline
10                      |          314 ms         |   +0.6%
50                      |          318 ms         |   +1.9%
200                     |          329 ms         |   +5.4%
500                     |          352 ms         |   +12.8%

The overhead is measurable but modest for typical concurrency. At 500 concurrent transactions, each tuple visibility check adds approximately 12 nanoseconds (binary search through 500-element array).

The Commit Log (pg_xact)

Transaction status is stored in pg_xact/ directory as fixed-size files. Each transaction occupies 2 bits:

Status values (2 bits per transaction):
00 = IN_PROGRESS
01 = COMMITTED
10 = ABORTED
11 = SUB_COMMITTED (subtransaction committed, parent may still be in-progress)

Each pg_xact page (8KB) stores status for 32,768 transactions. PostgreSQL caches recently-accessed CLOG pages in shared memory (128 pages by default = status for 4 million transactions).

When the CLOG cache misses:

-- Monitor CLOG access
SELECT * FROM pg_stat_slru WHERE name = 'Xact';
 name | blks_zeroed | blks_hit | blks_read | blks_written | blks_exists | flushes | truncates
------+-------------+----------+-----------+--------------+-------------+---------+----------
 Xact |          12 |  2847291 |        47 |           12 |         847 |      12 |         2

blks_hit = 2,847,291 vs blks_read = 47: the CLOG cache hit rate is 99.998%. Cache misses happen only for very old transactions that have aged out of the cache. This confirms that hint bits handle the common case (recent transactions) and CLOG lookups handle the rare case (first access or very old XIDs).

Dead Tuple Visibility and Sequential Scans

Dead tuples remain on heap pages until vacuum removes them. During a sequential scan, every dead tuple is:

  1. Read from disk/buffer (I/O cost)
  2. Checked for visibility (CPU cost)
  3. Skipped (no result contribution)

The cost of dead tuples on a sequential scan:

-- Table with 100,000 live rows and 400,000 dead rows (80% dead)
-- Compare scan performance

-- Before vacuum:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM article_view_counts;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=8472.00..8472.01 rows=1 width=8)
            (actual time=142.8..142.8 rows=1 loops=1)
   ->  Seq Scan on article_view_counts  (cost=0.00..7972.00 rows=100000 width=0)
                                        (actual time=0.02..128.4 rows=100000 loops=1)
         Buffers: shared hit=5847
 Planning Time: 0.04 ms
 Execution Time: 142.9 ms
-- After vacuum removes dead tuples:
VACUUM article_view_counts;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM article_view_counts;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=1847.00..1847.01 rows=1 width=8)
            (actual time=28.4..28.4 rows=1 loops=1)
   ->  Seq Scan on article_view_counts  (cost=0.00..1597.00 rows=100000 width=0)
                                        (actual time=0.01..21.7 rows=100000 loops=1)
         Buffers: shared hit=1284
 Planning Time: 0.03 ms
 Execution Time: 28.5 ms

5,847 buffers vs 1,284 buffers. 142.9ms vs 28.5ms. Dead tuples made the scan read 4.6x more pages. Note that regular VACUUM does not shrink the table file (pages are marked as reusable, not returned to OS), but it does allow PostgreSQL to skip empty pages in future scans.

The Visibility Map

The visibility map tracks which pages contain only tuples visible to all transactions. One bit per page:

Visibility Map bit = 1: all tuples on this page are visible to everyone
Visibility Map bit = 0: page may contain dead tuples or tuples from in-progress transactions

The visibility map serves two purposes:

1. Index Only Scans

An Index Only Scan returns data from the index without visiting the heap, but only if the visibility map confirms all tuples on the referenced page are visible:

-- With visibility map bit set for the page:
-- Index Only Scan: read index leaf page -> return data (no heap access)

-- With visibility map bit not set:
-- Index Only Scan: read index leaf page -> read heap page (visibility check) -> return data
-- Monitor visibility map effectiveness
SELECT
    relname,
    heap_blks_read + heap_blks_hit AS heap_fetches,
    idx_blks_read + idx_blks_hit AS idx_fetches
FROM pg_statio_user_tables
WHERE relname = 'articles';

2. Vacuum Skipping

Vacuum skips pages whose visibility map bit is set (no dead tuples possible on that page):

-- Aggressive vacuum still processes all pages:
VACUUM (VERBOSE) article_view_counts;
INFO:  vacuuming "public.article_view_counts"
INFO:  scanned index "article_view_counts_pkey" to remove 1847 row versions
INFO:  table "article_view_counts": removed 1847 dead item identifiers in 42 pages
INFO:  table "article_view_counts": found 1847 removable, 100000 nonremovable row versions
       in 1284 out of 1284 pages
DETAIL:  0 dead row versions cannot be removed yet.
         CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s

Vacuum processed 1,284 pages but only modified 42 (those containing dead tuples). The visibility map told vacuum to skip the other 1,242 pages.

Tuple Freezing

Transaction IDs are 32-bit unsigned integers, wrapping around at 2^32 (approximately 4 billion). To prevent transaction ID wraparound, vacuum “freezes” old tuples by setting their t_xmin to a special value (FrozenTransactionId = 2) indicating they are visible to all transactions regardless of XID comparison:

SHOW vacuum_freeze_min_age;    -- 50,000,000 (tuples older than 50M transactions get frozen)
SHOW autovacuum_freeze_max_age; -- 200,000,000 (force vacuum if any tuple is this old)

Monitoring freeze progress:

SELECT
    relname,
    age(relfrozenxid) AS xid_age,
    pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
       relname         | xid_age  |   size
-----------------------+----------+----------
 article_view_counts   | 47829184 |  142 MB
 view_events           | 28471924 |  1.2 GB
 articles              |  8472918 |  4.8 GB

If xid_age approaches autovacuum_freeze_max_age (200 million), PostgreSQL forces an aggressive anti-wraparound vacuum that scans the entire table. For the 4.8 GB articles table, this freeze vacuum takes minutes and causes I/O contention. Keeping xid_age well below the threshold via regular vacuum prevents these emergency freezes.

HOT Chain Mechanics

When conditions for a HOT update are met (no indexed column modified, free space on same page):

Before UPDATE (page has free space):
  Offset 3: [xmin=100, xmax=0, ctid=(5,3), article_id=42, count=100]
  
After UPDATE (HOT):
  Offset 3: [xmin=100, xmax=200, ctid=(5,4), flags=HOT_UPDATED, article_id=42, count=100]
  Offset 4: [xmin=200, xmax=0, ctid=(5,4), flags=HEAP_ONLY, article_id=42, count=101]

The index still points to offset 3. When an index scan finds offset 3, PostgreSQL follows the HOT chain:

  1. Read tuple at offset 3: xmax is set, not visible
  2. Follow ctid to offset 4: visible, return this tuple

This chain following is fast (same page, no additional I/O) but adds CPU cost proportional to chain length.

HOT Chain Pruning

PostgreSQL prunes HOT chains during page access (when it visits a page for any reason):

-- After HOT chain pruning:
  Offset 3: [DEAD - space reclaimed for reuse on this page]
  Offset 4: [xmin=200, xmax=0, ctid=(5,4), article_id=42, count=101]
  -- Line pointer 3 redirects to offset 4

Pruning happens without vacuum, triggered by any page access when dead HOT tuples are present. This is why high-frequency UPDATE tables benefit from fillfactor < 100: it keeps HOT chains short by ensuring space for new versions on the same page, and frequent access ensures prompt pruning.

Performance Implications Summary

OperationCostNotes
Visibility check (hint bits set)~5 nsInteger comparison + bit check
Visibility check (CLOG lookup)~200 nsFirst access after commit
Sequential scan, dead tuple skipSame as live tuple scanRead page + check visibility
HOT chain follow~50 ns per hopSame page, no I/O
Visibility map check~10 nsBitmap lookup
Index Only Scan (VM bit set)No heap accessFast path
Index Only Scan (VM bit clear)Heap page accessFalls back to index scan cost

For the content platform with 1000 updates/sec on counter tables:

  • Without HOT + aggressive vacuum: 1000 dead tuples/sec, table bloats indefinitely
  • With HOT (fillfactor=70) + aggressive vacuum: dead tuples pruned on access, vacuum catches remainder within 1 second, table stays at 1.02x ideal size