Cache Lines and the Memory Wall

The Speed of Forgetting

Your CPU can execute billions of instructions per second. It cannot read from main memory at anything close to that rate. This gap — the memory wall — has been the dominant performance bottleneck in computing since the 1990s, and it grows wider every year.

Here are the actual access latencies for a modern x86-64 processor:

CPU cache hierarchy pyramid showing L1, L2, L3 cache, RAM, and disk latency comparison with 64-byte cache line detail

CPU memory hierarchy pyramid: L1 cache (~1–2 ns, 32–64 KB per core) is the fastest but smallest — a cache miss here costs 4–10× more than a hit. L2 cache (~3–5 ns, 256 KB per core) acts as the primary spillover. L3 cache (~10–20 ns, 8–32 MB shared across cores) is where cross-core data sharing costs appear. Main memory RAM (~50–100 ns, DDR4/DDR5) is 50× slower than L1. The 64-byte cache line means the CPU always loads 64 contiguous bytes at once — so accessing one element of a struct pre-loads its neighbors (spatial locality). Code that traverses arrays linearly is cache-friendly; code that follows pointer chains is not. This single diagram explains why Python’s list of objects is structurally slower than a C array of primitives: each object access is a pointer dereference that may cause a separate cache miss.

Level	Typical size	Latency	Relative speed
L1 cache	32-64 KB per core	~1 ns	1x
L2 cache	256 KB-1 MB per core	~4 ns	4x slower
L3 cache	8-64 MB shared	~10 ns	10x slower
Main memory (RAM)	16-256 GB	~100 ns	100x slower
SSD	256 GB-4 TB	~10,000 ns	10,000x slower

An L1 cache hit takes about the same time as a single integer addition. A cache miss that goes to RAM takes the same time as one hundred additions. If your code misses the cache on every memory access, you’re running at 1% of your CPU’s potential throughput.

The CPU compensates with a cache hierarchy — small, fast memory layers between the processor and RAM. Data moves between these layers in fixed-size chunks called cache lines.

Cache Lines: 64 Bytes of Consequence

The CPU never reads a single byte from memory. It reads a cache line — 64 bytes on virtually all modern x86 and ARM processors.

When you access array[0], the CPU fetches bytes 0 through 63 into L1 cache. If array[1] falls within those 64 bytes — and for a 4-byte int array, it does — the next access is free. You get 16 consecutive integer accesses for the price of one cache miss.

This is why sequential array iteration is outrageously fast. The hardware prefetcher detects the linear access pattern and starts loading the next cache line before you ask for it. By the time you reach the end of one cache line, the next is already in L1.

But when you follow a pointer to a heap-allocated object at an arbitrary address, the prefetcher can’t help. Each pointer chase is likely a cache miss. 64 bytes are fetched, you use 8 of them (the pointer or value you wanted), and the rest are wasted.

Cache lines create a subtle multi-threading hazard. If two threads write to different variables that happen to live on the same 64-byte cache line, the CPU’s cache coherency protocol forces each core to invalidate and reload that cache line on every write by the other thread. This is called false sharing — the variables are logically independent, but physically coupled.

#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define ITERATIONS 100000000

// BAD: Both counters on the same cache line
struct shared_bad {
    long counter_a;  // Bytes 0-7
    long counter_b;  // Bytes 8-15 — same 64-byte cache line
};

// GOOD: Counters on separate cache lines
struct shared_good {
    long counter_a;
    char _pad[56];   // Push counter_b to the next cache line
    long counter_b;
};

void *increment_a(void *arg) {
    long *counter = (long *)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        (*counter)++;
    }
    return NULL;
}

void *increment_b(void *arg) {
    long *counter = (long *)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        (*counter)++;
    }
    return NULL;
}

double benchmark(long *counter_a, long *counter_b) {
    pthread_t t1, t2;
    struct timespec start, end;

    *counter_a = 0;
    *counter_b = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, increment_a, counter_a);
    pthread_create(&t2, NULL, increment_b, counter_b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) +
           (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main() {
    struct shared_bad bad = {0, 0};
    struct shared_good good = {0, {0}, 0};

    double t_bad = benchmark(&bad.counter_a, &bad.counter_b);
    double t_good = benchmark(&good.counter_a, &good.counter_b);

    printf("False sharing (same cache line):      %.3fs\n", t_bad);
    printf("No false sharing (separate lines):    %.3fs\n", t_good);
    printf("Speedup from fixing false sharing:    %.1fx\n", t_bad / t_good);
    return 0;
}

Compile and run:

gcc -O2 -pthread -o false_sharing false_sharing.c
./false_sharing

Typical results:

False sharing (same cache line):      1.847s
No false sharing (separate lines):    0.312s
Speedup from fixing false sharing:    5.9x

Adding 56 bytes of padding — wasting memory deliberately — gives you a 6x speedup. The cache coherency protocol (MESI or MOESI on x86) forces each core to broadcast invalidation messages and reload the cache line from a shared level of cache on every write. With separate cache lines, each core writes independently to its own cached copy.

Java’s @Contended annotation and C#‘s [FieldOffset] exist specifically to solve this problem. The fact that language designers added dedicated features for cache line alignment tells you how real this issue is.

Array of Structs vs Struct of Arrays

Data layout determines cache efficiency. Consider a particle simulation with a million particles:

Array of Structs (AoS): The intuitive layout.

struct Particle {
    float x, y, z;      // Position: 12 bytes
    float vx, vy, vz;   // Velocity: 12 bytes
    float mass;          // Mass: 4 bytes
    int type;            // Type: 4 bytes
};                       // Total: 32 bytes per particle

struct Particle particles[1000000];

Struct of Arrays (SoA): The cache-friendly layout.

struct ParticleSystem {
    float x[1000000];
    float y[1000000];
    float z[1000000];
    float vx[1000000];
    float vy[1000000];
    float vz[1000000];
    float mass[1000000];
    int type[1000000];
};

struct ParticleSystem particles;

If your update function only touches position and velocity — x += vx * dt — watch what happens with each layout:

AoS: Read particle[0] — load 32 bytes into cache. Use x (4 bytes) and vx (4 bytes). The other 24 bytes (y, z, vy, vz, mass, type) are wasted cache space. Move to particle[1] — maybe a cache miss, since each particle is 32 bytes and only two fit per cache line.

SoA: Read x[0] — load 64 bytes into cache. That’s x[0] through x[15] — sixteen x-values ready to process. Then vx[0] through vx[15]. Every byte fetched is used. The SIMD unit can process four or eight values per instruction.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4000000

// Array of Structs
typedef struct { float x, y, z, vx, vy, vz; float mass; int type; } ParticleAoS;

// Struct of Arrays
typedef struct {
    float *x, *y, *z, *vx, *vy, *vz, *mass;
    int *type;
} ParticleSoA;

void update_aos(ParticleAoS *p, int n, float dt) {
    for (int i = 0; i < n; i++) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}

void update_soa(ParticleSoA *p, int n, float dt) {
    for (int i = 0; i < n; i++) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
    }
}

int main() {
    float dt = 0.016f;

    ParticleAoS *aos = calloc(N, sizeof(ParticleAoS));
    for (int i = 0; i < N; i++) {
        aos[i].vx = 1.0f; aos[i].vy = 2.0f; aos[i].vz = 3.0f;
    }

    ParticleSoA soa;
    soa.x = calloc(N, sizeof(float)); soa.y = calloc(N, sizeof(float));
    soa.z = calloc(N, sizeof(float)); soa.vx = calloc(N, sizeof(float));
    soa.vy = calloc(N, sizeof(float)); soa.vz = calloc(N, sizeof(float));
    soa.mass = calloc(N, sizeof(float)); soa.type = calloc(N, sizeof(int));
    for (int i = 0; i < N; i++) {
        soa.vx[i] = 1.0f; soa.vy[i] = 2.0f; soa.vz[i] = 3.0f;
    }

    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int iter = 0; iter < 100; iter++) update_aos(aos, N, dt);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double t_aos = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int iter = 0; iter < 100; iter++) update_soa(&soa, N, dt);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double t_soa = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("AoS: %.3fs\n", t_aos);
    printf("SoA: %.3fs\n", t_soa);
    printf("SoA speedup: %.1fx\n", t_aos / t_soa);

    free(aos);
    free(soa.x); free(soa.y); free(soa.z);
    free(soa.vx); free(soa.vy); free(soa.vz);
    free(soa.mass); free(soa.type);
    return 0;
}

Typical results with gcc -O2:

AoS: 1.142s
SoA: 0.387s
SoA speedup: 2.9x

A 3x speedup from rearranging data layout. The algorithm is identical. The operations are identical. Only the memory layout changed. The SoA version is faster because the CPU fetches only the data it needs, packs more useful values per cache line, and enables SIMD auto-vectorization.

Game engines, physics simulators, and database column stores all use SoA layouts. Entity Component Systems (ECS) — the architecture behind Unity’s DOTS and many custom engines — exist specifically to get data into SoA-friendly layouts.

Why Python Is Structurally Cache-Hostile

A Python list of integers:

numbers = [1, 2, 3, 4, 5]

In memory, numbers is a contiguous array of PyObject* pointers — 8 bytes each, nicely packed. But each pointer leads to a PyLongObject scattered somewhere on the heap. Iterating the list means:

Load pointer from the list’s internal array (cache-friendly — contiguous)
Follow pointer to the PyObject (cache miss — heap-scattered)
Read the type tag to confirm it’s an integer (in the same cache line as the object, usually)
Read the actual integer value (same object, same cache line)
Repeat

Step 2 is the killer. Every element is a pointer chase to a random heap location. The prefetcher can handle the contiguous pointer array, but it cannot predict where each PyObject lives.

This is why NumPy exists. A NumPy array of integers:

import numpy as np
numbers = np.array([1, 2, 3, 4, 5], dtype=np.int64)

This stores the raw 8-byte integers contiguously — no PyObject wrapper, no pointer indirection. Iterating in C (inside NumPy’s compiled code) streams through cache lines at full memory bandwidth. The performance difference between sum(python_list) and np.sum(numpy_array) for large arrays is 10-50x, and most of that gap is cache behavior.

Seeing Cache Misses: `perf stat`

On Linux, the perf tool exposes CPU hardware performance counters. You can see exactly how many cache misses your program incurs:

perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./array_vs_list

Sample output for the array iteration from the parent chapter’s benchmark:

 Performance counter stats for './array_sum':

       312,841,229      cache-references
         1,247,382      cache-misses              #    0.40 % of all cache refs
     2,501,847,123      L1-dcache-loads
        12,383,412      L1-dcache-load-misses     #    0.49 % of all L1-dcache hits

       0.008127451 seconds time elapsed

And for the shuffled linked list:

 Performance counter stats for './list_sum':

       287,493,102      cache-references
       141,283,847      cache-misses              #   49.14 % of all cache refs
     2,487,291,038      L1-dcache-loads
       198,472,918      L1-dcache-load-misses     #    7.98 % of all L1-dcache hits

       0.183294712 seconds time elapsed

The array has a 0.4% cache miss rate. The linked list has a 49% cache miss rate. Same number of elements, same operation, same CPU. The linked list misses the cache on every other access, and each miss costs ~100 nanoseconds of stalling.

These numbers are not theoretical. They come from hardware counters on the actual CPU. perf stat is the closest thing you have to a speedometer for memory access patterns.

When Cache Locality Matters — and When It Doesn’t

Not every program is bottlenecked by cache behavior. Cache locality matters when:

You process large datasets sequentially (data pipelines, numerical code, game loops)
Your inner loop accesses more data than fits in L1/L2 cache
Your workload is CPU-bound rather than I/O-bound
You’re in a latency-sensitive path (trading systems, real-time audio, game ticks)

Cache locality does not matter when:

Your bottleneck is network I/O (most web services most of the time)
Your dataset fits in L2 cache (under ~256 KB of working data)
Your code is dominated by branch mispredictions or other pipeline stalls
You’re writing a CRUD app that spends 95% of its time waiting for Postgres

The mistake is not failing to optimize cache behavior in a Django view. The mistake is not knowing when cache behavior is your bottleneck, and therefore being unable to diagnose or fix the problem when it is. You can’t profile what you can’t name. You can’t fix what you can’t see.

Most engineers never run perf stat. Most engineers can’t explain why an array of values is faster than a list of pointers. Most engineers have no mental model for how the data they manipulate travels between RAM and the CPU. The abstraction told them not to worry about it, and they listened.