Skip to main content
data systems from the ground up

Linear Scan, O(n) Reads, and the Moment Your Database Stops Being Useful

5 min read Chapter 3 of 36

Linear Scan, O(n) Reads, and the Moment Your Database Stops Being Useful

The Black Box

The warehouse dashboard calls GET /api/packages/PKG-40291/status. The application queries the store. In development with 500 test records, the response arrives in 2ms. In production with 20 million records, the same query takes 8 seconds. The code did not change. The algorithm did not change. The data volume changed, and the linear scan that was invisible at 500 records became the dominant cost.

The Mechanism

A linear scan reads every record in the file from beginning to end. The cost is proportional to the total number of records, not the number of matching records. For a file with $n$ records, finding the latest status of one package requires reading all $n$ records, because the file is append-only and the latest entry could be the last line.

The actual time depends on where the file lives in the memory hierarchy:

In the page cache (file fits in RAM): The scan reads from memory at 10-20 GB/s. A 20 million line file where each line averages 80 bytes is approximately 1.6 GB. Reading 1.6 GB from memory takes roughly 100-160ms. Adding per-line string splitting and comparison brings the total to 2-4 seconds.

Not in the page cache (file exceeds available RAM or was recently evicted): The scan triggers disk I/O. Reading 1.6 GB sequentially from an NVMe SSD takes approximately 500ms. From a SATA SSD, 3 seconds. From an HDD, 9 seconds. Add per-line processing on top.

This is why the same query that runs in 2ms on the developer’s laptop takes 8 seconds in production. The developer’s file has 500 records. The scan is invisible. Production has 20 million records. The scan dominates.

The Observable Consequence

The logistics platform dashboard shows real-time package status. The operations team loads it 200 times per hour during peak operations. Each page load queries the status of 25 packages.

At 1 million total records, each query scans the entire file in approximately 200ms. The dashboard loads in under a second. Acceptable.

At 10 million records, each query takes 1.5 seconds. Twenty-five queries per page load, if executed sequentially, take 37 seconds. Even with concurrent execution, the file is being scanned 25 times. If the file is in the page cache, concurrent scans compete for CPU. If it is not, they compete for I/O bandwidth.

At 20 million records, the dashboard is unusable.

-- Concept: the SQL equivalent of the linear scan problem
-- This query on a table with no index on package_id
-- performs a sequential scan (Seq Scan) of the entire table

EXPLAIN ANALYZE
SELECT status, timestamp
FROM package_events
WHERE package_id = 'PKG-40291'
ORDER BY timestamp DESC
LIMIT 1;

-- Without index:
-- Seq Scan on package_events (cost=0.00..425891.00 rows=14 width=48)
--   Filter: (package_id = 'PKG-40291')
--   Rows Removed by Filter: 19999986
-- Planning Time: 0.08 ms
-- Execution Time: 3842.15 ms

-- The database read 20 million rows to find 14.

PostgreSQL’s EXPLAIN ANALYZE makes the linear scan visible. The Rows Removed by Filter: 19999986 line is the cost. The database evaluated the filter condition on 20 million rows to return 14. The 3.8 seconds of execution time is almost entirely I/O and CPU spent reading and discarding rows that do not match.

The Code

A benchmark that measures the scan at different file sizes makes the growth curve concrete:

// Concept: measuring linear scan degradation as file size grows
// This is not a production benchmark. It isolates the scan cost.
void benchmarkScan(Path dbFile) throws IOException {
    int[] sizes = {1_000, 100_000, 1_000_000, 10_000_000};
    String targetId = "PKG-00042";

    for (int size : sizes) {
        // Write file with 'size' records, target ID appears at random positions
        generateTestFile(dbFile, size, targetId);

        long start = System.nanoTime();
        String result = getLatestStatus(dbFile, targetId);
        long elapsed = System.nanoTime() - start;

        System.out.printf("Records: %,10d | Scan time: %,8d µs | Result: %s%n",
                size, elapsed / 1000, result);
    }
}

// Expected output (NVMe SSD, file in page cache):
// Records:      1,000 | Scan time:      280 µs | Result: IN_TRANSIT
// Records:    100,000 | Scan time:   18,400 µs | Result: IN_TRANSIT
// Records:  1,000,000 | Scan time:  192,000 µs | Result: IN_TRANSIT
// Records: 10,000,000 | Scan time: 1,840,000 µs | Result: IN_TRANSIT

The scan time grows linearly. At 10 million records, 1.84 seconds per query. Multiply by the number of concurrent queries the dashboard makes, and the system is unresponsive.

The Decision Rule

A linear scan is acceptable when the dataset is small enough that the scan completes within your latency budget. For the logistics platform, the budget is 100ms per query. At 80 bytes per record, that budget allows approximately 500,000 records before the scan exceeds 100ms with the file in the page cache, and far fewer when it is not.

If your dataset will grow past that threshold, and production datasets always grow, you need an index. The index trades write overhead and disk space for sub-linear read time. Chapter 2 covers what that tradeoff costs and when each index type is the right choice.

The append-only log does not go away. The index sits alongside it. Writes still append to the log. Reads consult the index instead of scanning the log. The log is the source of truth. The index is a derived data structure that can be rebuilt from the log at any time. This distinction matters for crash recovery (Chapter 3) and replication (Chapter 4).