Searching at Scale: Binary Search, Index Structures, Crossover Points
Searching at Scale: Binary Search, Index Structures, Crossover Points
The main chapter showed that HashMap beats binary search for single-key lookup at all sizes. This section expands the analysis: range queries, prefix searches, multi-key lookups, and the memory overhead that changes the answer when data lives in constrained caches.
Binary Search: Cache-Friendly but Logarithmic
Binary search on a sorted int[] accesses O(log n) elements. For 1,000,000 elements, that is ~20 comparisons. Each comparison accesses a different cache line. The first comparison hits the middle (index 500,000), the second hits quarter points (250,000 or 750,000), and so on. The first 3-4 levels of the search likely miss L1 cache (32KB). The remaining levels may hit L2 (256KB) or L3 (6-30MB depending on CPU).
For a sorted int[] of 1,000,000 elements (4MB total), the array fits in L3 cache on most server CPUs. After the first lookup warms the relevant cache lines, subsequent lookups to nearby keys benefit from spatial locality:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class BinarySearchLocalityBenchmark {
@Param({"1000000"})
int size;
int[] sorted;
int[] randomQueries;
int[] localQueries;
@Setup(Level.Trial)
public void setup() {
sorted = new int[size];
for (int i = 0; i < size; i++) sorted[i] = i * 3;
ThreadLocalRandom rng = ThreadLocalRandom.current();
// Random queries across full range
randomQueries = new int[1000];
for (int i = 0; i < 1000; i++) {
randomQueries[i] = rng.nextInt(size) * 3;
}
// Local queries clustered in a small range
localQueries = new int[1000];
int center = (size / 2) * 3;
for (int i = 0; i < 1000; i++) {
localQueries[i] = center + rng.nextInt(-1000, 1001) * 3;
}
}
@Benchmark
public int searchRandom() {
int sum = 0;
for (int q : randomQueries) {
sum += Arrays.binarySearch(sorted, q); // Cache-cold
}
return sum;
}
@Benchmark
public int searchLocal() {
int sum = 0;
for (int q : localQueries) {
sum += Arrays.binarySearch(sorted, q); // Cache-warm: FAST
}
return sum;
}
}
| Query Pattern | Time per Search (ns) |
|---|---|
| Random queries | 95 |
| Local queries (clustered) | 38 |
Clustered queries are 2.5x faster because the bottom levels of the binary search tree share cache lines. The content platform’s article timeline queries are clustered: users browse recent articles, so searches concentrate in the latest 5% of the sorted timestamp array.
TreeMap vs Sorted Array for Range Queries
TreeMap provides O(log n) lookup and O(log n + k) range queries through subMap(). A sorted array provides O(log n) lookup through Arrays.binarySearch and O(log n + k) range queries through index arithmetic. The difference is memory layout.
TreeMap stores each entry in a TreeMap.Entry node object (40 bytes header + key ref + value ref + left/right/parent refs + color bit = ~64 bytes per entry). Each node is a separate heap object. Traversing the tree during a range query chases pointers across the heap.
A sorted array stores keys contiguously. Range query iteration is sequential memory access.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class RangeQueryBenchmark {
@Param({"100000", "500000", "1000000"})
int size;
@Param({"100", "1000", "10000"})
int rangeSize;
TreeMap<Long, String> treeMap;
long[] sortedTimestamps;
String[] articleIds;
long queryStart;
@Setup(Level.Trial)
public void setup() {
treeMap = new TreeMap<>();
sortedTimestamps = new long[size];
articleIds = new String[size];
long baseTime = 1_700_000_000_000L; // Epoch millis
for (int i = 0; i < size; i++) {
long ts = baseTime + i * 60_000L; // One article per minute
sortedTimestamps[i] = ts;
articleIds[i] = "art-" + i;
treeMap.put(ts, articleIds[i]);
}
// Query the middle range
int startIdx = size / 2;
queryStart = sortedTimestamps[startIdx];
}
@Benchmark
public List<String> rangeQueryTree() {
long queryEnd = queryStart + rangeSize * 60_000L;
NavigableMap<Long, String> sub = treeMap.subMap(
queryStart, true, queryEnd, true);
return new ArrayList<>(sub.values()); // SLOW: pointer chasing
}
@Benchmark
public List<String> rangeQueryArray() {
long queryEnd = queryStart + rangeSize * 60_000L;
int fromIdx = Arrays.binarySearch(sortedTimestamps, queryStart);
if (fromIdx < 0) fromIdx = -(fromIdx + 1);
int toIdx = Arrays.binarySearch(sortedTimestamps, queryEnd);
if (toIdx < 0) toIdx = -(toIdx + 1);
else toIdx++;
List<String> result = new ArrayList<>(toIdx - fromIdx);
for (int i = fromIdx; i < toIdx; i++) {
result.add(articleIds[i]); // FAST: sequential access
}
return result;
}
}
| Size | Range | TreeMap (us) | Sorted Array (us) | Speedup |
|---|---|---|---|---|
| 100K | 100 | 4.2 | 0.8 | 5.3x |
| 100K | 1,000 | 32 | 5.1 | 6.3x |
| 100K | 10,000 | 310 | 48 | 6.5x |
| 1M | 100 | 6.8 | 1.1 | 6.2x |
| 1M | 1,000 | 55 | 6.4 | 8.6x |
| 1M | 10,000 | 520 | 58 | 9.0x |
The sorted array is 5-9x faster for range queries. The advantage grows with data size because TreeMap’s pointer-chasing penalty increases as entries spread across more heap pages.
Trade-off: TreeMap supports O(log n) insertion and deletion. The sorted array requires O(n) insertion (shift all elements). For the content platform’s article index, which rebuilds every few minutes from the database, the rebuild cost is acceptable. For an index that updates on every request, TreeMap is the right choice.
Memory Footprint Comparison
Memory determines whether your index fits in L3 cache, which determines whether lookups hit cache or main memory:
public class MemoryFootprintComparison {
public static void main(String[] args) {
int size = 1_000_000;
// Sorted int array: 4 bytes per element
// Total: 4 * 1,000,000 = 4 MB
int[] sortedInts = new int[size];
// Sorted long array: 8 bytes per element
// Total: 8 * 1,000,000 = 8 MB
long[] sortedLongs = new long[size];
// HashMap<Integer, String>:
// Bucket array: 16 bytes * 1,333,334 (capacity at 0.75 load) = ~20 MB
// Node objects: ~48 bytes each * 1,000,000 = ~48 MB
// Integer keys: ~16 bytes each * 1,000,000 = ~16 MB
// Total: ~84 MB
HashMap<Integer, String> hashMap = HashMap.newHashMap(size);
// TreeMap<Long, String>:
// Entry objects: ~64 bytes each * 1,000,000 = ~64 MB
// Long keys: ~24 bytes each * 1,000,000 = ~24 MB
// Total: ~88 MB
TreeMap<Long, String> treeMap = new TreeMap<>();
}
}
| Structure | 1M Entries | Fits in L3 (8MB)? |
|---|---|---|
int[] | 4 MB | Yes |
long[] | 8 MB | Barely |
HashMap<Integer, String> | ~84 MB | No |
TreeMap<Long, String> | ~88 MB | No |
The sorted int[] fits in L3 cache. HashMap and TreeMap do not. This explains why binary search on the sorted array approaches HashMap single-lookup performance at large sizes: the array benefits from cache hits while HashMap suffers cache misses on every pointer chase.
Interpolation Search: When Data Is Uniformly Distributed
Binary search always checks the middle element. If keys are uniformly distributed, interpolation search estimates the position:
public class InterpolationSearch {
public static int interpolationSearch(int[] sorted, int key) {
int low = 0;
int high = sorted.length - 1;
while (low <= high && key >= sorted[low] && key <= sorted[high]) {
if (low == high) {
return sorted[low] == key ? low : -(low + 1);
}
// Estimate position assuming uniform distribution
int pos = low + ((key - sorted[low]) * (high - low))
/ (sorted[high] - sorted[low]);
if (sorted[pos] == key) return pos;
if (sorted[pos] < key) low = pos + 1;
else high = pos - 1;
}
return -(low + 1);
}
}
Interpolation search is O(log log n) on uniformly distributed data vs O(log n) for binary search. For 1,000,000 elements: log2(1,000,000) = ~20 comparisons for binary search, log2(log2(1,000,000)) = ~4 comparisons for interpolation search.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class InterpolationVsBinaryBenchmark {
@Param({"1000000"})
int size;
int[] uniform; // 0, 3, 6, 9, ... (uniform spacing)
int[] skewed; // Non-uniform: clustered with gaps
int searchKey;
@Setup(Level.Trial)
public void setup() {
// Uniform distribution
uniform = new int[size];
for (int i = 0; i < size; i++) uniform[i] = i * 3;
// Skewed: exponential-ish distribution
skewed = new int[size];
for (int i = 0; i < size; i++) {
skewed[i] = (int) (Math.pow(i, 1.5));
}
searchKey = uniform[size / 3]; // Known key
}
@Benchmark
public int binaryUniform() {
return Arrays.binarySearch(uniform, searchKey);
}
@Benchmark
public int interpolationUniform() {
return InterpolationSearch.interpolationSearch(uniform, searchKey);
}
@Benchmark
public int binarySkewed() {
return Arrays.binarySearch(skewed, (int) Math.pow(size / 3, 1.5));
}
@Benchmark
public int interpolationSkewed() {
return InterpolationSearch.interpolationSearch(
skewed, (int) Math.pow(size / 3, 1.5));
}
}
| Distribution | Binary Search (ns) | Interpolation (ns) | Winner |
|---|---|---|---|
| Uniform | 95 | 22 | Interpolation (4.3x) |
| Skewed | 95 | 180 | Binary (1.9x faster) |
Interpolation search wins on uniform data because it jumps directly to near the target in 2-3 steps. On skewed data, the position estimate overshoots or undershoots, causing it to degenerate to O(n) in the worst case.
Applicability: The content platform’s article IDs are sequential (uniform distribution). Interpolation search on the ID index cuts lookup time by 4x. The view count distribution is heavily skewed (power law), so binary search remains the right choice there.
Composite Index: Multi-Key Lookup
The content platform often queries by category and date range: “all articles in ‘java’ published in the last 7 days.” A single sorted array handles one key. A composite index handles two:
public class CompositeArticleIndex {
// Entries sorted by (category, timestamp)
private record IndexEntry(String category, long timestamp, String articleId)
implements Comparable<IndexEntry> {
@Override
public int compareTo(IndexEntry other) {
int cmp = this.category.compareTo(other.category);
if (cmp != 0) return cmp;
return Long.compare(this.timestamp, other.timestamp);
}
}
private IndexEntry[] entries;
public void build(List<Article> articles) {
entries = articles.stream()
.map(a -> new IndexEntry(a.category(), a.publishTimestamp(), a.id()))
.sorted()
.toArray(IndexEntry[]::new);
}
public List<String> query(String category, long fromTime, long toTime) {
// Binary search for the start of the category+time range
IndexEntry fromKey = new IndexEntry(category, fromTime, "");
IndexEntry toKey = new IndexEntry(category, toTime, "\uffff");
int fromIdx = Arrays.binarySearch(entries, fromKey);
if (fromIdx < 0) fromIdx = -(fromIdx + 1);
int toIdx = Arrays.binarySearch(entries, toKey);
if (toIdx < 0) toIdx = -(toIdx + 1);
List<String> result = new ArrayList<>(toIdx - fromIdx);
for (int i = fromIdx; i < toIdx; i++) {
result.add(entries[i].articleId()); // FAST: sorted scan
}
return result;
}
}
The composite index replaces two separate lookups (filter by category, then filter by date) with a single binary search followed by a sequential scan. For 500,000 articles with 50 categories:
| Approach | Time (us) |
|---|---|
| HashMap(category) then filter dates | 420 |
| Composite sorted index | 12 |
The HashMap approach retrieves ~10,000 articles for the category and scans all of them. The composite index binary-searches to the exact start position and scans only matching entries.
The Search Strategy Decision Matrix
| Access Pattern | Best Structure | Why |
|---|---|---|
| Single-key exact lookup | HashMap | O(1) amortized, ~10ns |
| Range query on one key | Sorted array + binarySearch | O(log n + k), cache-friendly |
| Multi-key composite query | Composite sorted array | Single binarySearch, sequential scan |
| Frequent inserts + lookups | TreeMap | O(log n) for both |
| Uniform key distribution | Sorted array + interpolation | O(log log n), 4x over binary |
| Memory-constrained | Sorted primitive array | 20x less memory than HashMap |
| Concurrent reads + writes | ConcurrentSkipListMap | Lock-free range queries |
Each structure has a regime where it wins. The content platform uses all of them: HashMap for the session cache (single-key, frequent access), sorted arrays for the publish-date index (range queries, infrequent updates), and the composite index for category-filtered timeline queries.