Probabilistic Data Structures: Bloom Filters, HyperLogLog, and Count-Min Sketch as Production Tools
Probabilistic Data Structures: Bloom Filters, HyperLogLog, and Count-Min Sketch as Production Tools
The content platform ingests 50,000 articles per day from RSS feeds, social media APIs, and partner syndication. Of those 50,000, approximately 15,000 are duplicates: the same article syndicated across multiple sources, or the same article re-fetched because the source’s “last modified” timestamp changed without content changes. The duplicate detection problem is straightforward: has this article (identified by content hash) been seen before?
The straightforward solution is a HashSet<String> containing all previously seen content hashes. After one year, that set contains 18 million entries. Each String content hash (64-character hex) is 128 bytes (object header + char array header + 64 chars at 2 bytes each). Each HashSet entry is a HashMap.Node at ~48 bytes. Total memory: 18 million * 176 bytes = ~3.2 GB.
A Bloom filter containing 18 million entries with a 1% false positive rate uses 17 MB. That is 188x less memory. The cost: 1% of duplicate articles pass through and must be caught by a secondary check. For the ingestion pipeline, where duplicates are deduplicated again at the database layer, a 1% false positive rate is acceptable.
This chapter implements three probabilistic data structures from scratch, benchmarks them against exact alternatives, then shows the production libraries that handle the edge cases.
The Probabilistic Contract
Every probabilistic data structure makes the same trade: memory for accuracy. The contract is precise:
- Bloom filter: “If I say no, the element is definitely not in the set. If I say yes, the element is probably in the set (false positive rate = p).”
- HyperLogLog: “The count of unique elements is approximately N, with standard error of ~1.04 / sqrt(2^precision).”
- Count-Min Sketch: “The frequency of this element is at most f. It might be higher than the true frequency (overcounting), but never lower.”
No undercounting. No false negatives (for Bloom filters). The errors are bounded and configurable. You choose the memory budget and get a guaranteed accuracy bound.
Bloom Filter: The Core Idea
A Bloom filter is a bit array of m bits, initialized to all zeros. To add an element, hash it with k independent hash functions, each producing an index in [0, m). Set those k bits to 1. To test membership, hash the element with the same k functions and check if all k bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element is probably in the set.
The false positive probability after inserting n elements:
$$p \approx \left(1 - e^{-kn/m}\right)^k$$
The optimal number of hash functions:
$$k = \frac{m}{n} \ln 2$$
Given a desired false positive rate p and expected element count n:
$$m = -\frac{n \ln p}{(\ln 2)^2}$$
For the content platform: n = 18,000,000, p = 0.01:
$$m = -\frac{18{,}000{,}000 \times \ln(0.01)}{(\ln 2)^2} = -\frac{18{,}000{,}000 \times (-4.605)}{0.4805} \approx 172{,}500{,}000 \text{ bits} \approx 20.6 \text{ MB}$$
$$k = \frac{172{,}500{,}000}{18{,}000{,}000} \times 0.693 \approx 6.64 \approx 7$$
20.6 MB with 7 hash functions gives a 1% false positive rate for 18 million articles. Compare to 3.2 GB for the exact HashSet.
public class BloomFilter {
private final long[] bits;
private final int numBits;
private final int numHashFunctions;
public BloomFilter(int expectedElements, double falsePositiveRate) {
this.numBits = optimalNumBits(expectedElements, falsePositiveRate);
this.numHashFunctions = optimalNumHashes(numBits, expectedElements);
this.bits = new long[(numBits + 63) >>> 6]; // Round up to long boundary
}
public void add(byte[] element) {
long hash64 = murmurHash64(element);
int h1 = (int) hash64;
int h2 = (int) (hash64 >>> 32);
for (int i = 0; i < numHashFunctions; i++) {
int combinedHash = h1 + i * h2;
int index = (combinedHash & Integer.MAX_VALUE) % numBits;
bits[index >>> 6] |= 1L << (index & 63);
}
}
public boolean mightContain(byte[] element) {
long hash64 = murmurHash64(element);
int h1 = (int) hash64;
int h2 = (int) (hash64 >>> 32);
for (int i = 0; i < numHashFunctions; i++) {
int combinedHash = h1 + i * h2;
int index = (combinedHash & Integer.MAX_VALUE) % numBits;
if ((bits[index >>> 6] & (1L << (index & 63))) == 0) {
return false; // Definitely not present
}
}
return true; // Probably present
}
private static int optimalNumBits(int n, double p) {
return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
private static int optimalNumHashes(int m, int n) {
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
private static long murmurHash64(byte[] data) {
long h = 0xcbf29ce484222325L;
for (byte b : data) {
h ^= b;
h *= 0x100000001b3L;
}
return h;
}
}
The implementation uses the double-hashing technique from Kirsch and Mitzenmacher (2004): generate two hash values h1 and h2, then derive k hash functions as h1 + i*h2. This gives the same false positive rate as k independent hash functions with only one hash computation.
HyperLogLog: The Core Idea
HyperLogLog estimates the number of distinct elements in a stream using O(log log n) memory. The insight: if you hash elements uniformly, the maximum number of leading zeros in any hash value is approximately log2(n) where n is the number of distinct elements. One element with 20 leading zeros implies ~1 million distinct elements.
A single maximum is noisy. HyperLogLog splits elements into 2^p registers (buckets) using the first p bits of the hash. Each register stores the maximum number of leading zeros seen in its bucket. The estimate combines all registers using a harmonic mean:
$$E = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M_j}\right)^{-1}$$
where m = 2^p registers, M_j is the value of register j, and alpha_m is a bias correction constant.
Standard error: 1.04 / sqrt(m). With 2^14 = 16,384 registers (p = 14), the error is ~0.81%. Memory: 16,384 registers * 6 bits each = ~12 KB. That estimates cardinalities up to billions with less than 1% error.
Count-Min Sketch: The Core Idea
Count-Min Sketch estimates element frequency. It uses d independent hash functions and a d x w matrix of counters. To increment: hash the element with each function, increment the corresponding counter in each row. To query: hash, return the minimum counter value across all rows.
The minimum eliminates collisions from other elements. With d = ceil(ln(1/delta)) and w = ceil(e/epsilon), the estimate for any element overestimates by at most epsilon * N (total count of all elements) with probability at least 1 - delta.
For the content platform’s recommendation engine, Count-Min Sketch tracks article view frequencies across the last hour. With epsilon = 0.001 and delta = 0.01, the sketch uses 5 rows and ~2,718 columns = ~13,600 counters = ~54 KB. It provides frequency estimates within 0.1% of total views for any article.
Memory vs Accuracy: The Production Decision
| Structure | 18M Elements | Memory | Accuracy |
|---|---|---|---|
HashSet<String> | exact | 3.2 GB | 100% |
| Bloom filter (1% FPR) | membership | 20.6 MB | 99% (no false negatives) |
| Bloom filter (0.1% FPR) | membership | 27.5 MB | 99.9% |
| HyperLogLog (p=14) | count only | 12 KB | ~0.81% error |
| Count-Min Sketch | frequency | 54 KB | overcounts by <0.1% of total |
The content platform uses all three:
- Bloom filter in the ingestion pipeline to skip known duplicates before the expensive content-hash database lookup.
- HyperLogLog for the unique-visitor counter on the analytics dashboard, updated on every page view.
- Count-Min Sketch in the recommendation engine to estimate article popularity without maintaining exact per-article counters.
The sections that follow implement each from scratch, benchmark against exact alternatives, then show the production libraries (Guava for Bloom filters, Redis for HyperLogLog and sorted frequency tracking).