Caching at the Application Layer: Hit Rates, Eviction, and the Cache That Made Things Slower
Caching at the Application Layer: Hit Rates, Eviction, and the Cache That Made Things Slower
The content platform’s article service handles 15,000 reads per second. The team added a cache. Response times improved. Then a popular article went viral, the cache entry expired, and 3,000 concurrent requests hit the database simultaneously. The database connection pool saturated. Queries backed up. The service that was handling 15,000 req/s with a cache could not handle 3,000 req/s without one. The cache had made the system faster in the average case and fragile in the worst case.
Caching is not a performance optimization. It is a trade-off that exchanges memory and consistency for latency and throughput. Done poorly, it introduces failure modes that did not exist before: stampedes, stale data serving, memory pressure from unbounded growth, and cold-start latency spikes after deployments.
This chapter covers the content platform’s caching architecture: local caches with Caffeine, distributed caches with Redis, the eviction policies that determine whether a cache helps or hurts, and the stampede problem that turns cache expiration into a denial-of-service event.
The Cache That Made Things Slower
The team’s first cache implementation used a ConcurrentHashMap with manual TTL management:
// SLOW: Naive caching with stampede vulnerability
public class ArticleService {
private final ConcurrentHashMap<String, CachedArticle> cache =
new ConcurrentHashMap<>();
private final ArticleRepository repository;
public Article getArticle(String id) {
CachedArticle cached = cache.get(id);
if (cached != null && !cached.isExpired()) {
return cached.article();
}
// Cache miss: every concurrent request hits the database
Article article = repository.findById(id);
cache.put(id, new CachedArticle(article, Instant.now()));
return article;
}
record CachedArticle(Article article, Instant cachedAt) {
boolean isExpired() {
return Instant.now().isAfter(cachedAt.plusSeconds(300));
}
}
}
This implementation has four problems:
- Stampede vulnerability: When a cache entry expires, every concurrent request for that key hits the database. For a viral article receiving 500 req/s, that is 500 simultaneous database queries.
- Unbounded growth: The ConcurrentHashMap grows without limit. The platform has 200,000 articles. Caching all of them consumes 3 GB of heap.
- No eviction policy: Old, unread articles occupy the same space as trending articles with 10,000 views per hour.
- Manual TTL:
Instant.now()on every read adds clock overhead and does not handle time-based cleanup.
The stampede is the critical failure mode. Here is what happens when a popular article’s cache entry expires:
Time 0ms: Cache entry expires for article "viral-post-2024"
Time 1ms: Request A: cache miss -> query database
Time 2ms: Request B: cache miss -> query database
Time 3ms: Request C: cache miss -> query database
...
Time 10ms: Requests D through Z: all cache misses -> 23 more database queries
Time 50ms: Database connection pool exhausted (max 20 connections)
Time 51ms: Request AA: connection pool timeout -> 500 error
Time 200ms: Request A completes, populates cache
Time 201ms: Requests using cache again, but 150+ requests already failed
The database can handle 15 queries for different articles spread across time. It cannot handle 500 queries for the same article in a 50 ms window.
Caffeine: The Cache That Gets It Right
Caffeine is a high-performance, near-optimal caching library for the JVM. It uses the W-TinyLFU eviction policy (covered in depth in Section 1), which achieves hit rates within 1% of optimal. It handles stampede prevention, bounded sizing, and expiration without manual management.
// FAST: Caffeine with stampede prevention
public class ArticleService {
private final AsyncLoadingCache<String, Article> cache;
private final ArticleRepository repository;
public ArticleService(ArticleRepository repository) {
this.repository = repository;
this.cache = Caffeine.newBuilder()
.maximumSize(10_000)
.expireAfterWrite(Duration.ofMinutes(5))
.refreshAfterWrite(Duration.ofMinutes(4))
.buildAsync(this::loadArticle);
}
public CompletableFuture<Article> getArticle(String id) {
return cache.get(id);
}
private Article loadArticle(String id) {
return repository.findById(id);
}
}
Key configuration decisions:
maximumSize(10_000): Bounds memory. Caffeine evicts the least valuable entries when the limit is reached. For articles averaging 30 KB in memory, this bounds the cache at roughly 300 MB.expireAfterWrite(5 minutes): Hard TTL. No entry survives longer than 5 minutes, bounding staleness.refreshAfterWrite(4 minutes): Proactive refresh. After 4 minutes, the next access triggers an async reload while serving the stale value. This eliminates the stampede window.buildAsync: The loading function runs once per key, even under concurrent access. 500 simultaneous requests for the same key result in 1 database query.
The difference between expireAfterWrite and refreshAfterWrite is critical. Expiration removes the entry, causing a cache miss. Refresh keeps the stale entry available while loading the new value asynchronously. The combination means: entries refresh proactively at 4 minutes (no miss), and if the refresh fails, the stale entry expires at 5 minutes (bounded staleness).
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class CacheImplementationBenchmark {
private static final int ARTICLE_COUNT = 100_000;
private static final int CACHE_SIZE = 10_000;
private ConcurrentHashMap<String, Article> concurrentHashMap;
private Cache<String, Article> caffeineCache;
private String[] articleIds;
private Article[] articles;
@Setup(Level.Trial)
public void setup() {
concurrentHashMap = new ConcurrentHashMap<>();
caffeineCache = Caffeine.newBuilder()
.maximumSize(CACHE_SIZE)
.build();
articles = new Article[ARTICLE_COUNT];
articleIds = new String[ARTICLE_COUNT];
for (int i = 0; i < ARTICLE_COUNT; i++) {
articleIds[i] = "article-" + i;
articles[i] = createArticle(i);
}
// Pre-populate both caches
for (int i = 0; i < ARTICLE_COUNT; i++) {
concurrentHashMap.put(articleIds[i], articles[i]);
caffeineCache.put(articleIds[i], articles[i]);
}
}
@Benchmark
@Threads(8)
public Article concurrentHashMapGet(ThreadState state) {
// SLOW: Unbounded, no eviction, grows to consume all heap
return concurrentHashMap.get(articleIds[state.index()]);
}
@Benchmark
@Threads(8)
public Article caffeineGet(ThreadState state) {
// FAST: Bounded, W-TinyLFU eviction, near-optimal hit rate
return caffeineCache.getIfPresent(articleIds[state.index()]);
}
@State(Scope.Thread)
public static class ThreadState {
private int counter;
// Zipfian distribution: 20% of articles get 80% of requests
private final int[] accessPattern = generateZipfianPattern(ARTICLE_COUNT, 10_000);
public int index() {
return accessPattern[counter++ % accessPattern.length];
}
}
}
| Implementation | Throughput (ops/s) | Heap After 1M ops | Hit Rate |
|---|---|---|---|
| ConcurrentHashMap (100K entries) | 45M | 3.2 GB | 100% (no eviction) |
| Caffeine (10K max) | 42M | 320 MB | 94.2% (Zipfian) |
ConcurrentHashMap is 7% faster in raw throughput because it does no eviction bookkeeping. But it uses 10x the memory. And the 100% hit rate is deceptive: it assumes the map fits in heap. When the map grows beyond available heap, GC pressure destroys throughput. Caffeine’s 94.2% hit rate on a Zipfian workload (realistic for content platforms where popular articles dominate traffic) means it serves 94 out of 100 requests from cache while using bounded memory.
Eviction Policies: LRU Is Not the Answer
Most cache implementations default to Least Recently Used (LRU) eviction. LRU evicts the entry that was accessed least recently. This works well for temporal locality: if you accessed something recently, you will access it again soon. It works poorly for frequency-based workloads.
The content platform has both temporal and frequency patterns:
- Temporal: New articles get burst traffic that fades
- Frequency: Evergreen articles (“Java Performance Guide”) get steady, consistent traffic
LRU fails for the frequency case. A scan pattern (a batch job iterating all articles for reindexing) evicts the high-frequency evergreen articles in favor of entries that will never be accessed again.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class EvictionPolicyBenchmark {
private static final int TOTAL_ITEMS = 100_000;
private static final int CACHE_SIZE = 5_000;
@Param({"zipfian", "scan_then_zipfian"})
private String workloadType;
private Cache<String, Article> lruCache;
private Cache<String, Article> tinyLfuCache;
private String[] accessSequence;
@Setup(Level.Trial)
public void setup() {
// Guava-style LRU cache
lruCache = Caffeine.newBuilder()
.maximumSize(CACHE_SIZE)
.eviction() // forces SIZE-based, but Caffeine uses W-TinyLFU by default
.build();
tinyLfuCache = Caffeine.newBuilder()
.maximumSize(CACHE_SIZE)
.build(); // W-TinyLFU is the default
if ("scan_then_zipfian".equals(workloadType)) {
// Simulates reindexing scan followed by normal traffic
accessSequence = generateScanThenZipfian(TOTAL_ITEMS, 500_000);
} else {
accessSequence = generateZipfianKeys(TOTAL_ITEMS, 500_000);
}
}
@Benchmark
public long lruHitRate() {
long hits = 0;
for (String key : accessSequence) {
if (lruCache.getIfPresent(key) != null) {
hits++;
} else {
lruCache.put(key, loadArticle(key));
}
}
return hits;
}
@Benchmark
public long tinyLfuHitRate() {
long hits = 0;
for (String key : accessSequence) {
if (tinyLfuCache.getIfPresent(key) != null) {
hits++;
} else {
tinyLfuCache.put(key, loadArticle(key));
}
}
return hits;
}
}
| Workload | LRU Hit Rate | W-TinyLFU Hit Rate | Difference |
|---|---|---|---|
| Pure Zipfian | 82.1% | 94.2% | +12.1% |
| Scan then Zipfian | 41.3% | 91.8% | +50.5% |
The scan-then-Zipfian workload is devastating for LRU. The reindexing scan touches every article once, evicting all the frequently accessed articles. After the scan completes, the cache is cold: filled with articles from the tail of the Zipfian distribution that will rarely be accessed again. W-TinyLFU resists this because it requires both recency and frequency for admission. A scan that touches each item once does not build enough frequency to displace established high-frequency entries.
On the content platform, this scan pattern happens daily when the search indexer reindexes all articles. With an LRU cache, the 30-minute window after reindexing shows a 40% increase in database queries and a 200 ms increase in P99 latency. With W-TinyLFU, the reindexing scan has no measurable effect on cache hit rates.
Cache Sizing: The Memory Budget
Cache sizing is not “set it to 10,000 and forget it.” The right size depends on:
- Working set size: How many distinct items are accessed in the TTL window
- Item size: Memory per cached entry
- Available heap: Total heap minus application working memory
- Hit rate curve: The relationship between cache size and hit rate
The content platform measured the hit rate curve by running production traffic patterns against caches of varying sizes:
| Cache Size | Hit Rate | Memory | DB Queries/s (saved) |
|---|---|---|---|
| 1,000 | 71.2% | 32 MB | 10,680 |
| 5,000 | 89.4% | 160 MB | 13,410 |
| 10,000 | 94.2% | 320 MB | 14,130 |
| 25,000 | 96.8% | 800 MB | 14,520 |
| 50,000 | 97.4% | 1.6 GB | 14,610 |
The diminishing returns are clear. Going from 1,000 to 10,000 entries (10x increase) improves the hit rate by 23 percentage points. Going from 10,000 to 50,000 entries (5x increase) improves it by 3.2 percentage points. The team chose 10,000 entries: 320 MB of heap for a 94.2% hit rate, saving 14,130 database queries per second.
The formula for estimating optimal cache size on a Zipfian workload:
$$\text{optimal size} \approx \text{total items} \times (1 - \text{target miss rate})^{1/\alpha}$$
Where $\alpha$ is the Zipfian skew parameter (typically 0.8-1.2 for web traffic). For the content platform with 200,000 articles, $\alpha = 1.0$, and a target miss rate of 6%:
$$\text{optimal size} \approx 200{,}000 \times (1 - 0.06)^{1/1.0} \approx 200{,}000 \times 0.94 \approx 188{,}000$$
This formula overpredicts because it assumes uniform item importance. In practice, the Zipfian distribution means a small cache captures most of the value. The empirical hit-rate curve is the reliable tool.
Local vs Distributed: When You Need Both
The content platform runs on 8 application nodes behind a load balancer. Each node maintains a local Caffeine cache. This creates consistency problems:
- Node A caches article X at time T
- Article X is updated at time T+1
- Node B loads the updated article X
- Node A continues serving the stale version until TTL expires
For the content platform, 5-minute staleness is acceptable for articles. But for view counts displayed on the article page, 5-minute staleness means the counter shows 45,000 views while the real count is 45,800. Users notice.
The solution: a two-tier cache with Caffeine locally and Redis as the distributed layer.
// FAST: Two-tier caching with local and distributed layers
public class TieredArticleCache {
private final AsyncLoadingCache<String, Article> localCache;
private final RedisTemplate<String, Article> redis;
private final ArticleRepository repository;
public TieredArticleCache(RedisTemplate<String, Article> redis,
ArticleRepository repository) {
this.redis = redis;
this.repository = repository;
this.localCache = Caffeine.newBuilder()
.maximumSize(5_000)
.expireAfterWrite(Duration.ofSeconds(30))
.buildAsync(this::loadFromRedisOrDb);
}
public CompletableFuture<Article> getArticle(String id) {
return localCache.get(id);
}
private Article loadFromRedisOrDb(String id) {
// Check Redis first (distributed, longer TTL)
Article cached = redis.opsForValue().get("article:" + id);
if (cached != null) {
return cached;
}
// Miss on both layers: load from database
Article article = repository.findById(id);
// Populate Redis with longer TTL
redis.opsForValue().set("article:" + id, article,
Duration.ofMinutes(5));
return article;
}
public void invalidate(String id) {
// Invalidate both layers on write
redis.delete("article:" + id);
localCache.synchronous().invalidate(id);
}
}
The two tiers serve different purposes:
| Layer | TTL | Size | Latency | Purpose |
|---|---|---|---|---|
| Caffeine (local) | 30s | 5,000 | 50 ns | Hot-path acceleration |
| Redis (distributed) | 5 min | 200,000 | 0.5 ms | Cross-node consistency |
| Database | N/A | All | 2-10 ms | Source of truth |
Local cache handles 94% of requests at nanosecond latency. Redis handles 5% at sub-millisecond latency. The database handles 1% at millisecond latency. Write-through invalidation on the Redis layer propagates changes to all nodes within 30 seconds (the local cache TTL).
The trade-off: two caches mean two sources of potential staleness, two eviction policies to tune, and two layers of memory to budget. For read-heavy workloads with moderate consistency requirements (the content platform’s core use case), the complexity is justified by the 200x latency reduction on cache hits.
Measuring Cache Effectiveness
A cache without metrics is a cache you cannot reason about. The content platform exposes these Caffeine metrics through Micrometer:
public CacheMetrics configureCacheMetrics(MeterRegistry registry,
Cache<String, Article> cache) {
CaffeineCacheMetrics.monitor(registry, cache, "article-cache");
// Metrics exposed:
// cache.size - current entry count
// cache.hit.count - total hits
// cache.miss.count - total misses
// cache.eviction.count - total evictions
// cache.load.duration - time spent loading on misses
return new CacheMetrics(cache);
}
The metrics that matter:
- Hit rate: Below 80%? The cache is too small or the workload has no locality. Below 50%? The cache is actively harmful (overhead without benefit).
- Eviction rate: If the eviction rate is close to the miss rate, the cache is undersized. Entries are being evicted before they can be reused.
- Load duration P99: If the P99 load duration is above 50 ms, database queries under cache misses are slow. This amplifies stampede damage.
- Size vs maximum: If size is consistently at maximum, consider increasing the limit. If size is consistently at 60% of maximum, the cache is oversized and wasting heap.
The content platform sets alerts on hit rate dropping below 85% and load duration P99 exceeding 100 ms. Both conditions indicate the cache is no longer effectively protecting the database.
The Cold Start Problem
After a deployment, every node starts with an empty cache. At 15,000 req/s across 8 nodes, each node handles roughly 1,875 req/s. With an empty cache, all 1,875 requests hit the database. Rolling deployments help: only one node restarts at a time, and the other 7 absorb the load. But the cold node needs time to warm its cache.
// Cache warming on startup
@PostConstruct
public void warmCache() {
// Load the top 1,000 most-accessed articles from the last hour
List<String> topArticleIds = analyticsService
.getTopArticleIds(Duration.ofHours(1), 1_000);
topArticleIds.parallelStream()
.forEach(id -> {
try {
cache.get(id).join();
} catch (Exception e) {
log.warn("Failed to warm cache for article: {}", id);
}
});
log.info("Cache warmed with {} articles", topArticleIds.size());
}
Cache warming pre-loads the most frequently accessed entries, reducing the cold-start hit-rate dip from 0% to approximately 70% immediately after startup. The remaining 30% fills organically within 2-3 minutes of live traffic. Without warming, it takes 8-10 minutes to reach the steady-state 94% hit rate, during which the database bears 6x the normal load from that node.
The trade-off: cache warming adds 3-5 seconds to startup time and creates a burst of database queries during deployment. For the content platform, this is acceptable because deployments happen during low-traffic windows.