Lock Contention: From synchronized to StampedLock

Lock contention is not a property of the lock. It is a property of how long the lock is held relative to how frequently it is requested. A synchronized block that executes in 10 nanoseconds and is called once per second has zero contention. The same block called by 32 threads in a tight loop will bring your throughput below single-threaded performance.

The content platform has three distinct locking patterns. The article cache is read-heavy: 500 reads per write. The view counter is write-heavy: all writes, reads only for display. The rate limiter is mixed: a read to check the current window, a write to increment. Each pattern demands a different lock.

The Benchmark Harness

All benchmarks in this section use the same structure. The state object holds the shared data. The @Threads annotation varies from 1 to 64. We measure throughput in operations per microsecond.

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 3)
@Fork(2)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class LockComparisonBenchmark {

    private long synchronizedCount = 0;
    private final ReentrantLock reentrantLock = new ReentrantLock();
    private long reentrantCount = 0;
    private final StampedLock stampedLock = new StampedLock();
    private long stampedCount = 0;

    // --- synchronized ---

    @Benchmark
    public long synchronizedIncrement() {
        synchronized (this) {
            return ++synchronizedCount;
        }
    }

    // --- ReentrantLock ---

    @Benchmark
    public long reentrantIncrement() {
        reentrantLock.lock();
        try {
            return ++reentrantCount;
        } finally {
            reentrantLock.unlock();
        }
    }

    // --- StampedLock write ---

    @Benchmark
    public long stampedIncrement() {
        long stamp = stampedLock.writeLock();
        try {
            return ++stampedCount;
        } finally {
            stampedLock.unlockWrite(stamp);
        }
    }
}

Running this at each thread count reveals the contention curve. Single-threaded, all three locks perform within 15% of each other. The differences emerge under pressure.

The Contention Curve

Threads	synchronized (ops/μs)	ReentrantLock (ops/μs)	StampedLock write (ops/μs)
1	215	198	185
2	142	155	148
4	78	122	118
8	38	98	93
16	17	74	70
32	10	56	53
64	6	41	38

synchronized loses 97% of its throughput between 1 and 64 threads. ReentrantLock loses 79%. StampedLock write loses 79%.

The synchronized collapse at 8+ threads is the contention cliff. The JVM has inflated the monitor to a fat lock, and every acquisition involves:

A CAS attempt that fails because another thread holds the lock
A brief spin that fails because the holder is not about to release
A pthread_mutex_lock call into the kernel
A context switch that evicts the thread’s cache lines
A wakeup after the lock releases, reloading cold cache data

Steps 4 and 5 are the expensive ones. Context switches cost 5-15 microseconds, but the secondary cost of cache pollution is often larger. The thread resumes with cold L1 and L2 caches, requiring hundreds of nanoseconds to reload working data.

Read-Write Separation with StampedLock

The article cache stores metadata that changes infrequently. Using a write lock for reads wastes most of the lock’s potential.

// SLOW: ReentrantReadWriteLock for the article cache
public class ArticleCacheSlow {

    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final Map<String, ArticleMetadata> cache = new HashMap<>();

    public ArticleMetadata get(String articleId) {
        rwLock.readLock().lock();
        try {
            return cache.get(articleId);
        } finally {
            rwLock.readLock().unlock();
        }
    }

    public void put(String articleId, ArticleMetadata metadata) {
        rwLock.writeLock().lock();
        try {
            cache.put(articleId, metadata);
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}

ReentrantReadWriteLock allows concurrent reads, but readers must still acquire a read lock. Each acquisition is a CAS on a shared counter. Under 16 reading threads, that shared counter becomes a contention point.

// FAST: StampedLock with optimistic reads
public class ArticleCacheFast {

    private final StampedLock lock = new StampedLock();
    private final Map<String, ArticleMetadata> cache = new HashMap<>();

    public ArticleMetadata get(String articleId) {
        // Optimistic read: no lock acquired
        long stamp = lock.tryOptimisticRead();
        ArticleMetadata result = cache.get(articleId);

        // Validate: did a write happen during our read?
        if (!lock.validate(stamp)) {
            // Fall back to pessimistic read lock
            stamp = lock.readLock();
            try {
                result = cache.get(articleId);
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return result;
    }

    public void put(String articleId, ArticleMetadata metadata) {
        long stamp = lock.writeLock();
        try {
            cache.put(articleId, metadata);
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}

The optimistic read path calls tryOptimisticRead(), which returns a stamp without acquiring any lock. It then reads the data. After reading, validate(stamp) checks whether any write lock was acquired between the tryOptimisticRead and the validate. If no write occurred, the read completes without any synchronization. If a write did occur, we fall back to a pessimistic read lock.

For the content platform’s 500:1 read-write ratio, optimistic reads succeed 499 out of 500 times. The benchmark confirms:

Threads	RRWL read (ops/μs)	StampedLock optimistic (ops/μs)	Speedup
1	190	310	1.6x
8	145	295	2.0x
16	110	290	2.6x
32	82	285	3.5x
64	55	278	5.1x

The StampedLock optimistic read line is nearly flat. At 64 threads, it delivers 5x the throughput of ReentrantReadWriteLock.

The caveat: optimistic reads are not suitable for data that requires consistency during the read. If your read operation performs multiple field accesses that must be consistent with each other, the optimistic path might see a torn write (field A updated, field B not yet). The validate call will catch this and fall back to the pessimistic path, but the initial read may have observed inconsistent state. For the article cache, where we read a single reference from a HashMap, this is safe. For reading multiple fields from a mutable object, you need the pessimistic fallback.

Lock Coarsening vs Lock Splitting

Two opposing strategies exist for reducing contention, and choosing wrong makes things worse.

Lock coarsening merges multiple fine-grained lock acquisitions into one. This reduces the number of lock operations but increases hold time.

// SLOW: Lock acquired per field access
public void updateArticleStats(String id, long views, long shares) {
    synchronized (this) {
        viewCounts.merge(id, views, Long::sum);
    }
    synchronized (this) {
        shareCounts.merge(id, shares, Long::sum);
    }
}

// FAST (sometimes): Single lock acquisition for both
public void updateArticleStats(String id, long views, long shares) {
    synchronized (this) {
        viewCounts.merge(id, views, Long::sum);
        shareCounts.merge(id, shares, Long::sum);
    }
}

The JIT compiler performs lock coarsening automatically when it detects adjacent synchronized blocks on the same monitor. But it only does this within a single method, and only when no interleaving code exists between the blocks.

Lock splitting breaks one lock into multiple, reducing the number of threads competing for each lock.

// SLOW: Single lock for all article operations
public class ArticleService {

    private final Object lock = new Object();
    private final Map<String, Long> viewCounts = new HashMap<>();
    private final Map<String, Long> shareCounts = new HashMap<>();
    private final Map<String, String> drafts = new HashMap<>();

    public void recordView(String id) {
        synchronized (lock) {
            viewCounts.merge(id, 1L, Long::sum);
        }
    }

    public void recordShare(String id) {
        synchronized (lock) {
            shareCounts.merge(id, 1L, Long::sum);
        }
    }

    public void saveDraft(String id, String content) {
        synchronized (lock) {
            drafts.put(id, content);
        }
    }
}

// FAST: Separate locks per concern
public class ArticleService {

    private final Object viewLock = new Object();
    private final Object shareLock = new Object();
    private final Object draftLock = new Object();
    private final Map<String, Long> viewCounts = new HashMap<>();
    private final Map<String, Long> shareCounts = new HashMap<>();
    private final Map<String, String> drafts = new HashMap<>();

    public void recordView(String id) {
        synchronized (viewLock) {
            viewCounts.merge(id, 1L, Long::sum);
        }
    }

    public void recordShare(String id) {
        synchronized (shareLock) {
            shareCounts.merge(id, 1L, Long::sum);
        }
    }

    public void saveDraft(String id, String content) {
        synchronized (draftLock) {
            drafts.put(id, content);
        }
    }
}

Threads recording views no longer block threads saving drafts. If views account for 80% of traffic, splitting the lock reduces contention on the view path by 80%.

The decision rule: coarsen when you have many rapid, adjacent acquisitions of the same lock (the overhead of lock/unlock exceeds the hold time). Split when different operations access independent data through the same lock.

Profiling Lock Contention

async-profiler’s lock mode captures every contention event:

# Start lock contention profiling
./asprof -e lock -d 60 -f contention.html <pid>

The resulting flame graph shows where threads block. Width represents total blocked time, not count. A narrow frame called a million times matters less than a wide frame called once per second with a 100ms hold time.

What to look for in the lock contention profile:

Application synchronized blocks appearing wide: your code is the bottleneck. Measure hold time. Reduce it by moving I/O outside the block, splitting the lock, or switching primitives.

ConcurrentHashMap.computeIfAbsent appearing wide: you are doing expensive work inside the lambda. The computeIfAbsent lambda runs under a bucket lock. Extract the computation.

java.util.concurrent.locks.AbstractQueuedSynchronizer: ReentrantLock contention. The lock itself is not the problem. The hold time is. Reduce the critical section.

sun.misc.Unsafe.park dominating: threads are sleeping on OS mutexes. This indicates fat lock inflation. Consider whether you need mutual exclusion at all, or whether a lock-free structure works.

The content platform’s profiling session revealed that the rate limiter’s synchronized block accounted for 34% of all lock wait time. The critical section performed a TreeMap.headMap().size() call to count requests in the current window. Replacing TreeMap with a circular buffer of atomic counters eliminated the lock entirely:

// SLOW: TreeMap under synchronized for sliding window rate limit
public class SlidingWindowRateLimiter {

    private final TreeMap<Long, Integer> windows = new TreeMap<>();
    private final int maxRequests;
    private final long windowMillis;

    public SlidingWindowRateLimiter(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long cutoff = now - windowMillis;
        windows.headMap(cutoff).clear();
        int total = windows.values().stream()
                          .mapToInt(Integer::intValue).sum();
        if (total >= maxRequests) {
            return false;
        }
        windows.merge(now, 1, Integer::sum);
        return true;
    }
}

// FAST: Fixed-window approximation with AtomicInteger array
public class AtomicSlidingWindowRateLimiter {

    private final AtomicInteger[] buckets;
    private final AtomicLong currentBucketIndex = new AtomicLong(0);
    private final int maxRequests;
    private final long bucketDurationMillis;
    private volatile long lastResetTime;

    public AtomicSlidingWindowRateLimiter(int maxRequests,
                                          long windowMillis,
                                          int numBuckets) {
        this.maxRequests = maxRequests;
        this.bucketDurationMillis = windowMillis / numBuckets;
        this.buckets = new AtomicInteger[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            buckets[i] = new AtomicInteger(0);
        }
        this.lastResetTime = System.currentTimeMillis();
    }

    public boolean tryAcquire() {
        advanceBuckets();
        int total = 0;
        for (AtomicInteger bucket : buckets) {
            total += bucket.get();
        }
        if (total >= maxRequests) {
            return false;
        }
        int idx = (int) (currentBucketIndex.get() % buckets.length);
        buckets[idx].incrementAndGet();
        return true;
    }

    private void advanceBuckets() {
        long now = System.currentTimeMillis();
        long elapsed = now - lastResetTime;
        long bucketsToAdvance = elapsed / bucketDurationMillis;
        if (bucketsToAdvance > 0) {
            for (long i = 0; i < Math.min(bucketsToAdvance, buckets.length); i++) {
                int idx = (int) ((currentBucketIndex.get() + 1 + i) % buckets.length);
                buckets[idx].set(0);
            }
            currentBucketIndex.addAndGet(bucketsToAdvance);
            lastResetTime = now;
        }
    }
}

The lock-free version is not a drop-in replacement. The bucket advancement has a race condition: two threads might both advance the window and reset buckets concurrently. For rate limiting, this is acceptable. The window might be slightly wider than configured. For strict correctness, you would need a CAS loop on the lastResetTime update. But for rate limiting, approximate correctness at 10x throughput is the right trade-off.

The Decision Tree

Is the shared state a single counter? Use LongAdder (write-heavy) or AtomicLong (need exact read-after-write).
Is the access pattern read-heavy (10:1 or higher)? Use StampedLock with optimistic reads.
Do you need fair ordering? Use ReentrantLock(true). Accept 10-15% throughput reduction for guaranteed FIFO.
Is the critical section under 100ns? Let the JIT handle lock coarsening. Use synchronized for clarity if thread count stays under 4.
Is the critical section over 1μs? Split the lock or move work outside the critical section. No lock primitive saves you from a long hold time.
Can you eliminate the lock entirely? Lock-free structures (ConcurrentHashMap, LongAdder, atomic variables) beat every lock under high contention.

Thread count under 4 is where synchronized is acceptable. Above 4 concurrent threads on the same lock, switch to ReentrantLock at minimum. Above 8 threads with read-heavy access, StampedLock is the only reasonable choice.