Cursor Leaks and the Silent Memory Catastrophe
The Symptom
The telemetry service’s heap usage climbs steadily over 48 hours. It starts at 800 MB after a restart and reaches 1.8 GB before the next scheduled restart. The heap dump shows thousands of com.mongodb.internal.connection.DefaultServerConnection instances that should have been collected. Simultaneously, db.serverStatus().metrics.cursor.open.total on the MongoDB server shows 2,400 open cursors, a number that should be near zero.
The Cause
The service iterates over query results using MongoCursor but does not close the cursor in all code paths. When an exception is thrown during iteration, or when the method returns early, the cursor is abandoned. The client-side cursor object holds references to network buffers and decoded documents. The server-side cursor keeps the query execution context, and the server memory behind it, alive until the cursor is killed or times out.
// SLOW: Cursor leak on exception
public List<TelemetryReading> getRecentReadings(String sensorId) {
    MongoCursor<Document> cursor = collection.find(
            Filters.eq("sensorId", sensorId)
        ).sort(Sorts.descending("ts"))
        .limit(1000)
        .iterator();
    List<TelemetryReading> results = new ArrayList<>();
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        // If this throws, cursor is never closed
        TelemetryReading reading = mapToReading(doc);
        results.add(reading);
        if (results.size() >= 100) {
            return results; // Cursor leaked: close() never called
        }
    }
    cursor.close(); // Only reached if loop completes
    return results;
}
This method has two cursor leak paths: the exception path and the early return path. Each leaked cursor holds approximately 64 KB of client-side buffers (the default batch size worth of documents) plus server-side resources.
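The two leak paths can be reproduced without a database. The sketch below is illustrative only: CountingCursor is a hypothetical stand-in for the driver's cursor that does nothing except count how many instances are still open, and leakyRead copies the control flow of the method above, with an empty string standing in for a document that fails to map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a driver cursor; it only tracks how many cursors are open. */
final class CountingCursor implements Iterator<String>, AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();
    private final Iterator<String> source;

    CountingCursor(List<String> docs) {
        this.source = docs.iterator();
        OPEN.incrementAndGet();
    }

    @Override public boolean hasNext() { return source.hasNext(); }
    @Override public String next() { return source.next(); }
    @Override public void close() { OPEN.decrementAndGet(); }
}

final class LeakPathsDemo {
    /** Same shape as the leaky method: close() runs only if the loop completes. */
    static List<String> leakyRead(List<String> docs, int cap) {
        CountingCursor cursor = new CountingCursor(docs);
        List<String> results = new ArrayList<>();
        while (cursor.hasNext()) {
            String doc = cursor.next();
            if (doc.isEmpty()) {
                // Leak path 1: an exception skips close()
                throw new IllegalStateException("mapping failed");
            }
            results.add(doc);
            if (results.size() >= cap) {
                return results; // Leak path 2: early return skips close()
            }
        }
        cursor.close();
        return results;
    }

    public static void main(String[] args) {
        leakyRead(Arrays.asList("a", "b"), 10);       // loop completes
        System.out.println(CountingCursor.OPEN.get()); // prints 0
        leakyRead(Arrays.asList("a", "b", "c"), 2);   // early return
        System.out.println(CountingCursor.OPEN.get()); // prints 1
        try {
            leakyRead(Arrays.asList("a", ""), 10);    // exception mid-iteration
        } catch (IllegalStateException expected) {
            // swallowed for the demo
        }
        System.out.println(CountingCursor.OPEN.get()); // prints 2
    }
}
```

Only the first call leaves the counter at zero. Each of the other two calls leaves one cursor open, and nothing in the calling code can reach it to close it.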
At 100 requests per second, a 1% leak rate means one leaked cursor per second, or 86,400 leaked cursors after 24 hours. At 64 KB each, that is roughly 5.5 GB of client-side memory. The JVM does not collect them because the MongoCursor implementation holds references to internal driver objects that are themselves referenced by the connection pool’s lifecycle management.
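The estimate is simple enough to keep as a small calculator for trying other traffic levels. This is a sketch under the same assumptions as the paragraph above: a constant leak rate and a fixed 64 KB footprint per cursor, taken here as 64,000 bytes. The class and method names are made up for illustration.

```java
import java.util.Locale;

final class LeakEstimate {
    /** Bytes leaked after `seconds`, assuming a constant leak rate and a fixed per-cursor footprint. */
    static long leakedBytes(double requestsPerSecond, double leakFraction,
                            long seconds, long bytesPerCursor) {
        long leakedCursors = Math.round(requestsPerSecond * leakFraction * seconds);
        return leakedCursors * bytesPerCursor;
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60 * 60; // 86,400 seconds
        long bytes = leakedBytes(100, 0.01, oneDay, 64_000);
        // 86,400 cursors * 64,000 bytes = 5,529,600,000 bytes
        System.out.println(String.format(Locale.ROOT, "%.2f GB", bytes / 1e9)); // prints 5.53 GB
    }
}
```

Halving either the request rate or the leak fraction halves the result, so even a 0.1% leak rate at this traffic still loses about half a gigabyte per day.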
The Benchmark
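import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.openjdk.jmh.annotations.*;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3, time = 5)
@Measurement(iterations = 5, time = 10)
@Fork(1)
@State(Scope.Benchmark)
public class CursorManagementBenchmark {
    private MongoClient client;
    private MongoCollection<Document> collection;

    @Setup
    public void setup() {
        client = MongoClients.create("mongodb://localhost:27017");
        collection = client.getDatabase("telemetry").getCollection("readings");
    }

    @TearDown
    public void tearDown() {
        // Release the connection pool when the benchmark finishes
        client.close();
    }

    // Manual close(): only safe on the happy path measured here
    @Benchmark
    public List<Document> unsafeCursor() {
        MongoCursor<Document> cursor = collection.find(
                Filters.eq("sensorId", "sensor-00001")
            ).limit(100).iterator();
        List<Document> results = new ArrayList<>();
        while (cursor.hasNext()) {
            results.add(cursor.next());
        }
        cursor.close();
        return results;
    }

    // Cursor closed automatically when the try block exits
    @Benchmark
    public List<Document> tryWithResources() {
        try (MongoCursor<Document> cursor = collection.find(
                Filters.eq("sensorId", "sensor-00001")
            ).limit(100).iterator()) {
            List<Document> results = new ArrayList<>();
            while (cursor.hasNext()) {
                results.add(cursor.next());
            }
            return results;
        }
    }

    // Driver manages the cursor internally
    @Benchmark
    public List<Document> intoMethod() {
        return collection.find(
                Filters.eq("sensorId", "sensor-00001")
            ).limit(100).into(new ArrayList<>());
    }
}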
Results:
Benchmark Mode Cnt Score Error Units
CursorManagementBenchmark.unsafeCursor avgt 5 312.000 ± 8.000 us/op
CursorManagementBenchmark.tryWithResources avgt 5 315.000 ± 7.000 us/op
CursorManagementBenchmark.intoMethod avgt 5 308.000 ± 6.000 us/op
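Performance is statistically indistinguishable: the three scores differ by at most 7 µs, which is within the reported error margins. The into() method and try-with-resources add no measurable overhead, so there is no performance argument for the unsafe pattern.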
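A quick way to sanity-check that reading of the results is to test whether the score ± error intervals overlap. The helper below is a rough heuristic and not a formal significance test, and its class and method names are made up for illustration. The numbers are copied from the results above.

```java
final class IntervalCheck {
    /** True if the intervals [a - errA, a + errA] and [b - errB, b + errB] intersect. */
    static boolean overlaps(double a, double errA, double b, double errB) {
        return Math.abs(a - b) <= errA + errB;
    }

    public static void main(String[] args) {
        // Scores and errors in us/op, taken from the benchmark results
        System.out.println(overlaps(312, 8, 315, 7)); // unsafeCursor vs tryWithResources: prints true
        System.out.println(overlaps(312, 8, 308, 6)); // unsafeCursor vs intoMethod: prints true
        System.out.println(overlaps(315, 7, 308, 6)); // tryWithResources vs intoMethod: prints true
    }
}
```

All three pairs overlap, so this run gives no basis for ranking the patterns by speed.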
The Fix
Three safe patterns, in order of preference:
Pattern 1: Use into() for bounded results.
// FAST: into() handles cursor lifecycle automatically
List<Document> results = collection.find(
        Filters.eq("sensorId", sensorId)
    ).sort(Sorts.descending("ts"))
    .limit(100)
    .into(new ArrayList<>());
into() creates the cursor, iterates to completion, closes the cursor, and returns the results. It is safe against exceptions. Use this when you want all results in memory.
Pattern 2: Try-with-resources for streaming.
// FAST: try-with-resources guarantees cursor close
try (MongoCursor<Document> cursor = collection.find(
        Filters.eq("sensorId", sensorId)
    ).sort(Sorts.descending("ts"))
    .batchSize(100)
    .iterator()) {
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        processReading(doc);
    }
}
The cursor is closed when the try block exits, whether by normal completion, early return, or exception.
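All three exit paths can be checked with a stand-in cursor, no database required. In this sketch TrackedCursor is a hypothetical class that only counts open instances, and an empty string plays the role of a document that fails to map. Neither is part of the driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a driver cursor; it only tracks how many cursors are open. */
final class TrackedCursor implements Iterator<String>, AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();
    private final Iterator<String> source;

    TrackedCursor(List<String> docs) {
        this.source = docs.iterator();
        OPEN.incrementAndGet();
    }

    @Override public boolean hasNext() { return source.hasNext(); }
    @Override public String next() { return source.next(); }
    @Override public void close() { OPEN.decrementAndGet(); }
}

final class SafeReadDemo {
    /** close() runs on every exit from the try block: completion, early return, or exception. */
    static List<String> safeRead(List<String> docs, int cap) {
        try (TrackedCursor cursor = new TrackedCursor(docs)) {
            List<String> results = new ArrayList<>();
            while (cursor.hasNext()) {
                String doc = cursor.next();
                if (doc.isEmpty()) {
                    throw new IllegalStateException("mapping failed");
                }
                results.add(doc);
                if (results.size() >= cap) {
                    return results; // early return still closes the cursor
                }
            }
            return results;
        }
    }

    public static void main(String[] args) {
        safeRead(Arrays.asList("a", "b"), 10);      // normal completion
        safeRead(Arrays.asList("a", "b", "c"), 2);  // early return
        try {
            safeRead(Arrays.asList("a", ""), 10);   // exception mid-iteration
        } catch (IllegalStateException expected) {
            // swallowed for the demo
        }
        System.out.println(TrackedCursor.OPEN.get()); // prints 0
    }
}
```

The counter returns to zero after every call because the compiler generates the close() call for each exit path, which is exactly what the hand-written version forgot to do.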
Pattern 3: forEach for side-effect processing.
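// FAST: forEach handles cursor lifecycle
collection.find(Filters.eq("sensorId", sensorId))
    .sort(Sorts.descending("ts"))
    .limit(1000)
    .forEach(doc -> processReading(doc));
forEach() opens the cursor, passes each document to the consumer, and closes the cursor when iteration finishes or the consumer throws. Use this when each document is processed for its side effects and no result list is needed.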
The Proof
After fixing all cursor management to use into() and try-with-resources:
| Metric | Before (leaked cursors) | After (safe patterns) |
|---|---|---|
| Open cursors (server) | 2,400 after 24h | 3-8 at any time |
| Client heap after 24h | 1.8 GB (climbing) | 820 MB (stable) |
| Full GC frequency | Every 4 hours | None (G1GC mixed only) |
| Latency impact of GC | 400ms full GC pauses | 15ms mixed GC pauses |
The Trade-off
The into() method loads all results into memory at once. For queries that return millions of documents, this is not viable. Use try-with-resources with batchSize() for large result sets, processing documents in batches of 100-1,000 and allowing the cursor to fetch the next batch from the server. This trades memory for network round trips but keeps the heap bounded.
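On the application side, bounded processing can be as simple as draining the cursor into fixed-size chunks and handing each chunk off before reading further. The helper below is a sketch, not driver API: BatchProcessor and processInBatches are hypothetical names, and it accepts any Iterator so it can be tried without a database. With the real driver, the iterator would be the MongoCursor opened inside a try-with-resources block. Note that batchSize() on the query controls how many documents each network round trip fetches, which is independent of the chunk size used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class BatchProcessor {
    /**
     * Drains the iterator in chunks of at most batchSize elements, so no more
     * than one chunk is held by this method at a time. Returns the number of
     * chunks handed to the handler.
     */
    static <T> int processInBatches(Iterator<T> source, int batchSize,
                                    Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch = new ArrayList<>(batchSize); // drop the reference to the old chunk
                batches++;
            }
        }
        if (!batch.isEmpty()) { // final partial chunk
            handler.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        int batches = processInBatches(docs.iterator(), 4,
                chunk -> System.out.println(chunk.size())); // prints 4, 4, 2
        System.out.println(batches); // prints 3
    }
}
```

Memory stays bounded only if the handler does not retain the chunks it receives. A handler that appends every chunk to a long-lived list recreates the problem into() has with large result sets.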
Server-side cursors have a default idle timeout of 10 minutes (cursorTimeoutMillis), so the server eventually cleans up a cursor the client has leaked. But 10 minutes of leaked server resources across hundreds of concurrent connections adds up, and the timeout does nothing for the memory held on the client. The fix is always on the client side.