Cursor Leaks and the Silent Memory Catastrophe

The Symptom

The telemetry service’s heap usage climbs steadily over 48 hours. It starts at 800 MB after a restart and reaches 1.8 GB before the next scheduled restart. The heap dump shows thousands of com.mongodb.internal.connection.DefaultServerConnection instances that should have been collected. Simultaneously, db.serverStatus().metrics.cursor.open.total on the MongoDB server shows 2,400 open cursors, a number that should be near zero.
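
The open-cursor counter can also be read programmatically and fed into an alert. The following is a minimal sketch rather than the service's actual monitoring code: OpenCursorMetric and openCursorTotal are hypothetical names, and the helper assumes the serverStatus result has already been fetched (for example with MongoDatabase.runCommand) and is passed in as a nested map, which works because org.bson.Document implements Map<String, Object>.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper: extracts metrics.cursor.open.total from a serverStatus result.
final class OpenCursorMetric {

    /** Walks metrics -> cursor -> open -> total; empty if any level is missing. */
    static OptionalLong openCursorTotal(Map<String, ?> serverStatus) {
        Object node = serverStatus;
        for (String key : new String[] {"metrics", "cursor", "open", "total"}) {
            if (!(node instanceof Map)) {
                return OptionalLong.empty();
            }
            node = ((Map<?, ?>) node).get(key);
        }
        // The server reports the counter as a number; accept any numeric type.
        return node instanceof Number
                ? OptionalLong.of(((Number) node).longValue())
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Stand-in for a real serverStatus document.
        Map<String, Object> status = Map.of(
                "metrics", Map.of("cursor", Map.of("open", Map.of("total", 2400L))));
        System.out.println(openCursorTotal(status));
    }
}
```

Returning an OptionalLong instead of throwing keeps a monitoring loop alive if the server omits the field or the command result has an unexpected shape.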

The Cause

The service iterates over query results using MongoCursor but does not close the cursor on every code path. When an exception is thrown during iteration, or when the method returns early, the cursor is abandoned. The client-side cursor object holds references to network buffers and decoded documents. The server-side cursor holds the query's execution state and keeps consuming server memory until it is closed or times out.

// SLOW: Cursor leak on exception
public List<TelemetryReading> getRecentReadings(String sensorId) {
    MongoCursor<Document> cursor = collection.find(
        Filters.eq("sensorId", sensorId)
    ).sort(Sorts.descending("ts"))
     .limit(1000)
     .iterator();

    List<TelemetryReading> results = new ArrayList<>();
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        // If this throws, cursor is never closed
        TelemetryReading reading = mapToReading(doc);
        results.add(reading);

        if (results.size() >= 100) {
            return results;  // Cursor leaked: close() never called
        }
    }
    cursor.close();  // Only reached if loop completes
    return results;
}

This method has two cursor leak paths: the exception path and the early-return path. Each leaked cursor holds approximately 64 KB of client-side buffers (about one batch of decoded documents in this workload) plus server-side resources.

At 100 requests per second, if 1% of requests leak a cursor, that is 1 leaked cursor per second, or 86,400 after 24 hours. At 64 KB each on the client, that is 5,400 MB, about 5.3 GB of leaked memory. The JVM does not collect them because the MongoCursor implementation holds references to internal driver objects that are themselves referenced by the connection pool’s lifecycle management.
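
The estimate in the paragraph above can be reproduced in a few lines of Java. This is a back-of-the-envelope sketch: CursorLeakEstimate and its methods are hypothetical names, and the 64 KB per-cursor footprint is the approximation from the text, not a measured constant.

```java
// Hypothetical helper: projects client-side memory retained by leaked cursors.
final class CursorLeakEstimate {

    /** Cursors leaked over a period, given a request rate and a leak fraction. */
    static long leakedCursors(double requestsPerSecond, double leakFraction, long seconds) {
        return Math.round(requestsPerSecond * leakFraction * seconds);
    }

    /** Bytes retained, assuming a fixed per-cursor footprint (an estimate). */
    static long retainedBytes(long cursors, long bytesPerCursor) {
        return cursors * bytesPerCursor;
    }

    public static void main(String[] args) {
        long cursors = leakedCursors(100, 0.01, 24 * 60 * 60); // 100 req/s, 1% leak, 24 h
        long bytes = retainedBytes(cursors, 64 * 1024);        // 64 KB per cursor
        System.out.printf("%d leaked cursors, %.2f GiB retained%n",
                cursors, bytes / (1024.0 * 1024 * 1024));
    }
}
```

Plugging in the service's real request rate and an observed leak fraction gives a rough time-to-exhaustion for a given heap size.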

The Benchmark

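import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.openjdk.jmh.annotations.*;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3, time = 5)
@Measurement(iterations = 5, time = 10)
@Fork(1)
@State(Scope.Benchmark)
public class CursorManagementBenchmark {

    private MongoClient client;
    private MongoCollection<Document> collection;

    @Setup
    public void setup() {
        // Assumes a local mongod with a populated telemetry.readings collection
        client = MongoClients.create("mongodb://localhost:27017");
        collection = client.getDatabase("telemetry").getCollection("readings");
    }

    @TearDown
    public void tearDown() {
        // Release the connection pool when the trial ends
        client.close();
    }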

    @Benchmark
    public List<Document> unsafeCursor() {
        MongoCursor<Document> cursor = collection.find(
            Filters.eq("sensorId", "sensor-00001")
        ).limit(100).iterator();

        List<Document> results = new ArrayList<>();
        while (cursor.hasNext()) {
            results.add(cursor.next());
        }
        cursor.close();
        return results;
    }

    @Benchmark
    public List<Document> tryWithResources() {
        try (MongoCursor<Document> cursor = collection.find(
            Filters.eq("sensorId", "sensor-00001")
        ).limit(100).iterator()) {
            List<Document> results = new ArrayList<>();
            while (cursor.hasNext()) {
                results.add(cursor.next());
            }
            return results;
        }
    }

    @Benchmark
    public List<Document> intoMethod() {
        return collection.find(
            Filters.eq("sensorId", "sensor-00001")
        ).limit(100).into(new ArrayList<>());
    }
}

Results:

Benchmark                                   Mode  Cnt    Score   Error  Units
CursorManagementBenchmark.unsafeCursor      avgt    5  312.000 ± 8.000  us/op
CursorManagementBenchmark.tryWithResources  avgt    5  315.000 ± 7.000  us/op
CursorManagementBenchmark.intoMethod        avgt    5  308.000 ± 6.000  us/op

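The three variants are statistically indistinguishable: the differences between them fall within the reported error margins. The into() method and try-with-resources add no measurable overhead, so there is no performance reason to use the unsafe pattern.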

The Fix

Three safe patterns, in order of preference:

Pattern 1: Use into() for bounded results.

// FAST: into() handles cursor lifecycle automatically
List<Document> results = collection.find(
    Filters.eq("sensorId", sensorId)
).sort(Sorts.descending("ts"))
 .limit(100)
 .into(new ArrayList<>());

into() creates the cursor, iterates to completion, closes the cursor, and returns the results. It is safe against exceptions. Use this when you want all results in memory.

Pattern 2: Try-with-resources for streaming.

// FAST: try-with-resources guarantees cursor close
try (MongoCursor<Document> cursor = collection.find(
    Filters.eq("sensorId", sensorId)
).sort(Sorts.descending("ts"))
 .batchSize(100)
 .iterator()) {

    while (cursor.hasNext()) {
        Document doc = cursor.next();
        processReading(doc);
    }
}

The cursor is closed when the try block exits, whether by normal completion, early return, or exception.
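
The guarantee can be demonstrated without a database. The sketch below uses a hypothetical TrackingCursor that stands in for MongoCursor and counts how many instances are open; the class names and the negative-value failure are illustrative assumptions, not driver behavior. It exercises the early-return and exception paths that leaked in the original method.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class CloseOnEveryPath {

    // Returns early after `max` items; the cursor is still closed.
    static List<Integer> firstN(List<Integer> items, int max) {
        try (TrackingCursor cursor = new TrackingCursor(items)) {
            List<Integer> out = new ArrayList<>();
            while (cursor.hasNext()) {
                out.add(cursor.next());
                if (out.size() >= max) {
                    return out; // close() runs before the method returns
                }
            }
            return out;
        }
    }

    // Throws on a negative item; the cursor is still closed.
    static int sumNonNegative(List<Integer> items) {
        try (TrackingCursor cursor = new TrackingCursor(items)) {
            int sum = 0;
            while (cursor.hasNext()) {
                int value = cursor.next();
                if (value < 0) {
                    throw new IllegalArgumentException("negative reading: " + value);
                }
                sum += value;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstN(List.of(1, 2, 3, 4), 2));
        try {
            sumNonNegative(List.of(1, -1, 2));
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("open cursors: " + TrackingCursor.OPEN.get());
    }
}

// Hypothetical stand-in for a driver cursor: counts how many are currently open.
final class TrackingCursor implements Iterator<Integer>, AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();
    private final Iterator<Integer> source;

    TrackingCursor(List<Integer> items) {
        this.source = items.iterator();
        OPEN.incrementAndGet();
    }

    @Override public boolean hasNext() { return source.hasNext(); }
    @Override public Integer next() { return source.next(); }
    @Override public void close() { OPEN.decrementAndGet(); }
}
```

A counter like OPEN is also a cheap assertion for unit tests around real data-access code: wrap the cursor factory, run the method under test, and check that the count returns to zero.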

Pattern 3: forEach for side-effect processing.

// FAST: forEach handles cursor lifecycle
collection.find(Filters.eq("sensorId", sensorId))
    .sort(Sorts.descending("ts"))
    .limit(1000)
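    .forEach(doc -> processReading(doc));

In current 4.x drivers, forEach() opens the cursor, passes each document to the callback, and closes the cursor when iteration finishes or the callback throws. Use it when each document is handled for its side effects and no result list is needed.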

The Proof

After fixing all cursor management to use into() and try-with-resources:

Metric                  Before (leaked cursors)   After (safe patterns)
Open cursors (server)   2,400 after 24h           3-8 at any time
Client heap after 24h   1.8 GB (climbing)         820 MB (stable)
Full GC frequency       Every 4 hours             None (G1GC mixed only)
Latency impact of GC    400 ms full GC pauses     15 ms mixed GC pauses

The Trade-off

The into() method loads all results into memory at once. For queries that return millions of documents, this is not viable. Use try-with-resources with batchSize() for large result sets, processing documents in batches of 100-1,000 and allowing the cursor to fetch the next batch from the server. This trades memory for network round trips but keeps the heap bounded.
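
The chunked processing described above can be sketched independently of the driver. BatchedProcessing and processInChunks are hypothetical names; the helper accepts any Iterator, and the assumption is that in the service the source would be a MongoCursor (which implements Iterator) opened in a try-with-resources block, with batchSize() set to a similar value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: drains an iterator in bounded chunks.
final class BatchedProcessing {

    /**
     * Hands the handler chunks of at most chunkSize items, so only one chunk
     * is buffered at a time. Returns the number of items processed.
     */
    static <T> long processInChunks(Iterator<T> source, int chunkSize, Consumer<List<T>> handler) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long processed = 0;
        List<T> chunk = new ArrayList<>(chunkSize);
        while (source.hasNext()) {
            chunk.add(source.next());
            if (chunk.size() == chunkSize) {
                handler.accept(chunk);
                processed += chunk.size();
                chunk = new ArrayList<>(chunkSize); // fresh list, so the handler may keep the old one
            }
        }
        if (!chunk.isEmpty()) { // trailing partial chunk
            handler.accept(chunk);
            processed += chunk.size();
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Integer> readings = List.of(1, 2, 3, 4, 5, 6, 7);
        long n = processInChunks(readings.iterator(), 3,
                chunk -> System.out.println("chunk: " + chunk));
        System.out.println("processed " + n);
    }
}
```

The helper deliberately does not close the source: ownership of the cursor stays with the caller's try-with-resources block, which keeps the close guarantee in one place.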

An idle server-side cursor times out after 10 minutes by default (the cursorTimeoutMillis server parameter), so even if the client leaks a cursor, the server eventually cleans it up. But 10 minutes of leaked server resources across hundreds of concurrent connections adds up, and the server timeout does nothing for the client-side heap. The fix is always on the client side.