BSON Type Optimization: Measuring Storage and Network Impact

The Symptom

The telemetry collection stores 200 million documents and uses 180 GB of storage. The data team estimates that 200 million sensor readings at an average of 8 fields per reading should consume approximately 100 GB. The 80 GB overhead is somewhere in the type choices.

Running db.readings.stats() shows:

{
  count: 200000000,
  avgObjSize: 920,   // bytes per document
  storageSize: 183600000000,
  totalIndexSize: 48200000000
}

920 bytes per document for 8 fields of sensor data is inflated. A compact representation should be closer to 200-300 bytes.
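
The figures above can be cross-checked with plain arithmetic. The following is a minimal sketch, not part of the original tooling: the class and method names are hypothetical, and decimal gigabytes (10^9 bytes) are assumed.

```java
// Hypothetical helper: sanity-check the collection stats with plain arithmetic.
final class StorageEstimate {

    /** Uncompressed data size implied by the stats: count * avgObjSize, in bytes. */
    static long impliedDataBytes(long count, long avgObjSize) {
        return count * avgObjSize;
    }

    /** Average bytes per document that a target total size would allow. */
    static long bytesPerDocument(long targetBytes, long count) {
        return targetBytes / count;
    }

    public static void main(String[] args) {
        long count = 200_000_000L;          // from db.readings.stats()
        long avgObjSize = 920L;             // bytes, from db.readings.stats()
        long estimate = 100_000_000_000L;   // the 100 GB estimate (decimal GB assumed)

        long implied = impliedDataBytes(count, avgObjSize);
        System.out.println("Implied data size: " + implied / 1_000_000_000L + " GB");
        System.out.println("Budget per document at 100 GB: "
            + bytesPerDocument(estimate, count) + " bytes");
    }
}
```

With these inputs the implied data size is 184 GB, in line with the reported storageSize, and the 100 GB estimate corresponds to a budget of 500 bytes per document.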

The Cause

Examining a sample document reveals the problem. The byte counts are for the encoded value only (a 4-byte length prefix, the UTF-8 bytes, and a trailing NUL) and exclude the type byte and field name:

{
  "_id": "550e8400-e29b-41d4-a716-446655440000",    // UUID as string: 41 bytes
  "sensorId": "sensor-00042",                        // Fine: 17 bytes
  "timestamp": "2024-01-15T10:30:00.000Z",           // ISO string: 29 bytes
  "temperature": "23.5",                              // Number as string: 9 bytes
  "humidity": "65.2",                                 // Number as string: 9 bytes
  "pressure": "1013.25",                              // Number as string: 12 bytes
  "isActive": "true",                                 // Boolean as string: 9 bytes
  "tags": ["indoor", "floor-3", "zone-a"],            // Fine
  "metadata": {
    "firmwareVersion": "2.1.0",
    "lastCalibration": "2024-01-10T00:00:00.000Z",   // ISO string: 29 bytes
    "batteryLevel": "87"                              // Number as string: 7 bytes
  }
}

Four kinds of type mistake affect eight fields. The UUID is stored as a string instead of BinData (or replaced by an ObjectId). Timestamps are stored as strings instead of Date. Numbers are stored as strings instead of Double or Int32. The boolean is stored as a string instead of Boolean.
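
The size difference between a string and a native value follows from the BSON layout. The sketch below is illustrative only: the class and constant names are hypothetical, and the sizes cover the value alone, since the type byte and field name cost the same in both schemas.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sizes of BSON element values as laid out in the BSON spec.
// The one-byte type tag and the field name are excluded.
final class BsonValueSizes {
    static final int DOUBLE = 8;                // 64-bit IEEE 754
    static final int DATE = 8;                  // int64 milliseconds since the Unix epoch
    static final int INT32 = 4;
    static final int BOOLEAN = 1;
    static final int OBJECT_ID = 12;
    static final int UUID_BINARY = 4 + 1 + 16;  // int32 length + subtype byte + 16 payload bytes

    /** A BSON string value: int32 length prefix + UTF-8 bytes + trailing NUL. */
    static int stringValue(String s) {
        return 4 + s.getBytes(StandardCharsets.UTF_8).length + 1;
    }

    public static void main(String[] args) {
        System.out.println("timestamp: string " + stringValue("2024-01-15T10:30:00.000Z")
            + " bytes, Date " + DATE + " bytes");
        System.out.println("isActive:  string " + stringValue("true")
            + " bytes, Boolean " + BOOLEAN + " byte");
        System.out.println("_id:       string "
            + stringValue("550e8400-e29b-41d4-a716-446655440000")
            + " bytes, binary UUID " + UUID_BINARY + " bytes");
    }
}
```

The saving per field is largest for timestamps and booleans, and smallest for short numeric strings such as "23.5", where a Double is only one byte smaller.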

The Benchmark

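import java.nio.ByteBuffer;
import java.util.Date;
import java.util.concurrent.TimeUnit;

import org.bson.BsonBinaryReader;
import org.bson.BsonBinaryWriter;
import org.bson.Document;
import org.bson.codecs.DecoderContext;
import org.bson.codecs.DocumentCodec;
import org.bson.codecs.EncoderContext;
import org.bson.io.BasicOutputBuffer;
import org.bson.types.ObjectId;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 5)
@Measurement(iterations = 5, time = 10)
@Fork(1)
@State(Scope.Benchmark)
public class BsonTypeBenchmark {

    // One possible implementation of the encode/decode helpers, using the
    // driver's DocumentCodec with default settings.
    private static final DocumentCodec CODEC = new DocumentCodec();

    private byte[] stringDoc;
    private byte[] optimizedDoc;

    // Encode a Document to raw BSON bytes.
    private static byte[] toBson(Document doc) {
        BasicOutputBuffer buffer = new BasicOutputBuffer();
        try (BsonBinaryWriter writer = new BsonBinaryWriter(buffer)) {
            CODEC.encode(writer, doc, EncoderContext.builder().build());
        }
        return buffer.toByteArray();
    }

    // Decode raw BSON bytes back into a Document.
    private static Document fromBson(byte[] bytes) {
        try (BsonBinaryReader reader = new BsonBinaryReader(ByteBuffer.wrap(bytes))) {
            return CODEC.decode(reader, DecoderContext.builder().build());
        }
    }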
    @Setup
    public void setup() {
        // Document with string types (the "before" schema)
        Document strDoc = new Document()
            .append("_id", "550e8400-e29b-41d4-a716-446655440000")
            .append("sensorId", "sensor-00042")
            .append("timestamp", "2024-01-15T10:30:00.000Z")
            .append("temperature", "23.5")
            .append("humidity", "65.2")
            .append("pressure", "1013.25")
            .append("isActive", "true");
        stringDoc = toBson(strDoc);

        // Document with optimized types (the "after" schema)
        Document optDoc = new Document()
            .append("_id", new ObjectId())
            .append("sensorId", "sensor-00042")
            .append("timestamp", new Date())
            .append("temperature", 23.5)
            .append("humidity", 65.2)
            .append("pressure", 1013.25)
            .append("isActive", true);
        optimizedDoc = toBson(optDoc);
    }

    @Benchmark
    public Document deserializeStringTypes() {
        return fromBson(stringDoc);
    }

    @Benchmark
    public Document deserializeOptimizedTypes() {
        return fromBson(optimizedDoc);
    }
}

Results:

Benchmark                                        Mode  Cnt     Score    Error  Units
BsonTypeBenchmark.deserializeStringTypes          avgt    5  1850.000 ± 45.000  ns/op
BsonTypeBenchmark.deserializeOptimizedTypes       avgt    5   980.000 ± 28.000  ns/op

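Optimized types deserialize about 1.9x faster (1,850 ns versus 980 ns per document). Each string field requires UTF-8 decoding and a String allocation, while doubles, dates, and booleans are read as fixed-width values. The benchmark covers decoding only: an application that reads string-typed numbers and timestamps must still parse them before use, which adds further cost.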
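
The difference in decoding work can be illustrated with the JDK alone. The sketch below is a simplified, hypothetical reader for two BSON value layouts; it is not the driver's actual code path, and all names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of BSON value decoding; not the driver's implementation.
final class ValueDecoding {

    /** Double value: a single fixed-width little-endian read. */
    static double readDouble(ByteBuffer buf) {
        return buf.getDouble();
    }

    /** String value: read the length, copy the bytes, decode UTF-8 into a new String. */
    static String readString(ByteBuffer buf) {
        int lengthWithNul = buf.getInt();
        byte[] utf8 = new byte[lengthWithNul - 1];
        buf.get(utf8);
        buf.get(); // skip the trailing NUL
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Encodes a double the way BSON stores it: 8 bytes, little-endian. */
    static ByteBuffer encodeDouble(double value) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putDouble(value);
        buf.flip();
        return buf;
    }

    /** Encodes a string the way BSON stores it: int32 length, UTF-8 bytes, NUL. */
    static ByteBuffer encodeString(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + utf8.length + 1).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(utf8.length + 1).put(utf8).put((byte) 0);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        double direct = readDouble(encodeDouble(23.5));
        // A string-typed number needs a second step before it is usable as a number.
        double parsed = Double.parseDouble(readString(encodeString("23.5")));
        System.out.println(direct + " " + parsed);
    }
}
```

The string path allocates a byte array and a String for every field, and the application still has to call Double.parseDouble afterwards.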

The Fix

Migrate the schema with a bulk update. This is a one-time operation that can run during a maintenance window:

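import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonType;
import org.bson.Document;

import java.time.Instant;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// FAST: Schema migration to optimized BSON types
public void migrateSchemaTypes(MongoCollection<Document> collection) {
    int batchSize = 10000;
    List<WriteModel<Document>> writes = new ArrayList<>(batchSize);
    BulkWriteOptions unordered = new BulkWriteOptions().ordered(false);

    // Match only documents whose timestamp is still a string, so the
    // migration can be restarted without failing on converted documents.
    try (MongoCursor<Document> cursor = collection
        .find(Filters.type("timestamp", BsonType.STRING))
        .batchSize(batchSize)
        .iterator()) {

        while (cursor.hasNext()) {
            Document doc = cursor.next();

            Document fields = new Document()
                .append("timestamp", Date.from(Instant.parse(doc.getString("timestamp"))))
                .append("temperature", Double.parseDouble(doc.getString("temperature")))
                .append("humidity", Double.parseDouble(doc.getString("humidity")))
                .append("pressure", Double.parseDouble(doc.getString("pressure")))
                .append("isActive", Boolean.parseBoolean(doc.getString("isActive")));

            // Nested fields are updated through dotted paths. This assumes the
            // metadata sub-document has the shape shown in the sample document.
            Document metadata = doc.get("metadata", Document.class);
            if (metadata != null) {
                fields.append("metadata.lastCalibration",
                        Date.from(Instant.parse(metadata.getString("lastCalibration"))))
                    .append("metadata.batteryLevel",
                        Integer.parseInt(metadata.getString("batteryLevel")));
            }

            writes.add(new UpdateOneModel<>(
                Filters.eq("_id", doc.get("_id")),
                new Document("$set", fields)
            ));

            if (writes.size() >= batchSize) {
                collection.bulkWrite(writes, unordered);
                writes.clear();
            }
        }

        // Flush the final partial batch.
        if (!writes.isEmpty()) {
            collection.bulkWrite(writes, unordered);
        }
    }
}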
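
One caveat when converting: Boolean.parseBoolean returns false for any text other than "true" (ignoring case), and Double.parseDouble accepts "NaN" and "Infinity", so malformed source values can be converted silently. A stricter set of converters might look like the following sketch; the class and method names are hypothetical, and whether to reject or log bad values is a choice the original text does not make.

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical strict converters for the migration; names are illustrative.
final class StrictConverters {

    /** Accepts only "true" or "false" (any case); everything else is rejected. */
    static boolean toBoolean(String raw) {
        if ("true".equalsIgnoreCase(raw)) {
            return true;
        }
        if ("false".equalsIgnoreCase(raw)) {
            return false;
        }
        throw new IllegalArgumentException("Not a boolean: " + raw);
    }

    /** Parses an ISO-8601 instant such as 2024-01-15T10:30:00.000Z into a Date. */
    static Date toDate(String raw) {
        return Date.from(Instant.parse(raw));
    }

    /** Parses a decimal string and rejects NaN and infinities. */
    static double toFiniteDouble(String raw) {
        double value = Double.parseDouble(raw);
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("Not a finite number: " + raw);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toBoolean("true"));
        System.out.println(toDate("2024-01-15T10:30:00.000Z").getTime());
        System.out.println(toFiniteDouble("1013.25"));
    }
}
```

Documents that fail conversion could be collected and reviewed separately rather than aborting the whole batch.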

For new documents, enforce correct types at the application layer:

// FAST: Correct BSON types from the start
Document reading = new Document()
    .append("_id", new ObjectId())
    .append("sensorId", sensorId)
    .append("timestamp", Date.from(Instant.now()))
    .append("temperature", temperature)      // double, not String
    .append("humidity", humidity)             // double, not String
    .append("pressure", pressure)            // double, not String
    .append("isActive", isActive);           // boolean, not String
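
One way to enforce types at the application layer is to route every reading through a typed value class, so that a string cannot be passed where a number or boolean is expected. The sketch below is an assumption about how this could be structured; SensorReading and toFieldMap are hypothetical names, and only the JDK is used.

```java
import java.time.Instant;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical value type; field names mirror the sample document.
final class SensorReading {
    final String sensorId;
    final Instant timestamp;
    final double temperature;
    final double humidity;
    final double pressure;
    final boolean isActive;

    SensorReading(String sensorId, Instant timestamp, double temperature,
                  double humidity, double pressure, boolean isActive) {
        this.sensorId = Objects.requireNonNull(sensorId, "sensorId");
        this.timestamp = Objects.requireNonNull(timestamp, "timestamp");
        this.temperature = temperature;
        this.humidity = humidity;
        this.pressure = pressure;
        this.isActive = isActive;
    }

    /** Field map holding native Java types, in the sample document's field order. */
    Map<String, Object> toFieldMap() {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("sensorId", sensorId);
        fields.put("timestamp", Date.from(timestamp));
        fields.put("temperature", temperature);
        fields.put("humidity", humidity);
        fields.put("pressure", pressure);
        fields.put("isActive", isActive);
        return fields;
    }
}
```

Because org.bson.Document has a constructor that takes a Map<String, Object>, the result of toFieldMap() could be passed to new Document(...) before the _id is appended.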

The Proof

After migrating 200 million documents:

Metric                         String types   Optimized types   Reduction
Avg document size              920 bytes      340 bytes         63%
Collection storage             180 GB         68 GB             62%
Index size (timestamp)         12.4 GB        4.8 GB            61%
Network per 100-doc query      92 KB          34 KB             63%
Deserialization time per doc   1,850 ns       980 ns            47%

The Trade-off

The migration requires reading and rewriting every document. For 200 million documents in batches of 10,000, that is 20,000 batches. At an average batch write time of 200 ms, the writes alone take approximately 4,000 seconds (67 minutes), not counting the time spent reading documents through the cursor. During migration, write operations to the collection contend with the bulk updates. Run the migration during low-traffic periods, use ordered(false) so the server may execute the updates in each batch in any order and continue past individual failures, and set a WriteConcern of w:1 during migration for speed (verify replication after completion).
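
The time estimate above is simple arithmetic and can be recomputed for other batch sizes or write latencies. A minimal sketch, with hypothetical class and method names:

```java
// Hypothetical helper: estimate bulk-write time for the migration.
final class MigrationEstimate {

    /** Number of batches needed, rounding the last partial batch up. */
    static long batches(long documents, int batchSize) {
        return (documents + batchSize - 1) / batchSize;
    }

    /** Total write time in seconds, given an average time per batch in milliseconds. */
    static long writeSeconds(long documents, int batchSize, long millisPerBatch) {
        return batches(documents, batchSize) * millisPerBatch / 1000;
    }

    public static void main(String[] args) {
        long seconds = writeSeconds(200_000_000L, 10_000, 200);
        System.out.println(batches(200_000_000L, 10_000) + " batches, "
            + seconds + " s (about " + Math.round(seconds / 60.0) + " min)");
    }
}
```

The figure covers bulk writes only, so it is a lower bound on the total run time.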

The _id field cannot be changed in place. Migrating from string UUIDs to ObjectId or BinData requires inserting new documents and deleting the old ones, which is a more invasive operation. For existing collections, keep the string _id and fix the other fields first. For new collections, use ObjectId, or BinData when the identifier must remain a UUID.
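
For collections that keep UUID identifiers, the 36-character text form can be reduced to its 16 raw bytes before storage. The sketch below uses only the JDK and writes the most significant bits first, which is the byte order of the standard UUID binary representation (subtype 4). The class and method names are hypothetical, and wrapping the bytes in the driver's binary type is omitted.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper: convert between UUID text and its 16-byte binary form.
final class UuidBytes {

    /** Parses UUID text and returns 16 bytes, most significant bits first. */
    static byte[] toBytes(String uuidText) {
        UUID uuid = UUID.fromString(uuidText);
        return ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
    }

    /** Rebuilds the canonical lower-case UUID text from 16 bytes. */
    static String fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong()).toString();
    }

    public static void main(String[] args) {
        String text = "550e8400-e29b-41d4-a716-446655440000";
        byte[] raw = toBytes(text);
        System.out.println(raw.length + " bytes, round trip: " + fromBytes(raw));
    }
}
```

The conversion is lossless, so the text form can still be produced for logs and APIs.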