BSON Type Optimization: Measuring Storage and Network Impact
The Symptom
The telemetry collection stores 200 million documents and uses 180 GB of storage. The data team estimates that 200 million sensor readings at an average of 8 fields per reading should consume approximately 100 GB. The 80 GB overhead is somewhere in the type choices.
Running db.readings.stats() shows:
{
  count: 200000000,
  avgObjSize: 920,            // bytes per document
  storageSize: 183600000000,
  totalIndexSize: 48200000000
}
920 bytes per document for 8 fields of sensor data is inflated. A compact representation should be closer to 200-300 bytes.
The Cause
Examining a sample document reveals the problem:
{
  "_id": "550e8400-e29b-41d4-a716-446655440000",   // UUID as string: 41 bytes (4-byte length + 36 chars + NUL)
  "sensorId": "sensor-00042",                      // Fine: 17 bytes
  "timestamp": "2024-01-15T10:30:00.000Z",         // ISO string: 29 bytes
  "temperature": "23.5",                           // Number as string: 9 bytes
  "humidity": "65.2",                              // Number as string: 9 bytes
  "pressure": "1013.25",                           // Number as string: 12 bytes
  "isActive": "true",                              // Boolean as string: 9 bytes
  "tags": ["indoor", "floor-3", "zone-a"],         // Fine
  "metadata": {
    "firmwareVersion": "2.1.0",
    "lastCalibration": "2024-01-10T00:00:00.000Z", // ISO string: 29 bytes
    "batteryLevel": "87"                           // Number as string: 7 bytes
  }
}
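Four kinds of type mistakes affect eight fields. The UUID is a 36-character string instead of BinData (16 bytes of payload) or an ObjectId (12 bytes). Timestamps are strings instead of Date. Numbers are strings instead of Double or Int32. The boolean is a string instead of Boolean. A BSON string always carries a 4-byte length prefix and a null terminator on top of its characters, whereas a Date or Double is a fixed 8 bytes, an Int32 is 4 bytes, and a Boolean is 1 byte.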
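To make the per-field cost concrete, the sketch below computes BSON value sizes from the element layouts in the BSON specification. The class name `BsonElementSize` and its methods are hypothetical helpers written for this illustration; they are not part of the MongoDB driver, and the sketch uses only the Java standard library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration: sizes of BSON values per the BSON spec layouts.
final class BsonElementSize {
    static final int DOUBLE_VALUE = 8;   // IEEE 754 double
    static final int DATE_VALUE = 8;     // int64 milliseconds since the Unix epoch
    static final int INT32_VALUE = 4;
    static final int BOOLEAN_VALUE = 1;

    private BsonElementSize() {}

    // Every element starts with a 1-byte type tag followed by the
    // field name as a NUL-terminated string.
    static int header(String fieldName) {
        return 1 + utf8Length(fieldName) + 1;
    }

    // String value: int32 length prefix + UTF-8 bytes + trailing NUL.
    static int stringValue(String value) {
        return 4 + utf8Length(value) + 1;
    }

    // Bytes saved on the value alone when a string is replaced by a fixed-width type.
    static int savedBytes(String stringValue, int nativeValueSize) {
        return stringValue(stringValue) - nativeValueSize;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, `BsonElementSize.savedBytes("2024-01-15T10:30:00.000Z", BsonElementSize.DATE_VALUE)` gives the saving for one timestamp field. The field-name header is identical in both schemas, so only the value size changes.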
The Benchmark
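import java.nio.ByteBuffer;
import java.util.Date;
import java.util.concurrent.TimeUnit;
import org.bson.BsonBinaryReader;
import org.bson.BsonBinaryWriter;
import org.bson.Document;
import org.bson.codecs.DecoderContext;
import org.bson.codecs.DocumentCodec;
import org.bson.codecs.EncoderContext;
import org.bson.io.BasicOutputBuffer;
import org.bson.types.ObjectId;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 5)
@Measurement(iterations = 5, time = 10)
@Fork(1)
@State(Scope.Benchmark)
public class BsonTypeBenchmark {
    private static final DocumentCodec CODEC = new DocumentCodec();
    private byte[] stringDoc;
    private byte[] optimizedDoc;

    @Setup
    public void setup() {
        // Document with string types (the "before" schema)
        Document strDoc = new Document()
            .append("_id", "550e8400-e29b-41d4-a716-446655440000")
            .append("sensorId", "sensor-00042")
            .append("timestamp", "2024-01-15T10:30:00.000Z")
            .append("temperature", "23.5")
            .append("humidity", "65.2")
            .append("pressure", "1013.25")
            .append("isActive", "true");
        stringDoc = toBson(strDoc);
        // Document with optimized types (the "after" schema)
        Document optDoc = new Document()
            .append("_id", new ObjectId())
            .append("sensorId", "sensor-00042")
            .append("timestamp", new Date())
            .append("temperature", 23.5)
            .append("humidity", 65.2)
            .append("pressure", 1013.25)
            .append("isActive", true);
        optimizedDoc = toBson(optDoc);
    }

    @Benchmark
    public Document deserializeStringTypes() {
        return fromBson(stringDoc);
    }

    @Benchmark
    public Document deserializeOptimizedTypes() {
        return fromBson(optimizedDoc);
    }

    // One possible implementation of the helper: encode with the driver's DocumentCodec.
    private static byte[] toBson(Document doc) {
        BasicOutputBuffer buffer = new BasicOutputBuffer();
        try (BsonBinaryWriter writer = new BsonBinaryWriter(buffer)) {
            CODEC.encode(writer, doc, EncoderContext.builder().build());
        }
        return buffer.toByteArray();
    }

    // One possible implementation of the helper: decode raw BSON bytes into a Document.
    private static Document fromBson(byte[] bytes) {
        try (BsonBinaryReader reader = new BsonBinaryReader(ByteBuffer.wrap(bytes))) {
            return CODEC.decode(reader, DecoderContext.builder().build());
        }
    }
}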
Results:
Benchmark Mode Cnt Score Error Units
BsonTypeBenchmark.deserializeStringTypes avgt 5 1850.000 ± 45.000 ns/op
BsonTypeBenchmark.deserializeOptimizedTypes avgt 5 980.000 ± 28.000 ns/op
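Optimized types deserialize about 1.9x faster (1,850 ns versus 980 ns per document). Each string field requires UTF-8 decoding and a String allocation, while native BSON numbers, dates, and booleans are read as fixed-width binary values. Note that the benchmark documents contain only the seven top-level scalar fields; tags and metadata are omitted, so the figures compare field types rather than full production documents.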
The Fix
Migrate the schema with a bulk update. This is a one-time operation that can run during a maintenance window:
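import java.time.Instant;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.bson.BsonType;
import org.bson.Document;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.WriteModel;

// FAST: Schema migration to optimized BSON types
public void migrateSchemaTypes(MongoCollection<Document> collection) {
    int batchSize = 10000;
    List<WriteModel<Document>> writes = new ArrayList<>(batchSize);
    // Match only documents whose timestamp is still a string, so a restarted
    // run skips documents that were already migrated.
    try (MongoCursor<Document> cursor = collection
            .find(Filters.type("timestamp", BsonType.STRING))
            .batchSize(batchSize)
            .iterator()) {
        while (cursor.hasNext()) {
            Document doc = cursor.next();
            Document fields = new Document()
                .append("timestamp", Date.from(Instant.parse(doc.getString("timestamp"))))
                .append("temperature", Double.parseDouble(doc.getString("temperature")))
                .append("humidity", Double.parseDouble(doc.getString("humidity")))
                .append("pressure", Double.parseDouble(doc.getString("pressure")))
                .append("isActive", Boolean.parseBoolean(doc.getString("isActive")));
            // Nested fields are addressed with dot notation in $set.
            // Assumes both metadata values are present as strings when metadata exists.
            Document metadata = doc.get("metadata", Document.class);
            if (metadata != null) {
                fields.append("metadata.lastCalibration",
                        Date.from(Instant.parse(metadata.getString("lastCalibration"))))
                    .append("metadata.batteryLevel",
                        Integer.parseInt(metadata.getString("batteryLevel")));
            }
            writes.add(new UpdateOneModel<>(
                Filters.eq("_id", doc.get("_id")),
                new Document("$set", fields)
            ));
            if (writes.size() >= batchSize) {
                collection.bulkWrite(writes, new BulkWriteOptions().ordered(false));
                writes.clear();
            }
        }
        if (!writes.isEmpty()) {
            collection.bulkWrite(writes, new BulkWriteOptions().ordered(false));
        }
    }
}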
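One caveat with the parsing calls in the migration: `Boolean.parseBoolean` returns false for any input other than "true" (ignoring case), so a value such as "yes" or "1" would silently migrate as false, and `Double.parseDouble` accepts "NaN" and "Infinity". A minimal sketch of stricter parsing follows, assuming malformed values should be skipped and reported instead of converted. The class `StrictParsers` is a hypothetical helper, not part of any library.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.OptionalDouble;

// Hypothetical helper for illustration: reject values the lenient JDK parsers would accept.
final class StrictParsers {
    private StrictParsers() {}

    // Accepts only "true" or "false" (case-insensitive, surrounding whitespace ignored).
    static Optional<Boolean> parseBoolean(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "true":
                return Optional.of(Boolean.TRUE);
            case "false":
                return Optional.of(Boolean.FALSE);
            default:
                return Optional.empty();
        }
    }

    // Accepts only finite numbers; null, garbage, NaN, and Infinity are rejected.
    static OptionalDouble parseFiniteDouble(String raw) {
        if (raw == null) {
            return OptionalDouble.empty();
        }
        try {
            double value = Double.parseDouble(raw.trim());
            return Double.isFinite(value) ? OptionalDouble.of(value) : OptionalDouble.empty();
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }
}
```

A migration loop could call these helpers and log the `_id` of any document that yields an empty result, leaving it untouched for manual review.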
For new documents, enforce correct types at the application layer:
// FAST: Correct BSON types from the start
Document reading = new Document()
    .append("_id", new ObjectId())
    .append("sensorId", sensorId)
    .append("timestamp", Date.from(Instant.now()))
    .append("temperature", temperature) // double, not String
    .append("humidity", humidity)       // double, not String
    .append("pressure", pressure)       // double, not String
    .append("isActive", isActive);      // boolean, not String
The Proof
After migrating 200 million documents:
| Metric | String types | Optimized types | Reduction |
|---|---|---|---|
| Avg document size | 920 bytes | 340 bytes | 63% |
| Collection storage | 180 GB | 68 GB | 62% |
| Index size (timestamp) | 12.4 GB | 4.8 GB | 61% |
| Network per 100-doc query | 92 KB | 34 KB | 63% |
| Deserialization time per doc | 1,850 ns | 980 ns | 47% |
The Trade-off
The migration requires reading and rewriting every document. For 200 million documents in batches of 10,000, that is 20,000 batches; at an average batch write time of 200 ms, the writes alone take approximately 4,000 seconds (67 minutes), before counting the time the cursor spends reading. During migration, regular write operations to the collection contend with the bulk updates. Run the migration during low-traffic periods, use ordered(false) so the server may execute the updates in any order and one failed update does not abort the rest of the batch, and set a WriteConcern of w:1 during migration for speed (verify replication after completion).
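The estimate above is simple arithmetic, and a small helper makes it easy to re-run with different batch sizes or measured latencies. This is a sketch only: `MigrationEstimate` is a hypothetical class, and it models write time alone, ignoring cursor reads, contention, and retries.

```java
import java.time.Duration;

// Hypothetical helper for illustration: back-of-envelope migration write time.
final class MigrationEstimate {
    private MigrationEstimate() {}

    // Number of bulk writes needed, rounding up for a final partial batch.
    static long batches(long documents, int batchSize) {
        return (documents + batchSize - 1) / batchSize;
    }

    // Total write time assuming batches run sequentially at a fixed average latency.
    static Duration writeTime(long documents, int batchSize, Duration perBatch) {
        return perBatch.multipliedBy(batches(documents, batchSize));
    }
}
```

With the figures from this section, `MigrationEstimate.writeTime(200_000_000L, 10_000, Duration.ofMillis(200))` yields a duration of 4,000 seconds.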
The _id field cannot be changed in place. Migrating from string UUIDs to ObjectId requires creating new documents and deleting old ones, which is a more invasive operation. For existing collections, keep the string _id and fix the other fields first. For new collections, always use ObjectId or BinData.
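For collections that need UUID identifiers, the saving from BinData comes from storing the 128-bit value as 16 raw bytes instead of 36 characters. The sketch below shows that conversion using only the Java standard library; `UuidBytes` is a hypothetical helper, and it assumes the big-endian layout (most significant bits first) used by the standard UUID binary subtype 4. In practice the MongoDB Java driver can encode `java.util.UUID` values itself when a UUID representation is configured, so this is for illustration of the size difference only.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration: convert between UUID text and 16 raw bytes.
final class UuidBytes {
    private UuidBytes() {}

    // Parses the 36-character text form and packs it into 16 big-endian bytes.
    static byte[] toBytes(String uuidText) {
        UUID uuid = UUID.fromString(uuidText);
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Rebuilds the canonical text form from 16 big-endian bytes.
    static String toText(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant).toString();
    }
}
```

As a BSON binary value, those 16 bytes cost 21 bytes in total (4-byte length, 1-byte subtype, 16-byte payload), compared with 41 bytes for the string form.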