Skip to main content
unbound mongodb at scale

Denormalization and Atomicity: When to Duplicate Data

2 min read Chapter 25 of 72

Denormalization and Atomicity

MongoDB is not a relational database. The normalization rules that prevent data redundancy in PostgreSQL or MySQL do not apply. In MongoDB, duplicating data across documents is a deliberate design choice that trades storage for read performance. The question is never “should I denormalize?” It is “what is the read/write ratio for this data, and can I afford to update the duplicates?”

Decision matrix for $lookup vs denormalization. X-axis: read/write ratio (1:1 to 1000:1). Y-axis: data volatility (changes per hour). Four quadrants: normalize + $lookup (low reads, high volatility), denormalize + async update (high reads, low volatility), hybrid (moderate both), and event-driven sync (high reads, high volatility).

The Cost of $lookup

The telemetry dashboard shows sensor readings alongside sensor metadata (location, building, floor, sensor type). The normalized schema has two collections:

// readings collection
{ sensorId: "sensor-00042", ts: ISODate("..."), temperature: 23.5, ... }

// sensors collection (metadata)
{ _id: "sensor-00042", building: "HQ", floor: 3, zone: "A", type: "temperature", ... }

The dashboard query joins them with $lookup:

db.readings.aggregate([
  { $match: { sensorId: "sensor-00042", ts: { $gte: startTime } } },
  { $lookup: {
      from: "sensors",
      localField: "sensorId",
      foreignField: "_id",
      as: "sensor"
  }},
  { $unwind: "$sensor" },
  { $project: {
      sensorId: 1, ts: 1, temperature: 1,
      building: "$sensor.building", floor: "$sensor.floor"
  }}
])

This $lookup executes a separate query against the sensors collection for every document in the pipeline. With 500 readings matched, that is 500 secondary queries. MongoDB optimizes this with hash joins when possible, but the overhead is substantial.

The Denormalized Alternative

Embed the sensor metadata directly in each reading:

// readings collection (denormalized)
{
  sensorId: "sensor-00042",
  ts: ISODate("..."),
  temperature: 23.5,
  sensor: { building: "HQ", floor: 3, zone: "A", type: "temperature" }
}

The dashboard query becomes a simple find with no join:

// FAST: No $lookup needed
List<Document> readings = collection.find(
    Filters.and(
        Filters.eq("sensorId", sensorId),
        Filters.gte("ts", Date.from(start))
    )
).projection(Projections.include(
    "sensorId", "ts", "temperature", "sensor.building", "sensor.floor"
)).into(new ArrayList<>());

The trade-off: sensor metadata is duplicated in every reading document. If a sensor moves to a different floor, every reading for that sensor contains stale floor data until updated. With 200 million readings, updating all of them is a multi-hour operation.