Nested vs Object Fields and the Query Cost of Each

The Symptom

The documentation platform stores metadata about each code example within a document: the language, the framework version it targets, and whether it has been verified. A search for “Java code examples targeting Spring Boot 3.2” returns documents whose Java examples target Spring Boot 2.7 and whose only 3.2 examples are written in Python. Fields from different array elements are being cross-matched: the language condition is satisfied by one element and the version condition by another.

The Internals

OpenSearch can index an array of JSON objects in two fundamentally different ways, and the choice determines whether the fields of each array element stay associated with one another.

Object fields flatten nested JSON into dot-notation key-value pairs. Given:

{
  "code_examples": [
    { "language": "java", "framework_version": "3.2" },
    { "language": "python", "framework_version": "2.7" }
  ]
}

OpenSearch internally stores this as:

{
  "code_examples.language": ["java", "python"],
  "code_examples.framework_version": ["3.2", "2.7"]
}

The association between language: java and framework_version: 3.2 is lost. A query for documents where code_examples.language = java AND code_examples.framework_version = 2.7 matches this document, even though no single code example has that combination.

Nested fields store each array element as a hidden Lucene document, maintaining the association between fields within each element. The parent document and its nested documents are stored in the same Lucene block, and a nested query can match against individual array elements independently.

// FRAGILE: Object field for structured array data
// Cross-matching between array elements produces false positives.

.properties("code_examples", p -> p.object(o -> o
    .properties("language", pp -> pp.keyword(k -> k))
    .properties("framework_version", pp -> pp.keyword(k -> k))
    .properties("verified", pp -> pp.boolean_(b -> b))
))

// HARDENED: Nested field preserves per-element associations
// Each code example is queryable independently.

.properties("code_examples", p -> p.nested(n -> n
    .properties("language", pp -> pp.keyword(k -> k))
    .properties("framework_version", pp -> pp.keyword(k -> k))
    .properties("verified", pp -> pp.boolean_(b -> b))
))
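
To make the difference concrete, the two matching behaviors can be imitated in plain Java with no OpenSearch dependency. This is only a sketch of the semantics described above, not of how Lucene implements them; the class and method names (CrossMatchDemo, matchesAsObject, matchesAsNested) are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class CrossMatchDemo {

    // Object-field semantics: each sub-field is flattened into one multi-valued
    // field per document, so the two conditions are checked independently.
    static boolean matchesAsObject(List<Map<String, String>> examples,
                                   String language, String version) {
        Set<String> languages = examples.stream()
                .map(e -> e.get("language"))
                .collect(Collectors.toSet());
        Set<String> versions = examples.stream()
                .map(e -> e.get("framework_version"))
                .collect(Collectors.toSet());
        return languages.contains(language) && versions.contains(version);
    }

    // Nested semantics: both conditions must hold within the same array element.
    static boolean matchesAsNested(List<Map<String, String>> examples,
                                   String language, String version) {
        return examples.stream().anyMatch(e ->
                language.equals(e.get("language"))
                        && version.equals(e.get("framework_version")));
    }

    // The sample document from the flattening example above.
    static List<Map<String, String>> sampleDocument() {
        return List.of(
                Map.of("language", "java", "framework_version", "3.2"),
                Map.of("language", "python", "framework_version", "2.7"));
    }

    public static void main(String[] args) {
        List<Map<String, String>> doc = sampleDocument();
        System.out.println(matchesAsObject(doc, "java", "2.7")); // true  (false positive)
        System.out.println(matchesAsNested(doc, "java", "2.7")); // false
        System.out.println(matchesAsNested(doc, "java", "3.2")); // true
    }
}
```

The first call reproduces the symptom: "java" and "2.7" both occur somewhere in the document, so the flattened view matches even though no single code example has that pair.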

The Implementation

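Querying nested fields requires the nested query wrapper. A term query on code_examples.language placed directly in the top-level bool does not see the hidden nested documents, so it matches nothing once the field is mapped as nested: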

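// HARDENED: Nested query targeting a specific array element combination
// Assumes the opensearch-java client; userQuery is the user's search string.
import org.opensearch.client.opensearch._types.FieldValue;
import org.opensearch.client.opensearch.core.SearchRequest;

SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    .query(q -> q
        .bool(b -> b
            // Full-text relevance is scored against the parent document's body.
            .must(mu -> mu.match(m -> m.field("body").query(FieldValue.of(userQuery))))
            // Both term clauses must hold within the same code example.
            .filter(f -> f
                .nested(n -> n
                    .path("code_examples")
                    .query(nq -> nq
                        .bool(nb -> nb
                            .must(nm -> nm.term(t -> t
                                .field("code_examples.language")
                                .value(FieldValue.of("java"))))
                            .must(nm -> nm.term(t -> t
                                .field("code_examples.framework_version")
                                .value(FieldValue.of("3.2"))))
                        )
                    )
                )
            )
        )
    )
);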

/*
Equivalent JSON:
{
  "query": {
    "bool": {
      "must": { "match": { "body": "user query" } },
      "filter": {
        "nested": {
          "path": "code_examples",
          "query": {
            "bool": {
              "must": [
                { "term": { "code_examples.language": "java" } },
                { "term": { "code_examples.framework_version": "3.2" } }
              ]
            }
          }
        }
      }
    }
  }
}
*/

The Measurement

The hidden cost of nested documents is in document count. Each nested object creates a hidden Lucene document. A documentation page with 15 code examples produces 16 Lucene documents (1 parent + 15 nested). An index with 100,000 documentation pages, each averaging 10 code examples, contains 1,100,000 Lucene documents, not 100,000.

Metric	Object field	Nested field (avg. 10 elements)
Lucene doc count	100,000	1,100,000
Segment size	~5 GB	~8 GB
Simple match query latency	12 ms	14 ms
Nested filter query latency	N/A	22 ms
Heap per shard (field data)	200 MB	350 MB

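The nested filter query costs roughly 8 to 10 ms more than a simple match query (22 ms against 12 to 14 ms) because it must join the nested document matches back to their parent documents. Segment size also grows with the number of nested elements, but far more slowly than the Lucene document count: about 60% here for an 11-fold increase in documents.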
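
The document-count arithmetic can be written as a one-line helper for capacity estimates. This is a back-of-the-envelope sketch: the class and method names are hypothetical, and the average number of nested elements is treated as a whole number.

```java
public class NestedDocCount {

    // Each parent contributes itself plus one hidden Lucene document
    // per nested element.
    static long luceneDocCount(long parentDocs, long nestedPerParent) {
        return parentDocs * (1 + nestedPerParent);
    }

    public static void main(String[] args) {
        System.out.println(luceneDocCount(1, 15));       // 16
        System.out.println(luceneDocCount(100_000, 10)); // 1100000
        System.out.println(luceneDocCount(100_000, 0));  // 100000 (object mapping)
    }
}
```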

The Decision Rule

Use nested when array elements have multiple fields that must be queried in combination and false cross-matches would produce incorrect results. The code examples use case is a clear fit: users filter by language AND framework version, and cross-matches return wrong results.

Use object when array elements have a single field or when cross-matching is acceptable. A tags array of strings, for example, does not need nested because there is no internal structure to cross-match.

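As a rule of thumb, avoid nested fields when the average number of nested elements per document exceeds about 50. At that scale the hidden documents outnumber the parents by 50 to 1, and the resulting segment sizes and join cost are more than most applications can tolerate. Consider denormalizing into a keyword array of concatenated values (e.g., "java:3.2") instead: a single term query on that field matches only combinations that actually occur in one element, at the cost of having to decide at index time which fields are combined.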
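
The concatenated-value alternative can be sketched at indexing time as follows. The field name code_example_keys, the ":" separator, and the class and method names are assumptions made for illustration; the separator only works if it never occurs inside a language or version value.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CompositeKeyDemo {

    // Hypothetical separator; must never occur inside a language or version value.
    static final String SEPARATOR = ":";

    // Builds one token per element from the fields that must match together.
    static String compositeKey(String language, String frameworkVersion) {
        return language + SEPARATOR + frameworkVersion;
    }

    // Values for a keyword array field (called code_example_keys here, a made-up name).
    static List<String> toCompositeKeys(List<Map<String, String>> examples) {
        return examples.stream()
                .map(e -> compositeKey(e.get("language"), e.get("framework_version")))
                .collect(Collectors.toList());
    }

    static List<Map<String, String>> sampleExamples() {
        return List.of(
                Map.of("language", "java", "framework_version", "3.2"),
                Map.of("language", "python", "framework_version", "2.7"));
    }

    public static void main(String[] args) {
        List<String> keys = toCompositeKeys(sampleExamples());
        System.out.println(keys);                      // [java:3.2, python:2.7]
        System.out.println(keys.contains("java:3.2")); // true
        System.out.println(keys.contains("java:2.7")); // false, no cross-match
    }
}
```

A term query for "java:3.2" against such a field is an ordinary keyword lookup with no hidden documents and no join. The trade-off is flexibility: a filter on language alone still needs its own plain keyword field.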