Semantic Search: Dense Vectors, kNN, and Combining Lexical with Semantic Scoring
Semantic Search
A developer searches for “how to handle errors gracefully” in the documentation platform. The lexical search returns documents containing “handle,” “errors,” and “gracefully.” It misses the comprehensive error handling guide titled “Exception Management and Retry Strategies” because none of the query terms appear in that document. The guide uses “exception,” “fault tolerance,” and “recovery” instead of “handle,” “errors,” and “gracefully.”
Semantic search finds it. A dense vector representation of “how to handle errors gracefully” is close to the vector for “Exception Management and Retry Strategies” because both describe the same concept in different words.
But semantic search also produces confidently wrong results. A search for HttpClient.setConnectionTimeout returns a document about SocketFactory.setKeepAlive because the embedding model considers both to be “configuring network connection parameters.” The user searched for an exact API method name. The semantic model returned a semantically similar but factually wrong result.
The correct architecture uses both. Lexical search handles exact matches: method names, configuration keys, and error messages. Semantic search handles conceptual queries, paraphrased questions, and vocabulary mismatches. Combining them produces better results than either alone, and the benchmark later in this section shows it with NDCG@5 numbers.
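In the semantic case, "close" usually means a high cosine similarity between the two vectors. That is also the measure the kNN index later in this section is configured with (cosinesimil). The minimal sketch below shows the computation. The three-dimensional vectors are invented toy values standing in for real embeddings, which have 384 or 768 dimensions.

```java
public final class CosineSimilarity {
    /**
     * Cosine of the angle between two equal-length vectors:
     * dot(a, b) / (|a| * |b|). Returns NaN if either vector is all zeros.
     */
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors invented for illustration, not output from any embedding model.
        float[] query = {0.9f, 0.1f, 0.2f};      // "how to handle errors gracefully"
        float[] guide = {0.8f, 0.2f, 0.3f};      // "Exception Management and Retry Strategies"
        float[] unrelated = {0.1f, 0.9f, -0.4f}; // some unrelated page
        System.out.println(cosine(query, guide));     // close to 1
        System.out.println(cosine(query, unrelated)); // much lower
    }
}
```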
Embedding Generation for Technical Documentation
Embedding models convert text into fixed-dimension vectors. The quality of search depends heavily on model choice:
| Model | Dimensions | Context Window | Technical Domain Performance |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 tokens | Adequate for short content, poor on long docs |
| all-mpnet-base-v2 | 768 | 384 tokens | Good general purpose |
| e5-base-v2 | 768 | 512 tokens | Strong on retrieval tasks |
| nomic-embed-text-v1.5 | 768 | 8192 tokens | Best for long documentation pages |
For the documentation platform, documents can exceed 5,000 tokens. A model truncates any input beyond its context window. With every model in the table except nomic-embed-text-v1.5, most of such a page therefore never reaches the embedding. For long documents, the chunking strategy matters more than the model choice.
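The sketch below works through that arithmetic for a 5,000-token page. It computes the fraction of the page that fits in each context window from the table. It also computes how many overlapping windows a sliding chunker needs. The 512-token chunk size and 64-token overlap are assumed for illustration.

```java
public final class ChunkArithmetic {
    /** Fraction of a document that survives truncation to a model's context window. */
    static double fractionKept(int docTokens, int contextWindow) {
        return Math.min(1.0, (double) contextWindow / docTokens);
    }

    /**
     * Number of windows of `size` tokens needed to cover `docTokens` when each
     * window starts (size - overlap) tokens after the previous one.
     */
    static int chunkCount(int docTokens, int size, int overlap) {
        if (overlap >= size) {
            throw new IllegalArgumentException("overlap must be smaller than size");
        }
        if (docTokens <= size) {
            return 1;
        }
        int stride = size - overlap;
        // 1 + ceil((docTokens - size) / stride), using integer arithmetic
        return 1 + (docTokens - size + stride - 1) / stride;
    }

    public static void main(String[] args) {
        int docTokens = 5_000;
        for (int window : new int[] {256, 384, 512, 8192}) { // context windows from the table
            System.out.printf("%5d-token window keeps %.1f%% of the page%n",
                    window, 100 * fractionKept(docTokens, window));
        }
        System.out.println("512-token chunks, 64-token overlap: "
                + chunkCount(docTokens, 512, 64) + " chunks");
    }
}
```

With these numbers, the 256-token model sees about 5% of the page. The chunker covers the whole page with 12 chunks, with a new window starting every 448 tokens.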
Chunking Strategy
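```java
import com.fasterxml.jackson.annotation.JsonProperty;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Assumes DocPage is a record exposing tenantId(), slug(), title(), and body().
public class DocumentChunker {
    // Sizes are in characters, not tokens. At a rough 4 characters per token
    // for English prose, 2,048 characters is about 512 tokens and 256 about 64.
    private static final int CHUNK_SIZE_CHARS = 2048;
    private static final int CHUNK_OVERLAP_CHARS = 256;

    /**
     * Chunk a documentation page into overlapping segments.
     * Each chunk preserves the document's metadata for retrieval.
     * A sentence longer than CHUNK_SIZE_CHARS is not split, so it yields an oversized chunk.
     */
    public List<DocumentChunk> chunk(DocPage page) {
        String fullText = page.title() + "\n\n" + page.body();
        List<String> sentences = splitIntoSentences(fullText);
        List<DocumentChunk> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder();
        int chunkIndex = 0;
        for (String sentence : sentences) {
            if (currentChunk.length() + sentence.length() > CHUNK_SIZE_CHARS
                    && !currentChunk.isEmpty()) {
                chunks.add(toChunk(page, currentChunk.toString(), chunkIndex++));
                // Overlap: carry the tail of the finished chunk into the next one
                String overlap = getOverlapText(currentChunk.toString(), CHUNK_OVERLAP_CHARS);
                currentChunk = new StringBuilder(overlap);
            }
            currentChunk.append(sentence).append(' ');
        }
        // Final chunk
        if (!currentChunk.isEmpty()) {
            chunks.add(toChunk(page, currentChunk.toString(), chunkIndex));
        }
        return chunks;
    }

    private static DocumentChunk toChunk(DocPage page, String text, int index) {
        return new DocumentChunk(
                page.tenantId() + ":" + page.slug() + "#" + index,
                page.tenantId(), page.slug(), page.title(), text.trim(), index);
    }

    /** Split text into sentences using the JDK's sentence BreakIterator. */
    private static List<String> splitIntoSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        for (int start = it.first(), end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Return about the last maxChars characters of text, starting at a word boundary. */
    private static String getOverlapText(String text, int maxChars) {
        String trimmed = text.trim();
        if (trimmed.length() <= maxChars) {
            return trimmed + " ";
        }
        int start = trimmed.indexOf(' ', trimmed.length() - maxChars);
        String tail = start < 0 ? trimmed.substring(trimmed.length() - maxChars)
                : trimmed.substring(start + 1);
        return tail + " ";
    }

    // Explicit JSON names keep the indexed source aligned with the snake_case
    // fields in the docs-vectors-v1 mapping (tenant_id, parent_doc_slug, ...).
    public record DocumentChunk(
            @JsonProperty("chunk_id") String chunkId,
            @JsonProperty("tenant_id") String tenantId,
            @JsonProperty("parent_doc_slug") String parentDocSlug,
            @JsonProperty("parent_doc_title") String parentDocTitle,
            @JsonProperty("chunk_text") String text,
            @JsonProperty("chunk_index") int chunkIndex
    ) {}
}
```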
kNN Index Configuration
```java
import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.indices.CreateIndexRequest;

// Index mapping with a knn_vector field for dense vector search.
// The dimension must match the embedding model's output size.
CreateIndexRequest request = CreateIndexRequest.of(idx -> idx
        .index("docs-vectors-v1")
        .settings(s -> s
                .knn(true)
                .numberOfShards("3")
                .numberOfReplicas("1")
        )
        .mappings(m -> m
                .properties("embedding", p -> p
                        .knnVector(kv -> kv
                                .dimension(768)
                                .method(km -> km
                                        .name("hnsw")
                                        .spaceType("cosinesimil")
                                        .engine("lucene")
                                        .parameters("ef_construction", JsonData.of(256))
                                        .parameters("m", JsonData.of(16))
                                )
                        )
                )
                .properties("tenant_id", pp -> pp.keyword(k -> k))
                .properties("parent_doc_slug", pp -> pp.keyword(k -> k))
                .properties("parent_doc_title", pp -> pp.text(t -> t.analyzer("standard")))
                // code_analyzer is a custom analyzer. It must also be defined in this
                // index's analysis settings, or index creation fails.
                .properties("chunk_text", pp -> pp.text(t -> t.analyzer("code_analyzer")))
                .properties("chunk_index", pp -> pp.integer(i -> i))
        )
);
```
HNSW parameters:
- ef_construction (256): the size of the candidate list the algorithm keeps while inserting each vector into the graph. Higher values produce a more accurate graph at the cost of slower indexing. 256 is a reasonable production value.
- m (16): the number of bidirectional links per node. Higher values increase recall but consume more memory and slow indexing. 16, which is also OpenSearch's default, works well for most use cases.
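The memory side of the m trade-off can be quantified. The OpenSearch k-NN documentation estimates native HNSW memory at roughly 1.1 × (4 × dimension + 8 × m) bytes per vector. The sketch below applies that rule of thumb to this mapping. The one-million-chunk corpus is hypothetical. The estimate is written for the native engines (nmslib, faiss), whereas the lucene engine used here manages its graph inside Lucene segments, so treat the output as an order-of-magnitude sizing guide.

```java
public final class HnswMemoryEstimate {
    /** Rough HNSW bytes per vector, 1.1 * (4 * dimension + 8 * m), rounded down with integer math. */
    static long bytesPerVector(int dimension, int m) {
        return (4L * dimension + 8L * m) * 11 / 10;
    }

    /** Estimate across the cluster; every replica holds its own copy of the graph. */
    static long clusterBytes(int dimension, int m, long vectors, int replicas) {
        return bytesPerVector(dimension, m) * vectors * (1 + replicas);
    }

    public static void main(String[] args) {
        int dimension = 768;       // from the mapping above
        int m = 16;                // from the mapping above
        long vectors = 1_000_000L; // hypothetical number of chunks
        int replicas = 1;          // number_of_replicas from the index settings
        System.out.println(bytesPerVector(dimension, m) + " bytes per vector");
        System.out.printf("about %.1f GiB cluster-wide%n",
                clusterBytes(dimension, m, vectors, replicas) / (1024.0 * 1024 * 1024));
    }
}
```

For 768 dimensions and m = 16, this gives 3,520 bytes per vector. Doubling m to 32 adds only about 140 bytes per vector, because the raw float32 vector (4 × 768 = 3,072 bytes) dominates.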
Hybrid Scoring with Reciprocal Rank Fusion
RRF combines rankings from multiple retrieval methods without requiring score normalization. Each document’s RRF score is:
$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$
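Where $R$ is the set of retrieval methods, $\text{rank}_r(d)$ is document $d$’s 1-based rank in method $r$’s results, and $k$ is a smoothing constant (typically 60). A larger $k$ flattens the gap between top ranks. As a result, a document ranked first by only one method is less likely to outscore a document that both methods rank well. A document absent from a method’s results contributes nothing for that method. In the code, this constant is RRF_K; the hybridSearch parameter k is the number of results to return.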
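A worked example uses two made-up rankings. Document C is only third in the lexical list, but the semantic list ranks it first. Its fused score of 1/63 + 1/61 ≈ 0.0323 therefore beats document B, which only the lexical method found (1/62 ≈ 0.0161). The sketch treats each method's output as a best-first list of document IDs.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class RrfExample {
    static final int RRF_K = 60;

    /**
     * Reciprocal Rank Fusion over ranked lists of document IDs (best first).
     * A document missing from a list gets no contribution from that list.
     * Returns IDs with fused scores, highest score first.
     */
    static LinkedHashMap<String, Double> fuse(List<List<String>> rankings) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // rank is 1-based, so position i contributes 1 / (k + i + 1)
                scores.merge(ranking.get(i), 1.0 / (RRF_K + i + 1), Double::sum);
            }
        }
        LinkedHashMap<String, Double> fused = new LinkedHashMap<>();
        scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .forEach(e -> fused.put(e.getKey(), e.getValue()));
        return fused;
    }

    public static void main(String[] args) {
        List<String> lexical = List.of("A", "B", "C");  // made-up BM25 ranking
        List<String> semantic = List.of("C", "A", "D"); // made-up kNN ranking
        // Fused order: A, C, B, D
        fuse(List.of(lexical, semantic)).forEach((id, score) ->
                System.out.println(id + " " + score));
    }
}
```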
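```java
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.FieldValue;
import org.opensearch.client.opensearch._types.query_dsl.TextQueryType;
import org.opensearch.client.opensearch.core.SearchResponse;
import org.opensearch.client.opensearch.core.search.Hit;
// DocPage, EmbeddingService, and DocumentChunker.DocumentChunk come from the
// application's own packages.

public class HybridSearchService {
    private final OpenSearchClient client;
    // Assumed interface: float[] embed(String text), backed by the same model
    // that produced the indexed chunk embeddings.
    private final EmbeddingService embeddingService;
    // The RRF smoothing constant from the formula; not the result count k below.
    private static final int RRF_K = 60;

    public HybridSearchService(OpenSearchClient client,
                               EmbeddingService embeddingService) {
        this.client = client;
        this.embeddingService = embeddingService;
    }

    public List<HybridResult> hybridSearch(String tenantId, String query, int k)
            throws IOException {
        // Lexical search: fetch 2k candidates so fusion has room to reorder.
        // Routing assumes docs-v1 was indexed with tenantId as the routing value.
        SearchResponse<DocPage> lexicalResults = client.search(s -> s
                .index("docs-v1")
                .routing(tenantId)
                .size(k * 2)
                .query(q -> q.bool(b -> b
                        .filter(f -> f.term(t -> t
                                .field("tenant_id").value(FieldValue.of(tenantId))))
                        .must(mu -> mu.multiMatch(mm -> mm
                                .query(query)
                                .fields("title^3", "body", "code_snippets^0.5", "api_method^10")
                                .type(TextQueryType.CrossFields)
                        ))
                )),
                DocPage.class
        );
        // Semantic search. The approximate search finds the k * 2 nearest chunks
        // per shard across all tenants, and this bool filter then drops other
        // tenants' chunks, so fewer than k * 2 hits may remain. OpenSearch's
        // efficient k-NN filtering (a filter inside the knn query) avoids this.
        float[] queryVector = embeddingService.embed(query);
        SearchResponse<DocumentChunk> semanticResults = client.search(s -> s
                .index("docs-vectors-v1")
                .size(k * 2)
                .query(q -> q.bool(b -> b
                        .filter(f -> f.term(t -> t
                                .field("tenant_id").value(FieldValue.of(tenantId))))
                        .must(mu -> mu.knn(knn -> knn
                                .field("embedding")
                                .vector(queryVector)
                                .k(k * 2)
                        ))
                )),
                DocumentChunk.class
        );
        // Reciprocal Rank Fusion
        return reciprocalRankFusion(lexicalResults, semanticResults, k);
    }

    private List<HybridResult> reciprocalRankFusion(
            SearchResponse<DocPage> lexical,
            SearchResponse<DocumentChunk> semantic,
            int k) {
        Map<String, Double> rrfScores = new HashMap<>();
        Map<String, String> docTitles = new HashMap<>();
        // Lexical rankings: hit order is rank order (rank = i + 1).
        // Assumes docs-v1 document IDs have the form "<tenantId>:<slug>".
        List<Hit<DocPage>> lexicalHits = lexical.hits().hits();
        for (int i = 0; i < lexicalHits.size(); i++) {
            Hit<DocPage> hit = lexicalHits.get(i);
            rrfScores.merge(hit.id(), 1.0 / (RRF_K + i + 1), Double::sum);
            if (hit.source() != null) {
                docTitles.put(hit.id(), hit.source().title());
            }
        }
        // Semantic rankings: collapse chunks to parent documents first, keeping
        // each parent's best-ranked chunk, so a document with many matching
        // chunks contributes one RRF term rather than one per chunk.
        Map<String, String> semanticParents = new LinkedHashMap<>();
        for (Hit<DocumentChunk> hit : semantic.hits().hits()) {
            DocumentChunk chunk = hit.source();
            if (chunk != null) {
                semanticParents.putIfAbsent(
                        chunk.tenantId() + ":" + chunk.parentDocSlug(),
                        chunk.parentDocTitle());
            }
        }
        int rank = 1;
        for (Map.Entry<String, String> parent : semanticParents.entrySet()) {
            rrfScores.merge(parent.getKey(), 1.0 / (RRF_K + rank), Double::sum);
            docTitles.putIfAbsent(parent.getKey(), parent.getValue());
            rank++;
        }
        return rrfScores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(entry -> new HybridResult(
                        entry.getKey(),
                        docTitles.get(entry.getKey()),
                        entry.getValue()
                ))
                .toList();
    }

    public record HybridResult(String documentId, String title, double rrfScore) {}
}
```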
The Benchmark
NDCG@5 comparison across retrieval strategies on the 50-query test set:
| Strategy | Overall NDCG@5 | Method Name | Concept | Error Msg | Config Key | How-to |
|---|---|---|---|---|---|---|
| Lexical only (BM25) | 0.77 | 0.89 | 0.71 | 0.72 | 0.81 | 0.65 |
| Semantic only (kNN) | 0.62 | 0.45 | 0.78 | 0.52 | 0.38 | 0.79 |
| Hybrid (RRF) | 0.82 | 0.87 | 0.77 | 0.71 | 0.79 | 0.78 |
The numbers confirm the thesis. Semantic search alone is worse than lexical search for method names (0.45 vs 0.89), config keys (0.38 vs 0.81), and error messages (0.52 vs 0.72). These query categories require exact token matching that vector similarity cannot provide.
Semantic search excels at concept queries (0.78 vs 0.71) and how-to questions (0.79 vs 0.65), where vocabulary mismatch between query and document is common.
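Hybrid search with RRF captures most of the strengths of both. It stays within 0.02 of the best single strategy in every category: slightly below lexical-only on the exact-match categories and slightly below semantic-only on the conceptual ones. Its overall NDCG@5 of 0.82 exceeds both individual strategies because it avoids the large losses each method suffers on the other’s query types.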
The diagram shows three overlapping score distributions for the query “how to handle errors gracefully.” The BM25 distribution assigns high scores to documents containing the exact query terms and zero to documents using synonymous vocabulary. The kNN distribution is smoother, with non-zero scores for semantically related documents, but it confidently ranks some irrelevant documents high. The RRF distribution combines both, boosting documents that appear in both rankings while still ranking documents that only one method retrieves.
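The minimal NDCG@k sketch below may help readers reproduce the benchmark. It assumes graded relevance judgments and the linear gain grade / log2(rank + 1). Some evaluation tools use the exponential gain 2^grade - 1 instead, which produces different numbers, so match the variant to the evaluation setup. The ideal ranking is built from the grades of every judged document for the query, not only the retrieved ones.

```java
import java.util.Comparator;
import java.util.List;

public final class Ndcg {
    /** DCG@k with linear gain: sum over ranks i = 1..k of grade_i / log2(i + 1). */
    static double dcgAtK(List<Integer> grades, int k) {
        double dcg = 0;
        for (int i = 0; i < Math.min(k, grades.size()); i++) {
            dcg += grades.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return dcg;
    }

    /**
     * NDCG@k = DCG@k of the system ranking / DCG@k of the ideal ranking.
     * retrieved: grades of the returned documents in rank order (0 = not relevant).
     * judged: grades of every judged document for the query.
     */
    static double ndcgAtK(List<Integer> retrieved, List<Integer> judged, int k) {
        List<Integer> ideal = judged.stream().sorted(Comparator.reverseOrder()).toList();
        double idcg = dcgAtK(ideal, k);
        return idcg == 0 ? 0 : dcgAtK(retrieved, k) / idcg;
    }

    public static void main(String[] args) {
        // Hypothetical judgments: one highly relevant (2) and one partially relevant (1) page.
        List<Integer> judged = List.of(2, 1);
        System.out.println(ndcgAtK(List.of(2, 1, 0, 0, 0), judged, 5)); // ideal order
        System.out.println(ndcgAtK(List.of(0, 1, 0, 2, 0), judged, 5)); // relevant pages ranked late
    }
}
```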
The Decision Rule
Implement hybrid search with RRF when, on the query test set, hybrid search improves NDCG@5 over lexical-only search on at least one query category by more than 0.05 without degrading any category by more than 0.03. For the documentation platform, concept (+0.06) and how-to (+0.13) queries meet the improvement threshold, and the largest degradation is 0.02 (method names and config keys).
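The rule is mechanical enough to encode. The sketch below applies it to the lexical-only and hybrid rows of the benchmark table. The category keys are shorthand chosen here, and the thresholds are the rule's 0.05 and 0.03.

```java
import java.util.Map;

public final class HybridDecisionRule {
    static final double MIN_GAIN = 0.05;       // some category must improve by more than this
    static final double MAX_REGRESSION = 0.03; // no category may drop by more than this

    /**
     * Applies the decision rule to per-category NDCG@5 scores.
     * Assumes both maps contain the same category keys.
     */
    static boolean adoptHybrid(Map<String, Double> lexicalOnly, Map<String, Double> hybrid) {
        boolean anyGain = false;
        for (Map.Entry<String, Double> e : lexicalOnly.entrySet()) {
            double delta = hybrid.get(e.getKey()) - e.getValue();
            if (delta < -MAX_REGRESSION) {
                return false; // one category degrades too much
            }
            if (delta > MIN_GAIN) {
                anyGain = true;
            }
        }
        return anyGain;
    }

    public static void main(String[] args) {
        // Per-category NDCG@5 from the benchmark table
        Map<String, Double> lexical = Map.of("method", 0.89, "concept", 0.71,
                "error", 0.72, "config", 0.81, "howto", 0.65);
        Map<String, Double> hybrid = Map.of("method", 0.87, "concept", 0.77,
                "error", 0.71, "config", 0.79, "howto", 0.78);
        System.out.println(adoptHybrid(lexical, hybrid)); // true
    }
}
```

Passing the semantic-only row instead of the hybrid row returns false, since config keys drop by 0.43.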
Do not use semantic search alone. On exact-match query patterns (method names, configuration keys, error messages), which represent 55% of the documentation platform’s query traffic, it trails lexical search by 0.20 to 0.44 NDCG@5.
Use RRF over linear score combination. BM25 scores are unbounded, and their range shifts with the query’s terms and the corpus statistics. Cosine similarity is bounded. The two are therefore on different scales and distributions. Normalizing them for linear combination requires per-query statistics. RRF uses ranks, not scores, and is robust to scale differences.
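For contrast, the sketch below shows the linear alternative. It min-max normalizes each method's scores within the current query's result list, then takes a weighted sum. The per-query minimum and maximum are the statistics in question, and they change from query to query. The weight alpha and the raw scores are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public final class LinearScoreFusion {
    /** Min-max normalize one query's scores to [0, 1]; all-equal scores map to 1.0. */
    static Map<String, Double> minMax(Map<String, Double> scores) {
        double min = scores.values().stream().mapToDouble(Double::doubleValue).min().orElse(0);
        double max = scores.values().stream().mapToDouble(Double::doubleValue).max().orElse(0);
        Map<String, Double> out = new HashMap<>();
        scores.forEach((id, s) -> out.put(id, max == min ? 1.0 : (s - min) / (max - min)));
        return out;
    }

    /** alpha * normalized BM25 + (1 - alpha) * normalized cosine; a missing score counts as 0. */
    static Map<String, Double> combine(Map<String, Double> bm25, Map<String, Double> cosine,
                                       double alpha) {
        Map<String, Double> lex = minMax(bm25);
        Map<String, Double> sem = minMax(cosine);
        Map<String, Double> out = new HashMap<>();
        lex.forEach((id, s) -> out.merge(id, alpha * s, Double::sum));
        sem.forEach((id, s) -> out.merge(id, (1 - alpha) * s, Double::sum));
        return out;
    }

    public static void main(String[] args) {
        // Illustrative raw scores: BM25 is unbounded, cosine similarity is at most 1.
        Map<String, Double> bm25 = Map.of("A", 14.2, "B", 9.7, "C", 3.1);
        Map<String, Double> cosine = Map.of("C", 0.91, "A", 0.88, "D", 0.86);
        System.out.println(combine(bm25, cosine, 0.5));
    }
}
```

Here the cosine scores span only 0.86 to 0.91. Normalization stretches that 0.05 gap across the full 0 to 1 range, which gives the kNN list as much influence as a BM25 list whose scores differ by more than 11 points. RRF sidesteps this by looking only at positions.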