BM25 vs. Vector Search: Bridging the Gap Between Keywords and Semantics

How BM25 and RAG Retrieve Information Differently

BM25 is the default ranking function in Lucene and in search engines built on it, such as Elasticsearch. It scores documents by how often query terms occur in them (term frequency) and by how rare those terms are across the corpus. A critical component is the parameter b, typically set to 0.75, which normalizes for document length so that long documents do not unfairly dominate results.
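For reference, the scoring rule described above is usually written in the following form. This is the standard Okapi BM25 expression rather than anything specific to this article; the symbols f(t, D) (count of term t in document D), |D| (document length in tokens), and avgdl (average document length in the collection) are notation introduced here, and the exact IDF definition varies between implementations.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)}
```

With b = 0 the length term reduces to 1 and document length is ignored; with b = 1 the term frequency is fully scaled by |D| / avgdl.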

Why This Matters

While BM25 is fast, lightweight, and explainable, it suffers from a bag-of-words limitation: it cannot recognize synonyms or context, so it cannot tell "bank" in the financial sense from "bank" in the geographic sense. In contrast, Vector Search enables semantic matching through dense numerical vectors, but it incurs higher latency and cost because it typically depends on an embedding API or on GPU hardware. BM25 requires no model or GPU, whereas vector search requires an embedding model at both index and query time. Production-grade systems increasingly adopt hybrid search to mitigate the failure modes of both approaches.

Key Insights

BM25 uses term frequency saturation, controlled by parameter k1 (typically 1.2 to 2.0), to prevent keyword stuffing from inflating relevance scores.
Vector Search converts text into dense vectors (1,536 dimensions for OpenAI's text-embedding-3-small) and compares them by cosine similarity.
Inverse Document Frequency (IDF) ensures rare terms carry more weight, so a match on a term such as "retrieval" that appears in only a few documents of a 10,000-document corpus is highly significant.
Length normalization in BM25 compares document length to the collection average to ensure fairness across varying text sizes.
Sparse retrieval methods like BM25 rely on exact keyword matches and fail when queries use synonyms or paraphrases not present in the document.
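The insights above can be made concrete with a small from-scratch scorer. This is a minimal sketch, not the code used by rank_bm25 or Lucene: the function name bm25_scores, the toy documents, and the choice of k1 = 1.5 are assumptions made for illustration, and the IDF shown is one common non-negative variant (libraries differ on this detail).

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens with BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_counts = Counter(doc)
        # Length normalization: > 1 for longer-than-average documents when b > 0.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_counts[term]
            if tf == 0:
                continue  # no exact match, so no contribution
            n_t = doc_freq[term]
            # Rare terms get a larger IDF; the "1 +" keeps it positive.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Saturating term frequency: this factor stays below k1 + 1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical) to exercise the scorer.
docs = [
    "the bank approved the loan".split(),
    "bank bank bank bank bank bank".split(),
    "we walked along the river shore at dusk".split(),
]
scores = bm25_scores(["bank"], docs)
synonym_scores = bm25_scores(["riverbank"], docs)
```

In this toy corpus the keyword-stuffed second document mentions "bank" six times but scores well under six times the first document, which is the saturation effect of k1. The query "riverbank" scores zero everywhere, including on the document about a river shore, because no exact token matches.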

Working Examples

Implementation of BM25 tokenization and indexing using the rank_bm25 library.

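import re
from rank_bm25 import BM25Okapi

# Placeholder corpus so the snippet runs; replace with your own text chunks.
CHUNKS = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Vector search compares dense embeddings using cosine similarity.",
    "Hybrid search combines keyword and semantic retrieval.",
]

def tokenize(text: str) -> list[str]:
    # Lowercase the text and split it into runs of word characters.
    return re.findall(r'\w+', text.lower())

tokenized_corpus = [tokenize(chunk) for chunk in CHUNKS]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_search(query: str, top_k: int = 3) -> list[tuple[int, float]]:
    tokens = tokenize(query)
    scores = bm25.get_scores(tokens)
    # Rank chunk indices by descending score and keep the top_k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, float(scores[i])) for i in ranked[:top_k]]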

Generating dense vector embeddings and calculating cosine similarity for semantic search.

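import numpy as np
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

def get_embedding(text: str) -> np.ndarray:
    # One API call per text; the same model must be used for chunks and queries.
    response = client.embeddings.create(model=EMBED_MODEL, input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))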
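Once embeddings exist, retrieval is a ranking of chunks by similarity to the query vector. The sketch below shows that step in pure Python so it runs without an API key; the function names cosine and vector_search are hypothetical, and the three-dimensional vectors are made-up stand-ins for real 1,536-dimensional embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec, chunk_vecs, top_k=3):
    """Return (chunk_index, similarity) pairs, best match first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional vectors standing in for real embeddings.
toy_chunk_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
toy_query_vec = [1.0, 0.0, 0.0]

results = vector_search(toy_query_vec, toy_chunk_vecs, top_k=2)
```

In a real system the chunk vectors would be computed once at index time and stored, and only the query would be embedded at search time.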

Practical Applications

Hybrid Search Systems: Combining BM25 keyword precision with Vector Search semantic depth to handle both exact matches and paraphrases.
Pitfall: Disabling length normalization (setting b=0) in BM25 lets long, wordy documents rank higher simply because they contain more term occurrences, regardless of actual relevance.
Resource Management: Using BM25 for low-latency indexing on local hardware while reserving embedding models for complex semantic queries.
Pitfall: Using different embedding models for the query and the index places their vectors in different semantic spaces, so similarity scores between them are not meaningful.
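One way to combine the two result lists in a hybrid system is reciprocal rank fusion (RRF). The source does not say which fusion method to use, so treat this as one possible sketch: the function name, the document IDs, and the two example rankings are hypothetical, and k = 60 is the conventional default constant for RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by descending fused score.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of a keyword retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

hybrid = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF works on ranks rather than raw scores, which avoids having to put BM25 scores and cosine similarities on a common scale before combining them.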

References:

https://www.marktechpost.com/2026/03/22/how-bm25-and-rag-retrieve-information-differently/
