Building Semantic Search Engines with Sentence Transformer Embeddings

Build Semantic Search with LLM Embeddings - MachineLearningMastery.com

This technical guide demonstrates how to build a semantic search pipeline with the Sentence Transformers library (the sentence_transformers package) and scikit-learn. The pipeline indexes a 1,000-document subset of the ag_news dataset so that retrieval is based on meaning rather than simple keyword matching.

Why This Matters

Traditional keyword search is inherently rigid, relying on exact word matches and frequently failing to capture semantic nuances like synonyms or alternative phrasing. By utilizing LLM-generated embeddings, developers can build retrieval systems that understand the conceptual intent behind a query, forming the critical architectural foundation for modern Retrieval Augmented Generation (RAG) systems.

Key Insights

Sentence transformer models like ‘all-MiniLM-L6-v2’ map raw text to dense, fixed-length numerical vectors (384 dimensions for this model) whose positions relative to one another reflect semantic similarity.
The NearestNeighbors index is configured with the ‘cosine’ metric, so it ranks documents by cosine distance (1 minus cosine similarity) between the query vector and each document vector; the smallest distances are the closest semantic matches.
Semantic search resolves the ‘vocabulary mismatch’ problem where relevant documents are missed because they use different terminology for the same concept.
The ‘ag_news’ dataset, a collection of news articles labeled by topic and commonly used for text classification, serves here as a convenient corpus for trying out a vector-based retrieval pipeline.
Modern search architectures use semantic retrieval as a foundational layer to provide relevant context to large language models for RAG workflows.
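
The vocabulary mismatch problem noted in the list above can be seen with a toy keyword matcher. The following is a minimal sketch and not part of the original guide; the tokenize and keyword_overlap helpers, the tiny stop-word list, and the example sentences are all illustrative assumptions.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this illustration only.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "for", "and", "after"}

def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def keyword_overlap(query, document):
    """Return the content words shared by the query and the document."""
    return tokenize(query) & tokenize(document)

query = "stock market trends"
doc_same_words = "Stock market closes higher"
doc_other_words = "Wall Street shares rally after earnings"

shared_same = keyword_overlap(query, doc_same_words)
shared_other = keyword_overlap(query, doc_other_words)

print(sorted(shared_same))   # words the first document shares with the query
print(sorted(shared_other))  # empty: related topic, but no shared words
```

A keyword matcher scores the second document at zero even though it covers the same topic, which is the gap that embedding-based retrieval is meant to close.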

Working Examples

Initialization of the sentence transformer model and vector index.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Load dataset and extract text
dataset = load_dataset("ag_news", split="train[:1000]")
documents = list(dataset["text"])  # plain Python list of strings

# Initialize embedding model and encode documents
model = SentenceTransformer("all-MiniLM-L6-v2")
document_embeddings = model.encode(documents, show_progress_bar=True)

# Fit the nearest neighbors index using cosine distance (1 - cosine similarity)
search_engine = NearestNeighbors(n_neighbors=5, metric="cosine")
search_engine.fit(document_embeddings)
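
Conceptually, a cosine nearest-neighbor query ranks every stored vector by its cosine distance to the query vector. The sketch below illustrates that idea in plain Python on made-up 2-D vectors. It is not scikit-learn's implementation, and cosine_distance and brute_force_neighbors are hypothetical names chosen for this example.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_neighbors(query, vectors, k):
    """Return (distances, indices) of the k vectors nearest to the query."""
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "document embeddings" (assumed values, 2-D for readability)
toy_vectors = [
    [1.0, 0.0],   # index 0: almost the same direction as the query
    [0.0, 1.0],   # index 1: nearly orthogonal to the query
    [1.0, 1.0],   # index 2: partly aligned with the query
    [-1.0, 0.0],  # index 3: opposite direction
]
toy_query = [2.0, 0.1]

toy_distances, toy_indices = brute_force_neighbors(toy_query, toy_vectors, k=2)
print(toy_indices)    # nearest first
print(toy_distances)  # smaller distance means closer in direction
```

The return shape mirrors the (distances, indices) pair that kneighbors produces, with nearest results first.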

The search function that computes query embeddings and retrieves nearest neighbors.

def semantic_search(query, top_k=3):
    """Print the top_k documents closest in meaning to the query."""
    # Embed the query; a one-item list yields the 2D array kneighbors expects
    query_embedding = model.encode([query])

    # Retrieve cosine distances and document indices, nearest first
    distances, indices = search_engine.kneighbors(query_embedding, n_neighbors=top_k)

    print(f"\n🔍 Query: '{query}'")
    for rank, (distance, doc_idx) in enumerate(zip(distances[0], indices[0]), start=1):
        similarity = 1 - distance  # cosine similarity = 1 - cosine distance
        print(f"Result {rank} (Similarity: {similarity:.4f})")
        print(f"Text: {documents[int(doc_idx)][:150]}...\n")

semantic_search("Wall street and stock market trends")
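
The function above prints its results, but in a RAG workflow the retrieved passages would typically be passed to a language model as context. The sketch below shows one possible way to assemble such a prompt. The build_rag_prompt helper, the prompt wording, and the two placeholder passages are assumptions made for illustration and do not come from the original guide or from the ag_news data.

```python
def build_rag_prompt(question, passages, max_chars=150):
    """Format retrieved passages and a question into a single prompt string."""
    # Number each passage and truncate it so the prompt stays compact
    context_lines = [
        f"[{i}] {passage[:max_chars]}"
        for i, passage in enumerate(passages, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder passages standing in for retrieved documents
retrieved = [
    "Stocks rallied on Wall Street as oil prices eased.",
    "Investors weighed earnings reports ahead of the Fed meeting.",
]
prompt = build_rag_prompt("What moved the stock market?", retrieved)
print(prompt)
```

To connect this with the search function, semantic_search would need to return the matched texts instead of printing them.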

Practical Applications

Use case: Retrieval Augmented Generation (RAG) systems use semantic search to fetch relevant context from a database to ground LLM responses. Pitfall: Relying on exact keyword matching can cause the retriever to miss relevant passages when the query and the source are phrased differently, leaving the LLM without critical context.
Use case: Enterprise document recommendation systems use nearest neighbor searches to find content similar to a user’s current reading history. Pitfall: Euclidean distance is sensitive to vector magnitude, so it can rank results differently from cosine similarity when embeddings are not length-normalized; on unit-normalized embeddings the two produce the same ranking.
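
The magnitude pitfall can be illustrated with a few hand-picked 2-D vectors. This is a minimal sketch under assumed values: the vectors and helper names are hypothetical, and real embeddings have far more dimensions.

```python
import math

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in a))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a non-zero vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

query_vec = [1.0, 0.0]
same_direction_long = [10.0, 0.0]    # same direction as the query, 10x the magnitude
other_direction_short = [0.8, 0.6]   # different direction, magnitude close to 1

# Cosine similarity ignores magnitude and prefers the same-direction vector
print(cosine_similarity(query_vec, same_direction_long))
print(cosine_similarity(query_vec, other_direction_short))

# Raw Euclidean distance penalizes the large magnitude and prefers the other vector
print(euclidean_distance(query_vec, same_direction_long))
print(euclidean_distance(query_vec, other_direction_short))

# After unit-normalization, Euclidean distance agrees with the cosine ranking
print(euclidean_distance(normalize(query_vec), normalize(same_direction_long)))
print(euclidean_distance(normalize(query_vec), normalize(other_direction_short)))
```

For unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, which is why the two rankings coincide after normalization.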

References:

https://machinelearningmastery.com/build-semantic-search-with-llm-embeddings/
