Semantic Search Engine Built with CocoIndex in 2 Days
These articles are AI-generated summaries. Please check the original sources for full details.
How I Built a Semantic Search Engine with CocoIndex
Linghua Jin built a semantic search engine using CocoIndex, achieving 30-second indexing for 500+ documents and 50ms query responses.
Why This Matters
Keyword-based search matches literal terms and misses context, which leads to poor results when a query is phrased differently from the document. Semantic search closes this gap by comparing vector embeddings, but it needs efficient infrastructure for indexing and storage. The CocoIndex example shows that a small embedding model paired with lightweight vector storage can answer queries in roughly 50 ms without the complexity of a traditional search stack.
Key Insights
- “500+ markdown files indexed in 30 seconds” (Real-World Example)
- “Semantic embeddings allow ‘teaching computers’ to match ‘machine learning’” (Key Features)
- “Batch indexing improves performance for large document collections” (Performance Tips)
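The batch-indexing tip can be illustrated with a generic helper. This is a minimal sketch of the idea, not CocoIndex's own mechanism: `batched`, `index_in_batches`, and `fake_embed_batch` are hypothetical names, and the stand-in embedding function only returns text lengths so the example runs without a model.

```python
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def index_in_batches(
    docs: Iterable[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> tuple[list[list[float]], int]:
    """Embed docs batch by batch; return the vectors and the number of calls made."""
    vectors: list[list[float]] = []
    calls = 0
    for batch in batched(docs, batch_size):
        vectors.extend(embed_batch(batch))  # one call per batch, not per document
        calls += 1
    return vectors, calls


def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for a real embedding model."""
    return [[float(len(t))] for t in texts]


docs = [f"doc {i}" for i in range(500)]
vectors, calls = index_in_batches(docs, fake_embed_batch, batch_size=64)
print(len(vectors), calls)  # 500 vectors produced by 8 calls
```

With a real model, the saving comes from amortizing per-call overhead across each batch; the best batch size depends on the model and hardware.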
Working Example
# Install CocoIndex first (run in a shell, not in Python):
#   pip install cocoindex
import cocoindex
from psycopg_pool import ConnectionPool  # connection pool type used by search()


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Read the files in the local "markdown_files" directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
    doc_embeddings = data_scope.add_collector()
    # Process and chunk documents
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500
        )
        # Embed each chunk and collect it for export to Postgres
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])
    # Note: the step that exports doc_embeddings to Postgres is not shown in this summary.


# Perform semantic search
# Note: text_to_embedding is not defined in this summary; it is assumed to embed
# the query with the same model that was used for the chunks.
def search(pool: ConnectionPool, query: str, top_k: int = 5):
    table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
    query_vector = text_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            # <=> is the pgvector cosine distance operator; smaller means more similar
            cur.execute(f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            # Convert cosine distance to a similarity score
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
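In the query above, pgvector's `<=>` operator returns cosine distance, and the code reports `1.0 - distance` as the score, which is cosine similarity. The following is a minimal pure-Python sketch of the same ranking logic. It uses made-up three-dimensional vectors rather than real embeddings, and `cosine_distance` and `rank` are illustrative helpers, not part of CocoIndex or pgvector.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query_vector: list[float], rows: list[tuple[str, list[float]]], top_k: int = 5):
    """Order rows by ascending distance, then report similarity as the score."""
    scored = [(filename, cosine_distance(query_vector, vec)) for filename, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # same effect as ORDER BY distance
    return [{"filename": f, "score": 1.0 - d} for f, d in scored[:top_k]]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
rows = [
    ("a.md", [1.0, 0.0, 0.0]),
    ("b.md", [0.0, 1.0, 0.0]),
    ("c.md", [1.0, 1.0, 0.0]),
]
results = rank([1.0, 0.0, 0.0], rows, top_k=2)
for r in results:
    print(r["filename"], round(r["score"], 3))
# a.md 1.0
# c.md 0.707
```

A score of 1.0 means the vectors point in the same direction, and a score near 0 means they are unrelated, so higher scores rank first even though the SQL sorts by ascending distance.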
Practical Applications
- Use Case: Documentation search with 500+ markdown files using semantic embeddings
- Pitfall: Choosing embedding dimensions without balancing accuracy and performance (e.g., 384 dimensions vs. higher-dimensional models)
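The dimension trade-off can be made concrete with a rough storage estimate. The sketch below assumes 4-byte floats per vector component and ignores index and row overhead. The chunk count of 5,000 is a made-up figure for illustration, not a number from the original article; 384 is the output size of all-MiniLM-L6-v2, while 768 and 1536 stand in for larger models.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is a 4-byte float


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw storage for the vectors alone, excluding any index or row overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


num_chunks = 5_000  # hypothetical number of chunks
for dims in (384, 768, 1536):
    mib = raw_vector_bytes(num_chunks, dims) / (1024 * 1024)
    print(f"{dims:>5} dims: {mib:.1f} MiB")
```

Storage grows linearly with dimension, and so does the cost of each brute-force distance comparison, so a larger model should earn its place through measurably better results on the actual documents.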