Implementing RAG: Solving LLM Hallucinations with Retrieval Augmented Generation

The Complete RAG Pipeline

Retrieval Augmented Generation (RAG) gives an LLM access to external documents at query time, which reduces factual fabrication. Because answers are grounded in retrieved passages, the system can cite exact sources rather than relying solely on static training data.
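
The retrieve-then-prompt idea can be sketched with the standard library alone. This is a toy illustration, not the pipeline used later in the article: it swaps embeddings for plain word-count cosine similarity, and the function names (tokenize, cosine, retrieve, build_prompt) and the sample passages are all hypothetical.

```python
import re
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: Dict[str, str], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k (source, passage) pairs most similar to the query."""
    q = tokenize(query)
    ranked = sorted(
        passages.items(),
        key=lambda item: cosine(q, tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, hits: List[Tuple[str, str]]) -> str:
    """Place retrieved passages, tagged with their sources, ahead of the question."""
    context = "\n".join(f"[{source}] {passage}" for source, passage in hits)
    return (
        "Answer using only the context below and cite the source in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical passages, invented purely for illustration.
passages = {
    "leave_policy": "Employees accrue two days of paid leave per month.",
    "expense_policy": "Travel expenses must be submitted within 30 days.",
    "security_policy": "Laptops must use full disk encryption.",
}
question = "How many days of paid leave do employees accrue?"
hits = retrieve(question, passages, k=1)
prompt = build_prompt(question, hits)
```

The resulting prompt string is what would be sent to the model. A real system would replace the word-count scoring with embedding similarity, but the shape of the flow (score, select, assemble context with source tags) stays the same.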

Why This Matters

Standard LLMs generate text from what they learned during training, so they produce confident but incorrect ‘hallucinations’ when asked about internal or recent information, such as company policies, that the training data never covered. Fine-tuning updates model weights and suits changes in style and behavior, but it is expensive and cannot easily cite sources. RAG addresses both problems by treating the model as a reasoning engine over a knowledge base that can be updated at any time without retraining.

Key Insights

RAG vs Fine-Tuning: Use fine-tuning for behavior/style changes and RAG for factual knowledge and frequently changing data.
Chunking Strategy: Paragraph-aware chunking generally preserves semantic units better than fixed-size splitting, with recommended sizes of 300-600 characters.
Vector Indexing: The process involves splitting documents into chunks, converting them into embeddings (e.g., using all-MiniLM-L6-v2), and storing them in a vector database like ChromaDB.
Evaluation Frameworks: Production RAG quality can be measured with frameworks such as RAGAS, which automatically scores faithfulness, answer relevancy, and context precision.
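
The paragraph-aware chunking mentioned under Chunking Strategy might look like the following minimal sketch. It assumes paragraphs are separated by blank lines and uses a default limit of 500 characters, inside the 300-600 range suggested above. The function name chunk_by_paragraphs and the sample text are hypothetical.

```python
import re
from typing import List


def chunk_by_paragraphs(text: str, max_chunk_size: int = 500) -> List[str]:
    """Merge whole paragraphs into chunks of at most max_chunk_size characters."""
    # Treat one or more blank lines as a paragraph boundary (an assumption
    # about the input format).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A paragraph longer than the limit is kept whole rather than cut.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, invented for illustration.
sample = (
    "First paragraph about refunds.\n\n"
    "Second paragraph about shipping.\n\n"
    "Third paragraph about returns."
)
chunks = chunk_by_paragraphs(sample, max_chunk_size=70)
```

With a 70-character limit the first two paragraphs fit together and the third starts a new chunk, so no paragraph is split mid-thought. Oversized paragraphs are deliberately left intact here; a production splitter would likely fall back to sentence or character splitting for those.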

Working Examples

Sentence-aware chunking implementation to preserve semantic boundaries.

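import re
from typing import List

def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
    """Group whole sentences into chunks of at most max_chunk_size characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Build the candidate first so the joining space counts toward the limit.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            # Flush the full chunk. A single sentence longer than the limit
            # still becomes its own (oversized) chunk.
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks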

Implementing a full RAG pipeline using LangChain abstractions.

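# Import paths vary between LangChain releases; adjust them to your installed version.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from langchain_core.documents import Document
from transformers import pipeline as hf_pipeline

# Placeholder knowledge base (source name -> text); replace with real documents.
knowledge_base = {'example_source': 'Replace this placeholder text with your document content.'}

# Wrap each text in a Document so the source name travels with every chunk.
docs = [Document(page_content=content, metadata={'source': name}) for name, content in knowledge_base.items()]
# Prefer paragraph, line, and sentence boundaries before falling back to characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50, separators=['\n\n', '\n', '. ', ' ', ''])
chunks = splitter.split_documents(docs)
# Embed the chunks and index them in a Chroma vector store.
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vectorstore = Chroma.from_documents(chunks, embeddings)
# Retrieve the 3 most similar chunks for each query.
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
# Local generator model, wrapped so LangChain can call it.
gen_pipe = hf_pipeline('text2text-generation', model='google/flan-t5-base', max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=gen_pipe)
# 'stuff' places all retrieved chunks into one prompt; source documents are returned for citation.
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type='stuff', return_source_documents=True)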

Practical Applications

References:

https://dev.to/yakhilesh/98-rag-give-your-ai-access-to-your-documents-f3b
