Codexity Part 6: Small Model Inference with llama-cpp-python
The context is built. Ten chunks of web content, tagged with source numbers, waiting to be synthesized into a coherent answer. The synthesizer needs to read that context, understand the user’s question, write a clear response, and cite its sources correctly.
GPT-4 does this trivially. A 7B model needs more coaxing.
Choosing a Model
Four models work well for Codexity. Each has a different strength:
Qwen2.5-7B-Instruct produces the highest quality answers. Citation accuracy is consistently above 90%. Longer answers tend to be well-structured with clear paragraphs. Downside: slightly slower than Mistral at the same quantization level.
Mistral-7B-Instruct-v0.3 generates faster and handles shorter, factual queries well. Citation accuracy drops to around 80% for complex multi-source answers. Good for a speed-focused setup.
Phi-3.5-mini (3.8B) runs on 4GB of RAM. Quality is noticeably lower than the 7B models, but for simple factual queries, the difference is small. Use this if you are running on a laptop with limited memory.
Llama-3.1-8B-Instruct sits between Qwen and Mistral. Handles conversational queries well. Slightly worse at structured output.
For this series, we use Qwen2.5-7B. Download the GGUF file:
mkdir -p models
# Download Q4_K_M quantization (~4.5GB)
wget -O models/qwen2.5-7b-instruct-q4_k_m.gguf \
"https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"
Q4_K_M is the sweet spot for quality vs. size. Q5_K_M gains marginal quality at 20% more memory usage. Q3 and below degrade noticeably on citation tasks.
The Synthesizer
# synthesizer.py
from llm_client import generate_streaming
from models import SourceReference

SYSTEM_PROMPT = """You are a search assistant that answers questions using provided sources.
Rules:
- Base your answer ONLY on the provided sources
- Cite sources using [1], [2], etc. matching the source numbers
- Every factual claim must have a citation
- If sources disagree, mention both perspectives with their citations
- Write clear, direct paragraphs
- Do not make up information not in the sources
- If the sources do not contain enough information, say so"""


def build_synthesis_prompt(query: str, context: str, sources: list[SourceReference]) -> str:
    source_list = "\n".join(
        f"[{s.index}] {s.title} ({s.url})" for s in sources
    )
    return f"""Sources:
{source_list}
Context:
{context}
Question: {query}
Answer:"""


async def synthesize(
    query: str,
    context: str,
    sources: list[SourceReference],
):
    """Generate an answer with citations, yielding tokens as they stream."""
    prompt = build_synthesis_prompt(query, context, sources)
    async for token in generate_streaming(
        prompt=prompt,
        system=SYSTEM_PROMPT,
        max_tokens=2048,
    ):
        yield token
The prompt structure matters. Sources first (with URLs for reference), then the context chunks, then the question. This ordering gives the model the citation numbers before it encounters the text, making it easier to reference them in the answer.
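To make the ordering concrete, the sketch below renders a prompt for two made-up sources. `SourceReference` here is a stand-in dataclass with only the three fields the synthesizer reads (`index`, `title`, `url`); the real class in `models.py` may carry more. The titles, URLs, and context text are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SourceReference:
    # Stand-in for models.SourceReference (assumed fields only)
    index: int
    title: str
    url: str


def build_synthesis_prompt(query: str, context: str, sources: list[SourceReference]) -> str:
    # Same ordering as the synthesizer: source list, context, question
    source_list = "\n".join(f"[{s.index}] {s.title} ({s.url})" for s in sources)
    return f"Sources:\n{source_list}\nContext:\n{context}\nQuestion: {query}\nAnswer:"


sources = [
    SourceReference(1, "PostgreSQL JSONB guide", "https://example.com/jsonb"),
    SourceReference(2, "MongoDB schema design", "https://example.com/mongo"),
]
context = (
    "[Source 1] JSONB stores documents in a relational table.\n"
    "[Source 2] MongoDB collections need no fixed schema."
)
prompt = build_synthesis_prompt("postgres vs mongo for startups", context, sources)
print(prompt)
```

Printing the prompt during development is a cheap way to confirm that every number in the source list also appears as a label in the context.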
Making Small Models Cite Correctly
Small models struggle with citations. A 7B model told to “cite sources as [1], [2]” will sometimes:
- Invent citations that do not exist ([7] when there are only 5 sources)
- Cite the wrong source for a claim
- Cite at the end of the answer instead of inline
- Skip citations entirely
Three techniques improve citation accuracy significantly.
1. Source Numbers in the Context
The context already includes [Source 1], [Source 2] labels. This anchors the citation numbers in the model’s attention. Without these labels, the model has to track which chunk came from which source, and 7B models lose that mapping beyond 3-4 sources.
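As a reminder of what that labeling looks like, here is a minimal sketch. The `Chunk` type and the `format_context` name are hypothetical illustrations, not the series' actual Part 5 code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical chunk record: the source it came from and its text
    source_index: int
    text: str


def format_context(chunks: list[Chunk]) -> str:
    """Prefix every chunk with its source label so the number travels with the text."""
    return "\n\n".join(f"[Source {c.source_index}] {c.text.strip()}" for c in chunks)


chunks = [
    Chunk(1, "PostgreSQL supports ACID transactions."),
    Chunk(2, "MongoDB stores documents as BSON."),
    Chunk(1, "JSONB columns can be indexed with GIN."),
]
context = format_context(chunks)
print(context)
```

Repeating the label on every chunk, rather than once per source, means the model never has to look backwards to find out where a sentence came from.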
2. Few-Shot Examples (Optional)
For models that still struggle, add a single example to the system prompt:
SYSTEM_PROMPT_WITH_EXAMPLE = """You are a search assistant. Cite sources as [1], [2].
Example:
Question: What is FastAPI?
Answer: FastAPI is a modern Python web framework built on Starlette and Pydantic [1]. It supports async request handling natively and generates OpenAPI documentation automatically [2].
Now answer the following question using the provided sources."""
One example is enough. More than two wastes context tokens.
3. Post-Processing
Even with good prompting, validate citations in the output:
import re


def validate_citations(text: str, max_source: int) -> str:
    """Remove citations that reference non-existent sources."""
    def replace_invalid(match):
        num = int(match.group(1))
        if 1 <= num <= max_source:
            return match.group(0)
        return ""  # Remove invalid citation

    return re.sub(r'\[(\d+)\]', replace_invalid, text)
This catches hallucinated citations like [12] when you only have 6 sources. The function runs on the complete answer after streaming finishes.
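Removing a marker can leave a stray space before the punctuation that followed it. The variant below is one possible extension, not part of the original function: it tidies that spacing and adds a hypothetical `cited_sources` helper for checking which sources the answer actually used.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def strip_invalid_citations(text: str, max_source: int) -> str:
    """Drop [n] markers outside 1..max_source, then tidy leftover spacing."""
    def keep_or_drop(match: re.Match) -> str:
        return match.group(0) if 1 <= int(match.group(1)) <= max_source else ""

    cleaned = CITATION.sub(keep_or_drop, text)
    # A removed marker can leave " ." or a double space behind
    cleaned = re.sub(r"\s+([.,;:])", r"\1", cleaned)
    return re.sub(r" {2,}", " ", cleaned)


def cited_sources(text: str) -> set[int]:
    """Return the set of source numbers that appear as [n] in the text."""
    return {int(n) for n in CITATION.findall(text)}


answer = "FastAPI builds on Starlette [1]. It is the fastest framework [12]."
cleaned = strip_invalid_citations(answer, max_source=6)
print(cleaned)
print(cited_sources(cleaned))
```

Note that this only removes the marker; the uncited claim stays in the text, so a stricter pipeline might log such cases for review.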
Running llama-cpp-python
llama-cpp-python wraps llama.cpp, the C++ inference engine for GGUF models. It exposes a Python API and an optional OpenAI-compatible HTTP server.
For Codexity, we use the Python API directly (no HTTP server). The model loads into memory once and stays resident.
# llm_client.py (complete version)
import asyncio

from llama_cpp import Llama

from config import settings

_llm: Llama | None = None


def get_llm() -> Llama:
    """Load the model on first use and reuse it afterwards."""
    global _llm
    if _llm is None:
        _llm = Llama(
            model_path=settings.model_path,
            n_ctx=settings.context_length,
            n_threads=4,
            n_gpu_layers=0,  # Set to -1 for full GPU offload
            verbose=False,
        )
    return _llm


def generate(prompt: str, max_tokens: int = 512, temperature: float = 0.1) -> str:
    llm = get_llm()
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]


async def generate_streaming(prompt: str, system: str = "", max_tokens: int = 2048):
    llm = get_llm()
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    # The iterator is synchronous: each step blocks until the next token is ready
    for chunk in llm.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.3,
        stream=True,
        top_p=0.9,
        repeat_penalty=1.1,
    ):
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
        # Yield control to the event loop between tokens
        await asyncio.sleep(0)
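The await asyncio.sleep(0) after each token is important. llama.cpp runs synchronously in C++, so each step of the iterator blocks the event loop while the next token is computed. Without the sleep, nothing in the loop is guaranteed to suspend, so the event loop can stay blocked for the entire generation and SSE events cannot be pushed to the client. The zero-length sleep hands control back between tokens, giving the event loop a chance to flush buffered events.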
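The zero-sleep keeps the loop responsive between tokens, but the loop is still blocked while each individual token is computed. One possible alternative, sketched here under assumptions, moves the blocking iterator onto a worker thread and forwards items through an `asyncio.Queue`. The `stream_in_thread` helper is hypothetical, and `fake_tokens` stands in for the iterator returned by `create_chat_completion(..., stream=True)`.

```python
import asyncio
import threading
from collections.abc import AsyncIterator, Callable, Iterator

_DONE = object()  # sentinel marking the end of the stream


async def stream_in_thread(make_iter: Callable[[], Iterator[str]]) -> AsyncIterator[str]:
    """Run a blocking iterator in a worker thread and yield its items asynchronously."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for item in make_iter():
                # Queue methods are not thread-safe, so hop back onto the loop
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()  # suspends without blocking the event loop
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item
        yield item


def fake_tokens() -> Iterator[str]:
    # Stand-in for the llama-cpp-python streaming iterator
    yield from ["Postgres ", "supports ", "JSONB [1]."]


async def collect() -> list[str]:
    return [token async for token in stream_in_thread(fake_tokens)]


tokens = asyncio.run(collect())
print(tokens)
```

This sketch assumes only one generation runs at a time against the shared model instance; it does not add locking or cancel the worker if the consumer stops reading early.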
repeat_penalty=1.1 reduces the model’s tendency to repeat phrases. Without it, small models sometimes loop on sentences like “as mentioned in source [1], as mentioned in source [1]…”
GPU Acceleration
If you have an NVIDIA GPU:
# Recent llama-cpp-python releases use GGML_CUDA; older ones used LLAMA_CUDA
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Set n_gpu_layers=-1 in the Llama constructor to offload all layers to the GPU. A 7B Q4 model on an RTX 3060 generates at ~50 tokens/second, compared to ~8 tokens/second on CPU. The difference is dramatic for user experience.
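On Apple Silicon, llama.cpp enables Metal by default when built on macOS, so no extra build flags are needed. You still need to raise n_gpu_layers (for example to -1) for layers to be offloaded.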
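Throughput varies a lot with hardware, quantization, and context length, so it is worth measuring on your own machine. The helper below is a hypothetical sketch that times any token iterator; `fake_stream` stands in for the real streaming call, and counting streamed chunks only approximates the true token count.

```python
import time
from collections.abc import Iterable, Iterator


def tokens_per_second(token_stream: Iterable[str]) -> tuple[int, float]:
    """Consume a token stream and return (token_count, tokens_per_second)."""
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def fake_stream() -> Iterator[str]:
    # Stand-in for a real generation; sleeps simulate per-token latency
    for token in ["PostgreSQL ", "excels ", "here."]:
        time.sleep(0.01)
        yield token


count, rate = tokens_per_second(fake_stream())
print(f"{count} tokens at {rate:.1f} tokens/second")
```

Run it once with n_gpu_layers=0 and once with n_gpu_layers=-1 on the same prompt to compare CPU and GPU generation speed.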
Memory Usage
| Model | Quantization | RAM Usage |
|---|---|---|
| Qwen2.5-7B | Q4_K_M | ~4.5 GB |
| Qwen2.5-7B | Q5_K_M | ~5.5 GB |
| Phi-3.5-mini | Q4_K_M | ~2.5 GB |
| Llama-3.1-8B | Q4_K_M | ~5.0 GB |
Add 1-2 GB for the context window (KV cache). A machine with 8 GB of RAM can run Phi-3.5. For the 7B models, 16 GB is comfortable.
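The KV cache allowance can be estimated from the model's shape. The sketch below assumes 16-bit cache entries and uses what I believe are the Llama-3.1-8B dimensions (32 layers, 8 KV heads, head dimension 128); treat those numbers as assumptions and check the model's own config before relying on them.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for one sequence of n_ctx tokens."""
    # The leading 2 counts one K tensor and one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Assumed Llama-3.1-8B shape with an 8192-token context window
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

The cache grows linearly with n_ctx, so halving the context length halves this part of the memory budget.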
Plugging Into the Pipeline
from synthesizer import synthesize
from models import SearchEvent


async def search_pipeline(query: str):
    # ... Phase 1-4 ...

    # Phase 5: Synthesize
    yield SearchEvent(event="status", data={"step": "generating"})
    full_answer = ""
    async for token in synthesize(query, context, sources):
        yield SearchEvent(event="token", data={"text": token})
        full_answer += token

    # Send the final source list once the answer is complete
    yield SearchEvent(
        event="answer_complete",
        data={
            "sources": [
                {"index": s.index, "title": s.title, "url": s.url}
                for s in sources
            ],
        },
    )
    yield SearchEvent(event="done", data={})
Each token becomes one SSE event, so the client sees the answer appear token by token (usually a word or word fragment at a time), much like ChatGPT or Perplexity.
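For a sense of what goes over the wire, here is a minimal sketch of serializing one event in `text/event-stream` framing. `SearchEvent` is a stand-in dataclass with assumed fields and `to_sse` is a hypothetical helper; the series' own SSE layer may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SearchEvent:
    # Stand-in for models.SearchEvent (assumed fields only)
    event: str
    data: dict = field(default_factory=dict)


def to_sse(ev: SearchEvent) -> str:
    """Serialize one event: an event line, a data line, then a blank line."""
    # json.dumps escapes newlines, so the payload always fits on one data line
    return f"event: {ev.event}\ndata: {json.dumps(ev.data)}\n\n"


frame = to_sse(SearchEvent(event="token", data={"text": "Postgre"}))
print(frame, end="")
```

The blank line at the end of each frame is what tells the browser's EventSource that the event is complete.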
Output Quality
With Qwen2.5-7B and well-processed context, the output for “postgres vs mongo for startups” looks like:
PostgreSQL and MongoDB serve different architectural needs, and the choice depends on your data model.
PostgreSQL excels when your data has clear relationships. It supports ACID transactions, complex joins, and its JSONB column type provides document-store capabilities without sacrificing relational features [1]. For startups that expect their schema to stabilize, PostgreSQL avoids the technical debt of eventual schema migrations [3].
MongoDB offers faster initial development when your schema is still evolving. Document storage maps naturally to application objects, and schema changes require no migrations [2]. The trade-off is weaker transaction support and the need for manual data consistency in complex operations [3].
For most startups in 2026, PostgreSQL with JSONB provides flexibility comparable to MongoDB while keeping relational capabilities available when needed [1][3].
Inline citations. Multiple perspectives. A clear recommendation. From a model that fits in 4.5 GB.
What Comes Next
Part 7 implements the SSE layer properly. The token events work, but we need error handling, connection management, heartbeats, and proper HTTP headers. We also need to handle clients that disconnect mid-stream without crashing the pipeline.