Codexity Part 6: Small Model Inference with llama-cpp-python

The context is built. Ten chunks of web content, tagged with source numbers, waiting to be synthesized into a coherent answer. The synthesizer needs to read that context, understand the user’s question, write a clear response, and cite its sources correctly.

GPT-4 does this trivially. A 7B model needs more coaxing.

[Figure: Small Model Inference Stack]

Choosing a Model

Four models work well for Codexity. Each has a different strength:

Qwen2.5-7B-Instruct produces the highest quality answers. Citation accuracy is consistently above 90%. Longer answers tend to be well-structured with clear paragraphs. Downside: slightly slower than Mistral at the same quantization level.

Mistral-7B-Instruct-v0.3 generates faster and handles shorter, factual queries well. Citation accuracy drops to around 80% for complex multi-source answers. Good for a speed-focused setup.

Phi-3.5-mini (3.8B) runs on 4GB of RAM. Quality is noticeably lower than the 7B models, but for simple factual queries, the difference is small. Use this if you are running on a laptop with limited memory.

Llama-3.1-8B-Instruct sits between Qwen and Mistral. Handles conversational queries well. Slightly worse at structured output.

For this series, we use Qwen2.5-7B. Download the GGUF file:

mkdir -p models
# Download Q4_K_M quantization (~4.5GB)
wget -O models/qwen2.5-7b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"

Q4_K_M is the sweet spot for quality vs. size. Q5_K_M gains marginal quality at 20% more memory usage. Q3 and below degrade noticeably on citation tasks.

The Synthesizer

# synthesizer.py
from llm_client import generate_streaming
from models import SourceReference

SYSTEM_PROMPT = """You are a search assistant that answers questions using provided sources.

Rules:
- Base your answer ONLY on the provided sources
- Cite sources using [1], [2], etc. matching the source numbers
- Every factual claim must have a citation
- If sources disagree, mention both perspectives with their citations
- Write clear, direct paragraphs
- Do not make up information not in the sources
- If the sources do not contain enough information, say so"""

def build_synthesis_prompt(query: str, context: str, sources: list[SourceReference]) -> str:
    source_list = "\n".join(
        f"[{s.index}] {s.title} ({s.url})" for s in sources
    )

    return f"""Sources:
{source_list}

Context:
{context}

Question: {query}

Answer:"""

async def synthesize(
    query: str,
    context: str,
    sources: list[SourceReference],
):
    """Generate an answer with citations, yielding tokens as they stream."""
    prompt = build_synthesis_prompt(query, context, sources)

    async for token in generate_streaming(
        prompt=prompt,
        system=SYSTEM_PROMPT,
        max_tokens=2048,
    ):
        yield token

The prompt structure matters. Sources first (with URLs for reference), then the context chunks, then the question. This ordering gives the model the citation numbers before it encounters the text, making it easier to reference them in the answer.
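To make the ordering concrete, here is a small sketch that renders the prompt for two made-up sources. The SourceReference dataclass below is a stand-in that assumes only the three fields the prompt builder uses (index, title, url); the titles and example.com URLs are placeholders, not real sources.

```python
from dataclasses import dataclass


@dataclass
class SourceReference:
    """Stand-in for models.SourceReference (assumed fields only)."""
    index: int
    title: str
    url: str


def build_synthesis_prompt(query: str, context: str, sources: list[SourceReference]) -> str:
    # Same layout as the synthesizer: source list, then context, then question.
    source_list = "\n".join(f"[{s.index}] {s.title} ({s.url})" for s in sources)
    return (
        f"Sources:\n{source_list}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


sources = [
    SourceReference(1, "PostgreSQL JSONB notes", "https://example.com/pg"),
    SourceReference(2, "MongoDB schema notes", "https://example.com/mongo"),
]
context = "[Source 1]\nJSONB stores documents.\n\n[Source 2]\nSchemas are flexible."
prompt = build_synthesis_prompt("postgres vs mongo for startups", context, sources)
print(prompt)
```

The numbered source list is the first thing the model reads, and the prompt ends on "Answer:" so generation starts directly with the response.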

Making Small Models Cite Correctly

Small models struggle with citations. A 7B model told to “cite sources as [1], [2]” will sometimes:

  • Invent citations that do not exist ([7] when there are only 5 sources)
  • Cite the wrong source for a claim
  • Cite at the end of the answer instead of inline
  • Skip citations entirely

Three techniques improve citation accuracy significantly.

1. Source Numbers in the Context

The context already includes [Source 1], [Source 2] labels. This anchors the citation numbers in the model’s attention. Without these labels, the model has to track which chunk came from which source, and 7B models lose that mapping beyond 3-4 sources.
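As an illustration of the labeling, a context builder might look like the sketch below. The Chunk type and the build_context name are hypothetical; the actual context builder from the earlier parts of this series may differ in structure and naming.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record: which source it came from, and its text."""
    source_index: int
    text: str


def build_context(chunks: list[Chunk]) -> str:
    # Prefix every chunk with its source label so the number travels with the text.
    return "\n\n".join(f"[Source {c.source_index}]\n{c.text}" for c in chunks)


chunks = [
    Chunk(1, "PostgreSQL supports JSONB."),
    Chunk(2, "MongoDB stores documents."),
]
context = build_context(chunks)
print(context)
```

Because each chunk carries its own label, the model does not have to infer which source a passage belongs to from its position in the prompt.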

2. Few-Shot Examples (Optional)

For models that still struggle, add a single example to the system prompt:

SYSTEM_PROMPT_WITH_EXAMPLE = """You are a search assistant. Cite sources as [1], [2].

Example:
Question: What is FastAPI?
Answer: FastAPI is a modern Python web framework built on Starlette and Pydantic [1]. It supports async request handling natively and generates OpenAPI documentation automatically [2].

Now answer the following question using the provided sources."""

One example is enough. Additional examples mostly consume context tokens that are better spent on source material.

3. Post-Processing

Even with good prompting, validate citations in the output:

import re

def validate_citations(text: str, max_source: int) -> str:
    """Remove citations that reference non-existent sources."""
    def replace_invalid(match):
        num = int(match.group(1))
        if 1 <= num <= max_source:
            return match.group(0)
        return ""  # Remove invalid citation

    return re.sub(r'\[(\d+)\]', replace_invalid, text)

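This catches hallucinated citations like [12] when you only have 6 sources. The function runs on the complete answer after streaming finishes. Tokens already streamed to the client are not changed by it, so the cleaned text has to be delivered separately, for example with the final event.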
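Removing a citation leaves its surrounding whitespace behind, so a stripped "[12]" before a period produces " .". The sketch below repeats the validator and adds a small cleanup pass; tidy_spacing is a hypothetical helper, not part of the original code.

```python
import re


def validate_citations(text: str, max_source: int) -> str:
    """Remove citations that reference non-existent sources."""
    def replace_invalid(match: re.Match) -> str:
        num = int(match.group(1))
        return match.group(0) if 1 <= num <= max_source else ""

    return re.sub(r"\[(\d+)\]", replace_invalid, text)


def tidy_spacing(text: str) -> str:
    """Hypothetical cleanup for gaps left behind by removed citations."""
    text = re.sub(r"[ \t]+([.,;:!?])", r"\1", text)  # " ." becomes "."
    return re.sub(r"[ \t]{2,}", " ", text)           # collapse doubled spaces


raw = "PostgreSQL supports JSONB [1]. MongoDB needs no migrations [12]."
cleaned = tidy_spacing(validate_citations(raw, max_source=6))
print(cleaned)  # PostgreSQL supports JSONB [1]. MongoDB needs no migrations.
```

Note that this only removes out-of-range numbers. A citation that points at a real but wrong source passes validation unchanged.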

Running llama-cpp-python

llama-cpp-python wraps llama.cpp, the C++ inference engine for GGUF models. It exposes a Python API and an optional OpenAI-compatible HTTP server.

For Codexity, we use the Python API directly (no HTTP server). The model loads into memory once and stays resident.

# llm_client.py (complete version)
import asyncio
from llama_cpp import Llama

from config import settings

_llm: Llama | None = None

def get_llm() -> Llama:
    global _llm
    if _llm is None:
        _llm = Llama(
            model_path=settings.model_path,
            n_ctx=settings.context_length,
            n_threads=4,
            n_gpu_layers=0,  # Set to -1 for full GPU offload
            verbose=False,
        )
    return _llm

def generate(prompt: str, max_tokens: int = 512, temperature: float = 0.1) -> str:
    llm = get_llm()
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]

async def generate_streaming(prompt: str, system: str = "", max_tokens: int = 2048):
    llm = get_llm()
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    for chunk in llm.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.3,
        stream=True,
        top_p=0.9,
        repeat_penalty=1.1,
    ):
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
            # Yield control to the event loop between tokens
            await asyncio.sleep(0)

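The await asyncio.sleep(0) after each token is important. llama.cpp runs synchronously in C++, so every step of the iterator blocks the event loop while the next token is computed. Without the sleep, nothing in this generator hands control back to the event loop between tokens, and SSE events may not be pushed to the client until generation ends. The zero-sleep yield gives the event loop a chance to flush buffered events after every token. The loop is still blocked while each individual token is being computed, so other requests only make progress between tokens.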
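If blocking the loop for the duration of each token becomes a problem, one alternative is to run the synchronous iterator in a worker thread and hand tokens to the event loop through a queue. This is a minimal sketch of that pattern, not what Codexity does: stream_from_thread and fake_tokens are hypothetical names, and fake_tokens stands in for the blocking llm.create_chat_completion(..., stream=True) iterator so the example runs without a model.

```python
import asyncio
import threading
import time
from typing import AsyncIterator, Callable, Iterator

_DONE = object()  # sentinel marking the end of the stream


async def stream_from_thread(make_iter: Callable[[], Iterator[str]]) -> AsyncIterator[str]:
    """Run a blocking iterator in a thread and yield its items asynchronously."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for item in make_iter():
                # asyncio.Queue is not thread-safe; schedule the put on the loop.
                loop.call_soon_threadsafe(queue.put_nowait, item)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()  # the event loop stays free while waiting
        if item is _DONE:
            break
        yield item


def fake_tokens() -> Iterator[str]:
    """Stand-in for a blocking token stream."""
    for tok in ["Postgres ", "fits ", "[1]."]:
        time.sleep(0.01)  # simulates per-token compute time
        yield tok


async def collect() -> str:
    return "".join([tok async for tok in stream_from_thread(fake_tokens)])


result = asyncio.run(collect())
print(result)
```

The sketch does not propagate exceptions from the worker or stop the worker when the consumer goes away; both would need handling in a real pipeline.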

repeat_penalty=1.1 reduces the model’s tendency to repeat phrases. Without it, small models sometimes loop on sentences like “as mentioned in source [1], as mentioned in source [1]…”

GPU Acceleration

If you have an NVIDIA GPU:

# Recent llama-cpp-python releases use GGML_CUDA; older releases used -DLLAMA_CUDA=on
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Set n_gpu_layers=-1 in the Llama constructor to offload all layers to the GPU. A 7B Q4 model on an RTX 3060 generates at ~50 tokens/second, compared to ~8 tokens/second on CPU. The difference is dramatic for user experience.

On Apple Silicon, llama.cpp builds with Metal support by default, so no extra build flags are needed. Layers are still offloaded only when n_gpu_layers is non-zero.

Memory Usage

Model          Quantization   RAM Usage
Qwen2.5-7B     Q4_K_M         ~4.5 GB
Qwen2.5-7B     Q5_K_M         ~5.5 GB
Phi-3.5-mini   Q4_K_M         ~2.5 GB
Llama-3.1-8B   Q4_K_M         ~5.0 GB

Add 1-2 GB for the context window (KV cache). A machine with 8 GB of RAM can run Phi-3.5. For the 7B models, 16 GB is comfortable.
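The KV cache allowance can be estimated from the model architecture. The sketch below uses the standard size formula: two tensors (keys and values) per layer, each holding n_kv_heads × head_dim values per context position. The architecture numbers are assumptions taken from the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128), and 2 bytes per value assumes a 16-bit cache; check them against the model you actually load.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_value: int = 2,  # assumes a 16-bit (f16) KV cache
) -> int:
    """Estimate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Assumed Llama-3.1-8B architecture values, with an 8192-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

The estimate scales linearly with n_ctx, so doubling the context to 16384 tokens doubles the cache to about 2 GiB under the same assumptions.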

Plugging Into the Pipeline

from synthesizer import synthesize
from models import SearchEvent

async def search_pipeline(query: str):
    # ... Phase 1-4 ...

    # Phase 5: Synthesize
    yield SearchEvent(event="status", data={"step": "generating"})

    full_answer = ""
    async for token in synthesize(query, context, sources):
        yield SearchEvent(event="token", data={"text": token})
        full_answer += token

    # Send final sources with the complete answer
    yield SearchEvent(
        event="answer_complete",
        data={
            "sources": [
                {"index": s.index, "title": s.title, "url": s.url}
                for s in sources
            ],
        },
    )
    yield SearchEvent(event="done", data={})

Each token generates an SSE event. The client receives the answer token by token, exactly like ChatGPT or Perplexity.
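The pipeline accumulates full_answer, which is the natural place to apply citation validation once streaming ends. The sketch below shows one way to wire that in, under stated assumptions: SearchEvent is a stand-in dataclass, fake_synthesize replaces the real synthesizer so the example runs without a model, and the "answer" field on the final event is a hypothetical addition that the original event payload does not have.

```python
import asyncio
import re
from dataclasses import dataclass, field


@dataclass
class SearchEvent:
    """Stand-in for models.SearchEvent (assumed fields only)."""
    event: str
    data: dict = field(default_factory=dict)


def validate_citations(text: str, max_source: int) -> str:
    """Remove citations that reference non-existent sources."""
    return re.sub(
        r"\[(\d+)\]",
        lambda m: m.group(0) if 1 <= int(m.group(1)) <= max_source else "",
        text,
    )


async def fake_synthesize():
    """Stand-in token stream containing one hallucinated citation, [9]."""
    for token in ["Postgres ", "supports ", "JSONB ", "[1]", "[9]", "."]:
        yield token


async def synthesize_phase(num_sources: int):
    full_answer = ""
    async for token in fake_synthesize():
        yield SearchEvent("token", {"text": token})
        full_answer += token
    # Hypothetical "answer" field: the validated text replaces the streamed one.
    cleaned = validate_citations(full_answer, num_sources)
    yield SearchEvent("answer_complete", {"answer": cleaned})


async def main() -> list[SearchEvent]:
    return [event async for event in synthesize_phase(num_sources=3)]


events = asyncio.run(main())
print(events[-1].data["answer"])  # Postgres supports JSONB [1].
```

With this shape, the client shows tokens as they arrive and swaps in the validated text when the final event lands.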

Output Quality

With Qwen2.5-7B and well-processed context, the output for “postgres vs mongo for startups” looks like:

PostgreSQL and MongoDB serve different architectural needs, and the choice depends on your data model.

PostgreSQL excels when your data has clear relationships. It supports ACID transactions, complex joins, and its JSONB column type provides document-store capabilities without sacrificing relational features [1]. For startups that expect their schema to stabilize, PostgreSQL avoids the technical debt of eventual schema migrations [3].

MongoDB offers faster initial development when your schema is still evolving. Document storage maps naturally to application objects, and schema changes require no migrations [2]. The trade-off is weaker transaction support and the need for manual data consistency in complex operations [3].

For most startups in 2026, PostgreSQL with JSONB provides flexibility comparable to MongoDB while keeping relational capabilities available when needed [1][3].

Inline citations. Multiple perspectives. A clear recommendation. From a model that fits in 4.5 GB.

What Comes Next

Part 7 implements the SSE layer properly. The token events work, but we need error handling, connection management, heartbeats, and proper HTTP headers. We also need to handle clients that disconnect mid-stream without crashing the pipeline.
