NadirClaw: Building Cost-Aware LLM Routing with Local Prompt Classification
These articles are AI-generated summaries. Please check the original sources for full details.
How to Build a Cost-Aware LLM Routing System with NadirClaw Using Local Prompt Classification and Gemini Model Switching
NadirClaw implements an intelligent routing layer that classifies prompts locally before sending them to the most suitable model tier. By utilizing centroid vectors and a local encoder, the system avoids unnecessary high-cost model calls for simple tasks. In live tests, this configuration demonstrated significant cost savings compared to an always-Pro model baseline.
Why This Matters
The technical reality of deploying LLMs involves a constant trade-off between reasoning capability and operational cost. Many production systems default to high-parameter models for every request, which leads to significant financial waste on low-complexity tasks like basic formatting or simple arithmetic. NadirClaw addresses this by introducing a local classification step that ensures only high-complexity reasoning tasks consume expensive ‘Pro’ tier tokens.
By moving the routing decision to a local proxy, developers can maintain a single ‘auto’ model endpoint while benefiting from the speed of lightweight models and the depth of larger ones. This architectural pattern is essential for scaling agentic systems where thousands of intermediate steps may not require full reasoning capabilities, thereby optimizing both latency and budget.
Key Insights
- Local classification via NadirClaw CLI uses JSON output to return routing tier, score, and confidence without making live LLM calls.
- The system utilizes the all-MiniLM-L6-v2 encoder from Sentence-Transformers to generate embeddings for local similarity checks.
- Routing decisions are determined by comparing prompt embeddings against simple_centroid.npy and complex_centroid.npy vectors.
- A default confidence threshold of 0.06 is applied; prompts falling below this threshold are automatically escalated to the complex tier.
- NadirClaw supports modifier-marker scans, identifying ‘agentic’, ‘reasoning’, or ‘vision’ requests based on text markers or request shape.
- Live routing through a local proxy server allows for OpenAI-compatible requests to be dynamically mapped to models like gemini-2.5-flash and gemini-2.5-pro.
Working Examples
Function to locally classify prompts into tiers using the NadirClaw CLI.
import subprocess, json
def classify(prompt: str) -> dict:
r = subprocess.run(
["nadirclaw", "classify", "--format", "json", prompt],
capture_output=True, text=True, timeout=180,
)
if r.returncode != 0:
return {"prompt": prompt, "error": (r.stderr or r.stdout).strip()}
return json.loads(r.stdout.strip())
prompts = ["What is 2+2?", "Refactor the auth module to use dependency injection"]
results = [classify(p) for p in prompts]
Starting the NadirClaw proxy server to handle live model routing.
import os, subprocess
PORT = 8856
env = os.environ.copy()
env.update({
"GEMINI_API_KEY": "YOUR_KEY_HERE",
"NADIRCLAW_SIMPLE_MODEL": "gemini-2.5-flash",
"NADIRCLAW_COMPLEX_MODEL": "gemini-2.5-pro",
"NADIRCLAW_PORT": str(PORT),
})
server_proc = subprocess.Popen(
["nadirclaw", "serve", "--verbose"],
env=env
)
Practical Applications
- Enterprise Chatbots: Routing basic FAQs to Gemini Flash while reserving Gemini Pro for complex architectural or legal inquiries. Pitfall: Using an overly high confidence threshold may cause complex edge cases to fail on simple models.
- Coding Assistants: Detecting ‘agentic’ markers in prompts to ensure code execution tasks are always routed to high-reasoning models. Pitfall: Incorrectly configured environment variables can lead to proxy startup failures, defaulting to a single model.
- Cost Monitoring: Utilizing ‘nadirclaw report’ to analyze JSONL request logs and estimate savings against a fixed-model baseline. Pitfall: Ignoring the request shape (e.g., vision/tools) might bypass the intended routing logic.
References:
Continue reading
Next article
Securing Autonomous Agents: Lessons from a 26/100 Security Audit
Related Content
Google DeepMind Unveils Gemini-Powered AI Mouse Pointer for Context-Aware Computing
Google DeepMind introduces an AI-enabled mouse pointer powered by Gemini that captures visual and semantic context directly at the cursor for streamlined workflows.
Building Persistent Agent-Native Memory with Memori and OpenAI
Learn to implement Memori's agent-native infrastructure to enable persistent context across multi-user sessions in LLM applications using Python and OpenAI.
Top 10 AI Coding Agents of 2026: Claude Code and GPT-5.5 Lead Benchmark Shift
Claude Code leads with 87.6% on SWE-bench Verified while OpenAI pivots to SWE-bench Pro following findings that 59.4% of legacy tasks are flawed or contaminated.