Understanding LLM API Architecture: Request Patterns, Tokenization, and Cost Optimization

These articles are AI-generated summaries. Please check the original sources for full details.

An LLM API call, in 4 GIFs

Jasmin Virdi introduces the ‘Building TinyAgent’ series to demystify raw LLM API calls in Node.js. The series shows that LLM APIs are stateless: the client must resend the entire message history on every turn.
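
Because the API keeps no memory between calls, the caller owns the conversation. A minimal sketch of that bookkeeping follows. The `ChatMessage` shape and both helper names are illustrative assumptions, not code from the series, and the actual HTTP request is left out.

```typescript
// Illustrative message shape; real APIs differ in field names and allowed roles.
type ChatMessage = { role: "user" | "assistant"; content: string };

// Returns the full list to send as the next request's messages:
// everything said so far plus the new user prompt.
function withUserTurn(past: readonly ChatMessage[], userText: string): ChatMessage[] {
  return [...past, { role: "user", content: userText }];
}

// Records the model's reply so the following request can include it.
function withAssistantReply(past: readonly ChatMessage[], reply: string): ChatMessage[] {
  return [...past, { role: "assistant", content: reply }];
}

// Two turns: the second request carries three messages, not one.
let conversation: ChatMessage[] = [];
conversation = withUserTurn(conversation, "Hi");                   // request 1 sends 1 message
conversation = withAssistantReply(conversation, "Hello!");
conversation = withUserTurn(conversation, "What did I just say?"); // request 2 sends 3 messages
```

The helpers return new arrays rather than mutating in place, which makes it easy to log exactly what was sent on each request.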

Why This Matters

Developers often rely on SDKs that abstract away the raw request. That can lead to production bugs when code ignores stop_reason, and to surprise bills when usage metrics go unlogged. Output tokens cost significantly more than input tokens, and reasoning models bill their internal ‘thinking’ as output, so unlogged usage can add up quickly: potentially $600/month for a single small feature making 100k calls daily.
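
The $600 figure implies a very small per-call cost: 100,000 calls/day × 30 days = 3,000,000 calls/month, so $600/month works out to $0.0002 per call. The sketch below shows one way such a number could arise. The prices ($1 and $5 per million tokens) and the per-call token counts are hypothetical values picked for illustration, not figures from the series or from any provider's price list.

```typescript
// Prices in USD per million tokens. HYPOTHETICAL placeholders, not real pricing.
interface Pricing { inputPerMTok: number; outputPerMTok: number }

// Average tokens per call. For reasoning models, internal "thinking"
// tokens are billed as output, so they belong in outputTokens.
interface CallProfile { inputTokens: number; outputTokens: number }

function monthlyCostUsd(
  pricing: Pricing,
  profile: CallProfile,
  callsPerDay: number,
  daysPerMonth: number = 30,
): number {
  const callsPerMonth = callsPerDay * daysPerMonth;
  // tokens × (USD per million tokens) yields millionths of a dollar per call.
  const microUsdPerCall =
    profile.inputTokens * pricing.inputPerMTok +
    profile.outputTokens * pricing.outputPerMTok;
  return (callsPerMonth * microUsdPerCall) / 1e6;
}

// Assumed profile: 100 input tokens and 20 output tokens per call.
const exampleMonthlyUsd = monthlyCostUsd(
  { inputPerMTok: 1, outputPerMTok: 5 },
  { inputTokens: 100, outputTokens: 20 },
  100_000,
); // about 600
```

Under these assumed prices the 20 output tokens cost as much as the 100 input tokens, which is why logging the usage fields per call matters more than estimating from prompt length.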

Key Insights

  • The stop_reason field is critical for branching logic; ignoring it leads to bugs when responses are truncated by max_tokens or interrupted by tool_use (Virdi, 2026).
  • Tokenization does not follow word boundaries; for example, ‘Unbelievable’ is one word but four tokens (Virdi, 2026).
  • Non-English languages incur higher costs, with Japanese, Hindi, and Arabic typically running 2–4× the token count of English content (Virdi, 2026).
  • Pricing is asymmetric between input and output: long prompts are comparatively cheap, while output tokens cost roughly 5× as much as input tokens (Virdi, 2026).
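
The branching on stop_reason described in the first insight can be sketched as a small dispatcher. Only max_tokens and tool_use are named in the summary. The "end_turn" value for a normal completion is an assumption borrowed from Anthropic's Messages API, other providers use different names, and the `NextStep` type and `nextStep` function are hypothetical.

```typescript
// What the caller should do after inspecting stop_reason.
type NextStep =
  | { kind: "done" }                          // model finished normally
  | { kind: "truncated" }                     // output hit the max_tokens cap; text is incomplete
  | { kind: "run_tool" }                      // model is waiting for a tool result
  | { kind: "unknown"; stopReason: string };  // surface anything unexpected instead of hiding it

function nextStep(stopReason: string): NextStep {
  switch (stopReason) {
    case "end_turn": // ASSUMED name for a normal completion
      return { kind: "done" };
    case "max_tokens":
      return { kind: "truncated" };
    case "tool_use":
      return { kind: "run_tool" };
    default:
      return { kind: "unknown", stopReason };
  }
}
```

Returning an explicit "unknown" case rather than treating every response as complete is the design point: a truncated answer that is displayed as final is exactly the kind of bug the insight warns about.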

Practical Applications

  • Use case: Multi-turn chatbots. Behavior: Maintain a messages array and push every user prompt and model reply back into the next API call.

  • Pitfall: Bloated tool schemas. Consequence: These eat into the input budget on every single request since they are resent with each call.
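
To see how much fixed overhead tool definitions add, one rough approach is to measure their serialized size and multiply by the number of requests. The sketch below does that in characters, not tokens. The `ToolDef` shape and both function names are assumptions for illustration, and character counts are only a relative signal, since tokens do not map cleanly onto characters. The usage numbers reported by the API remain the source of truth.

```typescript
// Hypothetical tool definition shape; real APIs use similar but not identical fields.
interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

// Serialized size of all tool definitions, in characters (a proxy, not a token count).
function toolSchemaChars(tools: ToolDef[]): number {
  return JSON.stringify(tools).length;
}

// Characters spent on tool schemas across `requests` calls, since the
// same definitions are attached to every request.
function schemaCharsOverRequests(tools: ToolDef[], requests: number): number {
  return toolSchemaChars(tools) * requests;
}
```

Comparing this number before and after trimming descriptions or dropping unused tools gives a quick sense of whether a schema cleanup is worth the effort.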

References:
