Gemma 4 E2B Exhibits Configuration-Deterministic Hallucinations at Low Context

What num_ctx=2048 Actually Produces

Engineer Thehwang ran a 15-run ablation study on the Gemma 4 E2B model. At a context window of 2048 tokens (num_ctx=2048), the model consistently produced three sequential outputs within a single response: a hallucinated summary, a self-disclaimer, and a cautious retry.

Why This Matters

This behavior highlights the gap between ‘trained calibration’ and ‘configuration-deterministic’ artifacts. The model appears to detect truncated input, but the effect triggers only under a specific context-window setting (num_ctx=2048) and temperature (0.0). It is not a general semantic capability for detecting damaged input across configurations.

Key Insights

Configuration over Input: The multi-pass hedge fires specifically at num_ctx=2048 and temperature=0.0, regardless of whether the input is syntactically broken or semantically mid-stream (Thehwang, 2026).
Multi-Pass Response Pattern: The model appears to review its own output mid-response, generating a templated hallucination followed by a ‘Note:’ stating that the information is not in the transcript (Example: Gemma 4 E2B via Ollama).
Null Result at High Context: At num_ctx=32768, the model does not hedge on any input shape, including tail-of-document signals or mid-word cuts (Ablation Rows 2, 3, 4, 6).
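
The multi-pass pattern described above can be checked mechanically. The following is a minimal sketch, not the study's own classify.py: it assumes the disclaimer begins on a line starting with "Note:" and that a blank line separates the disclaimer from the retry. The function name split_passes and the labels it returns are hypothetical.

```python
import re

# Assumption: the self-disclaimer starts on its own line with "Note:".
# The source only says the hallucination is "followed by a 'Note:'".
NOTE_RE = re.compile(r"^[ \t]*Note:", re.MULTILINE)


def split_passes(response: str) -> dict:
    """Split a model response around the first line that starts with 'Note:'."""
    match = NOTE_RE.search(response)
    if match is None:
        # No disclaimer found: treat the whole response as one pass.
        return {"pattern": "single-pass", "before": response.strip(),
                "note": "", "after": ""}

    before = response[:match.start()].strip()
    rest = response[match.start():]
    # Assumption: the first blank line after the note ends the disclaimer.
    parts = re.split(r"\n[ \t]*\n", rest, maxsplit=1)
    note = parts[0].strip()
    after = parts[1].strip() if len(parts) > 1 else ""

    # "multi-pass" means text exists both before and after the disclaimer.
    pattern = "multi-pass" if before and after else "note-only"
    return {"pattern": pattern, "before": before, "note": note, "after": after}
```

A response labeled "multi-pass" matches the shape reported at num_ctx=2048 (summary, disclaimer, retry). Whether the first summary is actually hallucinated still has to be judged against the transcript.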

Working Examples

Harness for replicating the calibration ablation study.

git clone https://github.com/thehwang/Scripta && cd Scripta/benchmarks/calibration-ablation
bash run.sh # rows 2, 3, 4, 6 at num_ctx=32768
NUM_CTX=2048 bash run.sh --rows row1 # the configuration-deterministic case
python3 classify.py > classification-report.md
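
For readers who want to compare the two configurations without the harness, here is a rough sketch of a direct call to Ollama's local /api/generate endpoint with num_ctx and temperature set through the options field. The model tag below is a hypothetical placeholder (use whatever `ollama list` reports), and the helper names build_payload and generate are illustrative rather than part of the Scripta repository.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e2b"  # hypothetical tag; replace with your local model name


def build_payload(prompt: str, num_ctx: int, temperature: float = 0.0) -> dict:
    """Build a non-streaming generate request with an explicit context window."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def generate(prompt: str, num_ctx: int) -> str:
    """Send the request to a running Ollama server and return the response text."""
    data = json.dumps(build_payload(prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate(transcript, 2048) and generate(transcript, 32768) on the same transcript gives the two outputs to compare; a running Ollama server is required for that step.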

Practical Applications

Use case: Gemma 4 E2B via Ollama producing structured meeting summaries under strict context limits.
Pitfall: Misinterpreting configuration artifacts as general model calibration; leads to overconfident claims about model reliability.

References:

https://dev.to/thehwang/gemma-4-wrote-three-summaries-in-one-response-the-middle-one-was-a-self-disclaimer-3pj9
