What the Model Actually Does

You type a prompt. Code appears. Between those two events, a specific computational process executes — one that has nothing to do with understanding, reasoning, or software engineering. If you’re going to use this tool daily, you need an accurate picture of that process. Not an ML-researcher-level picture. An engineer-level one. Enough to know what the tool is structurally incapable of doing.

Tokenization: The First Lossy Step

Your prompt enters the model not as words, not as characters, but as tokens — subword units determined by a Byte-Pair Encoding (BPE) algorithm. BPE works by starting with individual characters, then iteratively merging the most frequent adjacent pairs into single tokens. The training corpus determines which merges happen. Common words like the, return, def become single tokens. Uncommon sequences get split.
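To make the merge loop concrete, here is a minimal sketch of BPE training on a toy corpus. It is an illustration under simplifying assumptions, not any production tokenizer: it works on characters rather than bytes, ignores whitespace handling, and the function names (learn_bpe_merges, merge_pair, tokenize) are made up for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of merges from a toy corpus (a list of words)."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it appears.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus: "return" is frequent, "retry" is rare.
corpus = ["return"] * 10 + ["retry"] * 2
merges = learn_bpe_merges(corpus, num_merges=5)
print(tokenize("return", merges))  # the frequent word collapses to one token
print(tokenize("retry", merges))   # the rarer word stays split into pieces
```

The frequent word earns its own token; the rare one is left in fragments. Which words get which treatment is decided entirely by corpus frequency.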

Here’s what this means concretely. When you type:

connection_pool_size = 20

The tokenizer might break this into:

["connection", "_pool", "_size", " =", " 20"]

Five tokens. The model never sees connection_pool_size as a single concept meaning “the number of connections in a pool.” It sees a sequence of token IDs, integers such as [15234, 62891, 41023, 284, 220] (illustrative values; the real IDs depend on the tokenizer). The semantic relationship between “connection” and “pool” exists only as statistical co-occurrence patterns learned during training, not as understood meaning.

This matters because engineers tend to think the model “reads” their prompt the way a human does. It doesn’t. It processes a sequence of integer IDs through mathematical operations. If you name your variable cnnPl_sz, the tokenization changes, the statistical patterns change, and the model’s behavior changes — even though a human would understand both names mean the same thing.

Variable naming affects model output. That’s not a quirk. That’s a direct consequence of how the input is represented.
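A rough way to see the effect is a toy tokenizer with a tiny, hypothetical vocabulary. This sketch uses greedy longest-match rather than real BPE, and TOY_VOCAB is invented for the example, but the qualitative result carries over: the familiar name maps to a few known pieces, and the abbreviated one shatters into single characters.

```python
# Hypothetical vocabulary: only the pieces from the example above.
TOY_VOCAB = {"connection", "_pool", "_size", " =", " 20"}

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenizer (a stand-in for BPE, not BPE itself)."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown text falls back to single characters
        tokens.append(match)
        i += len(match)
    return tokens

print(greedy_tokenize("connection_pool_size = 20", TOY_VOCAB))
print(greedy_tokenize("cnnPl_sz = 20", TOY_VOCAB))
```

Same meaning to a human reader, very different input sequences for the model.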

The Embedding Space: Geometry, Not Semantics

Each token ID maps to a vector in a high-dimensional space, anywhere from several hundred to more than ten thousand dimensions depending on the model (4096 is a common size). These embedding vectors are learned during training. Tokens that appear in similar contexts end up closer together geometrically.

The word psycopg2 and the word pg8000 will be relatively close in embedding space because they appear in similar contexts — Python database connection code. The word psycopg2 and the word banana will be far apart. This looks like meaning, and it’s tempting to call it that. It isn’t. It’s compressed co-occurrence statistics. The model doesn’t know that psycopg2 is a PostgreSQL driver or that banana is a fruit. It knows their statistical neighborhoods.
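“Close” here usually means cosine similarity between vectors. The sketch below uses hand-written 4-dimensional vectors as stand-ins; the numbers are invented for illustration and do not come from any real model, where the vectors are learned and far larger.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, made up for this example.
toy_embeddings = {
    "psycopg2": np.array([0.9, 0.8, 0.1, 0.0]),
    "pg8000":   np.array([0.8, 0.9, 0.2, 0.1]),
    "banana":   np.array([0.0, 0.1, 0.9, 0.8]),
}

driver_similarity = cosine_similarity(toy_embeddings["psycopg2"], toy_embeddings["pg8000"])
fruit_similarity = cosine_similarity(toy_embeddings["psycopg2"], toy_embeddings["banana"])

print(f"psycopg2 vs pg8000: {driver_similarity:.3f}")
print(f"psycopg2 vs banana: {fruit_similarity:.3f}")
```

Nothing in that computation refers to databases or fruit. It is arithmetic on coordinates.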

This distinction becomes critical when the model generates code using libraries it has seen frequently versus libraries released after its training cutoff, or niche libraries with few training examples. For well-represented libraries, the statistical patterns are rich and the output looks competent. For underrepresented ones, the patterns are sparse and the output degrades — sometimes generating function signatures that don’t exist, import paths that are wrong, or API patterns from similar but different libraries.

The model doesn’t “know” any API. It has statistical weight distributions that correlate with API patterns from its training data.

Self-Attention: The Context Machine

The architectural core is the transformer’s self-attention mechanism. Here’s what it does in engineering terms.

For each token position, the model computes three vectors from the embedding: a query (what am I looking for?), a key (what do I contain?), and a value (what do I contribute?). The raw attention score between two positions is the dot product of one position’s query with another position’s key. Those scores are scaled and passed through a softmax, so each position ends up with a set of weights, summing to 1, over the positions it can see. A high dot product means a high weight: this position is relevant to that position.
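Here is a minimal single-head sketch of that computation in NumPy. The weight matrices are random placeholders (a trained model learns them), the sizes are toy values, and the causal mask reflects the usual setup for code-generating models, where a position can only attend to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(embeddings, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product self-attention.

    embeddings: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)
    Returns the attended output and the attention weight matrix.
    """
    q = embeddings @ w_q          # what each position is looking for
    k = embeddings @ w_k          # what each position contains
    v = embeddings @ w_v          # what each position contributes
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)   # query-key dot products, scaled
    if causal:
        # Block attention to later positions (strict upper triangle).
        seq_len = scores.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))     # 5 tokens, 8-dim toy embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = self_attention(embeddings, w_q, w_k, w_v)
print(np.round(weights, 2))              # row i: how much token i attends to each token
```

Each output row is a weighted average of value vectors. That is the “weighted lookup”: no rules, just dot products and a softmax.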

Think of it as a weighted lookup table computed on the fly. When the model is generating code and has seen def connect(self, host, port, database): earlier in the context, the attention mechanism at the current generation step assigns high weights to those function signature tokens, pulling information from them to influence the next token prediction.

The model processes multiple attention patterns simultaneously — “multi-head attention.” Different heads learn to attend to different things. One head might track syntactic structure (matching parentheses, indentation level). Another might track variable references. Another might track imports and their usage. These aren’t explicitly programmed; they emerge from training.
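Mechanically, “multiple heads” means splitting the projected vectors into slices and running attention independently on each slice. This is a rough sketch with random weights and toy sizes; it omits the causal mask for brevity, and what each head ends up tracking in a real model depends on training, not on anything visible in this code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # One (seq_len, seq_len) score matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    # Concatenate the heads back into one vector per position.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                               # 6 tokens, d_model = 8
w_q, w_k, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
print(np.round(weights[0], 2))   # attention pattern of head 0
print(np.round(weights[1], 2))   # attention pattern of head 1
```

The two heads see the same tokens and produce different attention patterns, because each works from its own slice of the projections.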

The practical implication: the model’s output is heavily influenced by what’s in its context window, the total sequence of tokens it can attend to. Typical windows are 8K, 32K, 128K, or more tokens. Tokens near the beginning and end of the context tend to get higher attention (a known bias). Tokens in the middle of very long contexts can get “lost.” This is one reason long conversations with an AI assistant degrade in quality: as the conversation grows, earlier turns sink into the middle of the context, and the model starts losing track of constraints, requirements, and earlier decisions.

This is also why Retrieval-Augmented Generation (RAG) exists: instead of hoping relevant information survives in a long context, you retrieve the most relevant chunks and inject them close to the prompt, where attention is strongest.
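The retrieval step can be sketched in a few lines. This example is a simplification: it scores chunks by word overlap (cosine similarity over bag-of-words counts), where real systems typically use learned embeddings and a vector index. The function names and the prompt layout are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts: a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks, top_k=2):
    """Place the retrieved chunks directly before the question."""
    context = "\n".join(retrieve(query, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The connection pool size defaults to 20.",
    "Bananas are rich in potassium.",
    "Retry logic uses exponential backoff.",
]
query = "What is the connection pool size?"
print(build_prompt(query, chunks, top_k=1))
```

The model never searches anything. The retrieval code decides what lands in the context, and the model predicts tokens from whatever it was handed.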

Temperature and Sampling: Rolling Weighted Dice

After attention and feed-forward layers produce an output vector, the model converts it to a probability distribution over its entire vocabulary. For a 50,000-token vocabulary, you get 50,000 probabilities that sum to 1.0. The next token is sampled from this distribution.

The temperature parameter controls the shape of this distribution. Here’s simplified Python that captures the mechanics:

import numpy as np

def sample_next_token(logits, temperature=1.0):
    """
    logits: raw model output scores for each token in vocabulary
    temperature: controls randomness (0.0 = deterministic, >1.0 = more random)
    """
    # Scale logits by temperature
    scaled = logits / max(temperature, 1e-8)
    
    # Softmax: convert to probabilities
    exp_scaled = np.exp(scaled - np.max(scaled))  # numerical stability
    probabilities = exp_scaled / np.sum(exp_scaled)
    
    # Sample from the distribution
    token_id = np.random.choice(len(probabilities), p=probabilities)
    return token_id

# Example: model output scores for 5 tokens
logits = np.array([2.0, 1.5, 0.8, 0.3, -1.0])

# Low temperature: nearly deterministic
# Token 0 gets ~99% probability, token 1 under 1%
sample_next_token(logits, temperature=0.1)

# High temperature: more uniform
# Token 0 gets ~34%, token 1 ~26%, and the rest are closer
sample_next_token(logits, temperature=2.0)

At temperature 0 (or extremely close to 0), the model always picks the highest-probability token. The output is deterministic. At temperature 1.0, sampling follows the learned distribution. At higher temperatures, the distribution flattens — unlikely tokens become more probable, and the output becomes more “creative” (which is a euphemism for “more random”).
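To see the reshaping directly, this sketch computes the probabilities instead of sampling from them, using the same five example logits. The helper names are made up for this example. For token 0 it gives roughly 99% at temperature 0.1, 47% at 1.0, and 34% at 2.0. The greedy function shows what temperature 0 usually means in practice: skip the dice and take the argmax.

```python
import numpy as np

def temperature_probabilities(logits, temperature):
    """Softmax over temperature-scaled logits (temperature must be > 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))  # numerical stability
    return exp_scaled / np.sum(exp_scaled)

def greedy_token(logits):
    """Temperature 0 in practice: always pick the highest-scoring token."""
    return int(np.argmax(logits))

logits = [2.0, 1.5, 0.8, 0.3, -1.0]

for temperature in (0.1, 1.0, 2.0):
    probs = temperature_probabilities(logits, temperature)
    print(f"T={temperature}: {np.round(probs, 3)}")

print("greedy pick:", greedy_token(logits))
```

The ranking of tokens never changes with temperature. Only the gap between them does.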

Here’s the full generation loop, simplified:

def generate(model, tokenizer, prompt, max_tokens=200, temperature=0.7):
    token_ids = tokenizer.encode(prompt)
    
    for _ in range(max_tokens):
        # Feed current sequence through model
        logits = model.forward(token_ids)
        
        # Get logits for next position only
        next_logits = logits[-1]
        
        # Sample next token
        next_token = sample_next_token(next_logits, temperature)
        
        # Append and continue
        token_ids.append(next_token)
        
        # Stop if we hit end-of-sequence token
        if next_token == tokenizer.eos_token_id:
            break
    
    return tokenizer.decode(token_ids)

That’s the entire generation process. Run the model forward on the current sequence, get a probability distribution, roll the dice, append the result, repeat. There’s no planning step. No “think about the architecture first, then write the code” phase. No backtracking to fix mistakes. Each token is generated left to right, one at a time, based solely on whatever tokens precede it.

When you see an AI “plan” its code by writing comments first, that’s not planning. That’s the model generating comment tokens because comments frequently precede function implementations in training data. The comments then influence subsequent token probabilities, which is why chain-of-thought prompting works — it seeds the context with tokens that make correct subsequent tokens more probable. But the model isn’t thinking. It’s predicting.

The Correctness Impossibility

Here’s the argument distilled to its core.

A next-token predictor produces tokens that are statistically likely given the context. Statistical likelihood is a function of the training data distribution. Code that appears frequently in training data produces strong statistical signals. Code that appears rarely produces weak ones.

Bug-free code and buggy code coexist in the training data. The model assigns probabilities to both. When it generates a connection pool without health checks, it’s not making an error — it’s producing a statistically plausible sequence. Many connection pool implementations in training data lack health checks, because many connection pool implementations in the wild lack health checks. Buggy code is common. The model faithfully reproduces that commonality.
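A bigram counter is about the smallest next-token predictor you can build, and it shows the same effect. The corpus below is invented for illustration: three snippets return a connection without checking it, one checks it first. A real model is vastly more sophisticated, but the principle that frequency in the data drives probability in the output is the same.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_lines):
    """Count, for every token, which tokens follow it and how often."""
    counts = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probabilities(counts, prev):
    """Turn follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Hypothetical training corpus: the unchecked pattern is the common one.
corpus = [
    "conn = pool.get() ; return conn",
    "conn = pool.get() ; return conn",
    "conn = pool.get() ; return conn",
    "conn = pool.get() ; check(conn) ; return conn",
]

counts = train_bigram(corpus)
probs = next_token_probabilities(counts, ";")
print(probs)  # "return" dominates; "check(conn)" is the minority continuation
```

The predictor is not wrong about its data. The data is simply not a specification.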

Correctness is a property of how code behaves relative to requirements. The model has no access to your requirements — only to your prompt, which is an incomplete, natural-language approximation of your requirements. It has no ability to verify behavior — it doesn’t execute code, run tests, or check invariants. It produces tokens and moves on.
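Verification has to happen outside the model. As a rough sketch, assume two candidate outputs for “return the last n items of a list”; both strings are hypothetical, written for this example rather than taken from any model. Both look plausible. Only executing them against explicit cases tells them apart.

```python
def passes_requirements(candidate_source, func_name, cases):
    """Execute candidate code and check it against explicit input/output cases."""
    namespace = {}
    exec(candidate_source, namespace)  # only run code you trust or have sandboxed
    func = namespace[func_name]
    return all(func(*args) == expected for args, expected in cases)

# Hypothetical candidate 1: reads naturally, but items[-0:] is the whole list.
plausible_candidate = (
    "def last_n(items, n):\n"
    "    return items[-n:]\n"
)

# Hypothetical candidate 2: handles the n == 0 edge case.
correct_candidate = (
    "def last_n(items, n):\n"
    "    return items[-n:] if n > 0 else []\n"
)

# The requirements, written down as executable cases.
cases = [
    (([1, 2, 3], 2), [2, 3]),
    (([1, 2, 3], 0), []),
]

print(passes_requirements(plausible_candidate, "last_n", cases))
print(passes_requirements(correct_candidate, "last_n", cases))
```

The cases encode a requirement the prompt never stated. That encoding, and the decision to run it, came from an engineer.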

You might object: “But newer models are getting better! They reason!” What you’re observing is that some newer models implement chain-of-thought as an explicit phase — generating reasoning tokens before answer tokens. This improves output quality on tasks that benefit from sequential decomposition. But the underlying mechanism is identical: next-token prediction. The reasoning tokens are themselves predicted, not derived from logical inference. They’re statistically likely reasoning steps, which are often correct reasoning steps, but “often correct” and “provably correct” are different categories separated by an ocean of production incidents.

What This Means for You

When you use an AI coding assistant, here’s what’s actually happening:

1. Your prompt is tokenized into integer IDs
2. Those IDs are embedded into a high-dimensional vector space
3. Transformer layers apply attention to compute context-weighted representations
4. A probability distribution over 50,000+ tokens is computed
5. One token is sampled from that distribution
6. Steps 2-5 repeat for every single token in the output
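The whole pipeline fits in a toy sketch. Everything here is a stand-in: the weights are random rather than trained, there is one attention layer and no feed-forward layers, the vocabulary has 50 entries instead of 50,000+, and tokenization (step 1) is skipped by starting from token IDs. The output is gibberish, but the control flow is the one described in the steps.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, D_MODEL = 50, 16   # toy sizes; real models are far larger

# Random placeholder weights; a real model learns these during training.
embedding = rng.normal(size=(VOCAB_SIZE, D_MODEL))
w_q, w_k, w_v = (rng.normal(size=(D_MODEL, D_MODEL)) * 0.1 for _ in range(3))
unembedding = rng.normal(size=(D_MODEL, VOCAB_SIZE))

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def next_token_distribution(token_ids):
    x = embedding[token_ids]                          # step 2: IDs -> vectors
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # step 3: attention
    scores = q @ k.T / np.sqrt(D_MODEL)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # no peeking at later tokens
    context = softmax(scores) @ v
    logits = (x[-1] + context[-1]) @ unembedding      # last position only
    return softmax(logits)                            # step 4: distribution

def generate_ids(token_ids, num_new_tokens):
    token_ids = list(token_ids)
    for _ in range(num_new_tokens):                   # step 6: repeat
        probs = next_token_distribution(token_ids)
        token_ids.append(int(rng.choice(VOCAB_SIZE, p=probs)))  # step 5: sample
    return token_ids

prompt_ids = [3, 14, 7]        # step 1 stand-in: pretend these came from a tokenizer
generated = generate_ids(prompt_ids, num_new_tokens=5)
print(generated)
```

Scale the matrices up by several orders of magnitude, stack dozens of layers, and train the weights, and you have the real thing. The loop does not change.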

No understanding occurs at any step. No verification occurs at any step. No reasoning about your specific system, your specific constraints, your specific failure modes occurs at any step.

The model can’t be “careful.” It can’t be “thorough.” It can’t “double-check its work.” Those are human cognitive operations that have no analog in the architecture. When you tell the model “be careful with error handling,” you’re injecting tokens that increase the probability of error-handling-related tokens in the output. That often produces more error handling code. Whether that error handling is correct for your specific case is not something the architecture can evaluate.

You’re working with a sophisticated autocomplete engine that operates on statistical patterns across an enormous training corpus. That’s genuinely powerful — it means the model has been exposed to more code patterns than any individual engineer will see in a lifetime. But exposure to patterns is not understanding of patterns. And when the pattern doesn’t match your problem — when your constraints differ from the typical case in the training data — the output will look right and be wrong.

That’s the gap you need to fill. Not the model. You.