Text Generation — Making Your Model Talk

In the previous chapter, you trained a GPT model. You fed it millions of tokens, computed losses, backpropagated gradients, and watched the loss go down. Your model’s weights are no longer random — they encode patterns of language learned from real text.

But so far, you’ve only ever asked the model to predict the next token for existing sequences in the training data. You’ve never asked it to create new text from scratch. That’s what generation is: giving the model a prompt like “The weather today is” and letting it continue with “sunny and warm, perfect for a walk in the park.”

This is the moment the model finally talks.

Generation turns out to be surprisingly simple in concept, but full of nuance in practice. The basic loop is just three steps repeated over and over. But how you choose each next word — greedily? randomly? somewhere in between? — dramatically affects the quality of the output. This chapter covers every major strategy used by production LLMs, from ChatGPT to Claude to Llama.

By the end of this chapter, you will:

Understand autoregressive generation (one token at a time)
Implement greedy decoding and see its failure modes
Use temperature to control randomness
Implement top-k sampling to filter unlikely tokens
Implement top-p (nucleus) sampling for adaptive selection
Combine all strategies into a single generation function
Add KV-caching to make generation efficient
Apply repetition penalties to reduce looping
Generate coherent text from your trained model

Let’s make the model talk.

1. How Generation Works

Here’s the core idea: a language model generates text one token at a time. It predicts the next word, appends it to the input, then predicts the next word after that, and so on. This process is called autoregressive generation — each new token depends on all the tokens that came before it.

Think of the autocomplete on your phone. You type “I’m going to the” and your keyboard suggests “store,” “gym,” or “park.” You tap “park” and now the context is “I’m going to the park” — and it suggests new words based on this longer phrase. You could keep tapping suggestions and produce an entire sentence without typing a single letter.

That’s exactly how a language model generates text.

The Autoregressive Loop

Here’s the loop spelled out:

Step 1: Input  = "The weather today"
        Model predicts → "is"

Step 2: Input  = "The weather today is"
        Model predicts → "sunny"

Step 3: Input  = "The weather today is sunny"
        Model predicts → "and"

Step 4: Input  = "The weather today is sunny and"
        Model predicts → "warm"

...and so on until we decide to stop.

Each step, the model sees the full sequence so far (including all previously generated tokens), computes logits for every word in the vocabulary, and then we pick one word to append. The critical question — and the subject of this entire chapter — is: how do we pick that one word?
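To make the loop concrete before bringing in the real model, here is a minimal sketch in plain Python. The lookup table NEXT_WORD and the helpers toy_predict and toy_generate are hypothetical stand-ins invented for this illustration. Unlike a real language model, the toy looks only at the last word rather than the whole sequence, and it always returns a single answer instead of logits.

```python
# Hypothetical stand-in for a trained model: maps the last word to a next word.
NEXT_WORD = {"today": "is", "is": "sunny", "sunny": "and", "and": "warm"}


def toy_predict(words):
    """Return the 'predicted' next word, or None if the table has no entry."""
    return NEXT_WORD.get(words[-1])


def toy_generate(prompt, max_new_tokens=10):
    words = prompt.split()
    for _ in range(max_new_tokens):
        next_word = toy_predict(words)  # predict from the sequence so far
        if next_word is None:           # nothing to predict: stop early
            break
        words.append(next_word)         # append, then loop again
    return " ".join(words)


result = toy_generate("The weather today")
print(result)
```

The structure (predict, append, repeat until a stop condition) is the same one the real generation functions in this chapter follow.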

Let’s set up the code. We’ll assume you have a trained model and tokenizer from Chapter 7:

import torch
import torch.nn.functional as F

# Assume these are loaded from Chapter 7
# model = GPTModel(config)       # your trained model
# tokenizer = ...                # your BPE tokenizer from Chapter 3

def generate_next_token_logits(model, token_ids):
    """
    Feed token_ids through the model and return logits for the NEXT token.

    Args:
        model: trained GPT model
        token_ids: tensor of shape (1, seq_len)

    Returns:
        logits: tensor of shape (vocab_size,) — scores for each possible next word
    """
    model.eval()  # Turn off dropout
    with torch.no_grad():  # No gradients needed during generation
        logits = model(token_ids)         # (1, seq_len, vocab_size)
        next_token_logits = logits[0, -1, :]  # Take the LAST position
    return next_token_logits

Why the last position? Because of causal masking. The model’s prediction at position i is based on tokens 0 through i. So the prediction at the last position is the one that sees the entire input — that’s the prediction for the next word.

2. Greedy Decoding

The simplest strategy: always pick the word with the highest probability. Just take the argmax of the logits.

def greedy_decode(model, tokenizer, prompt, max_new_tokens=50):
    """
    Generate text by always picking the most probable next token.

    Args:
        model: trained GPT model
        tokenizer: tokenizer with encode() and decode()
        prompt: string to start generation from
        max_new_tokens: how many new tokens to generate
    """
    # Encode the prompt into token IDs
    token_ids = tokenizer.encode(prompt)
    token_ids = torch.tensor([token_ids], dtype=torch.long)  # (1, seq_len)

    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Get logits for the next token
            logits = model(token_ids)              # (1, seq_len, vocab_size)
            next_logits = logits[0, -1, :]         # (vocab_size,)

            # Pick the highest-scoring token
            next_token = torch.argmax(next_logits)  # scalar

            # Append it to the sequence
            next_token = next_token.unsqueeze(0).unsqueeze(0)  # (1, 1)
            token_ids = torch.cat([token_ids, next_token], dim=1)

    # Decode the full sequence back to text
    generated_text = tokenizer.decode(token_ids[0].tolist())
    return generated_text

Let’s try it out:

# Example usage (assuming a trained model)
output = greedy_decode(model, tokenizer, "The weather today is", max_new_tokens=20)
print(output)
# Possible output: "The weather today is sunny and warm and sunny and warm and sunny and warm and sunny and warm and sunny"

The Problem with Greedy Decoding

Did you notice the repetition? “sunny and warm and sunny and warm…” That’s the fundamental problem with greedy decoding. Once the model enters a high-probability loop, it gets stuck. It keeps predicting the same sequence because, at each step, those words genuinely are the most probable next word.

This isn’t a bug in the model; it’s a limitation of the decoding strategy. Picking the most probable word at each step doesn’t necessarily produce the most probable sentence overall. Greedy decoding is locally optimal but can be globally poor.
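The gap between "best next word" and "best sentence" can be shown with a tiny two-step example. The probabilities below are invented purely for illustration, and greedy_path and best_path are hypothetical helper names; the sketch only assumes that a sequence's probability is the product of its per-step probabilities.

```python
# Hypothetical probabilities, invented for illustration only.
first_word = {"nice": 0.5, "sunny": 0.4, "odd": 0.1}
second_word = {
    "nice": {"so": 0.4, "and": 0.3, "but": 0.3},
    "sunny": {"and": 0.9, "but": 0.1},
    "odd": {"and": 1.0},
}


def greedy_path():
    """Pick the most probable word at each step."""
    w1 = max(first_word, key=first_word.get)
    w2 = max(second_word[w1], key=second_word[w1].get)
    return (w1, w2), first_word[w1] * second_word[w1][w2]


def best_path():
    """Score every two-word sequence and return the most probable one."""
    scores = {
        (w1, w2): p1 * p2
        for w1, p1 in first_word.items()
        for w2, p2 in second_word[w1].items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


greedy_seq, greedy_prob = greedy_path()
best_seq, best_prob = best_path()
print(greedy_seq, round(greedy_prob, 2))
print(best_seq, round(best_prob, 2))
```

Greedy commits to "nice" because it wins the first step, and ends with a sequence probability of 0.20. Starting with the second-best word "sunny" leads to a sequence with probability 0.36.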

Think of it like navigating a city. Always taking the widest road at every intersection might lead you in circles around the city center, never reaching your destination. Sometimes you need to take a smaller road to get somewhere interesting.

Human language is full of surprises. Good writing isn’t just a sequence of the most predictable words — that would read like a corporate memo written by committee. We need strategies that allow some randomness, some creativity, while still keeping the text coherent.

3. Temperature

The first tool for controlling randomness is temperature. It’s beautifully simple: before applying softmax to convert logits into probabilities, divide all logits by a temperature value T.

$$P(w_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

Where $z_i$ is the logit for word $i$ and $T$ is the temperature.

The effect:

T = 1.0: Normal softmax. No change. The default.
T < 1.0: Dividing by a number below 1 scales every logit up, which widens the gaps between them. The distribution becomes sharper: the model becomes more confident, more predictable.
T > 1.0: Dividing by a number above 1 shrinks every logit toward zero, which narrows the gaps between them. The distribution becomes flatter: all words become more equally likely. More random, more creative.

Think of it as a dial. Turn it toward 0 and you get a strict librarian who always picks the “correct” word. Turn it toward infinity and you get a wild poet throwing words together based on loose associations. The sweet spot is somewhere in between.
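The two ends of the dial can be checked numerically. The sketch below is plain Python rather than PyTorch so the arithmetic stays visible; softmax_with_temperature is a hypothetical helper name, and the temperatures 0.05 and 100.0 are arbitrary choices meant to approximate "near zero" and "very large".

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [5.0, 3.0, 2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.05)   # close to greedy decoding
hot = softmax_with_temperature(logits, 100.0)   # close to a uniform draw

print("cold:", [round(p, 3) for p in cold])
print("hot: ", [round(p, 3) for p in hot])
```

At the cold end nearly all probability lands on the top logit, which is why very low temperatures behave like greedy decoding. At the hot end the five probabilities are all close to 1/5, so the logits barely influence the draw.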

Implementation

def apply_temperature(logits, temperature=1.0):
    """
    Scale logits by temperature before softmax.

    Args:
        logits: raw model output scores, shape (vocab_size,)
        temperature: float > 0. Lower = more confident, higher = more random.

    Returns:
        probabilities: shape (vocab_size,)
    """
    if temperature <= 0:
        raise ValueError("Temperature must be positive")

    scaled_logits = logits / temperature
    probabilities = F.softmax(scaled_logits, dim=-1)
    return probabilities

Seeing the Effect

Let’s look at the same logits with different temperatures:

# Suppose the model outputs these logits for the next token
# (higher = model thinks this word is more likely)
logits = torch.tensor([5.0, 3.0, 2.0, 1.0, 0.5])
words = ["sunny", "cloudy", "rainy", "windy", "foggy"]

for temp in [0.5, 1.0, 1.5]:
    probs = apply_temperature(logits, temperature=temp)
    print(f"\nTemperature = {temp}")
    for word, prob in zip(words, probs):
        bar = "█" * int(prob * 50)
        print(f"  {word:8s}: {prob:.3f} {bar}")

Output:

Temperature = 0.5
  sunny   : 0.979 ████████████████████████████████████████████████
  cloudy  : 0.018 
  rainy   : 0.002 
  windy   : 0.000 
  foggy   : 0.000 

Temperature = 1.0
  sunny   : 0.823 █████████████████████████████████████████
  cloudy  : 0.111 █████
  rainy   : 0.041 ██
  windy   : 0.015 
  foggy   : 0.009 

Temperature = 1.5
  sunny   : 0.659 ████████████████████████████████
  cloudy  : 0.174 ████████
  rainy   : 0.089 ████
  windy   : 0.046 ██
  foggy   : 0.033 █

At T=0.5, “sunny” takes about 98% of the probability and is almost guaranteed to be picked. At T=1.5 its share drops to about 66%, and “cloudy” and “rainy” have a real chance of being selected. This is how you move from boring, repetitive text to more creative, varied output.

Sampling with Temperature

Now let’s use temperature in our generation loop. Instead of argmax (greedy), we sample from the distribution:

def sample_with_temperature(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    """Generate text by sampling with temperature scaling."""
    token_ids = tokenizer.encode(prompt)
    token_ids = torch.tensor([token_ids], dtype=torch.long)

    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(token_ids)
            next_logits = logits[0, -1, :]

            # Apply temperature and get probabilities
            probs = apply_temperature(next_logits, temperature)

            # Sample from the distribution instead of argmax
            next_token = torch.multinomial(probs, num_samples=1)  # (1,)

            token_ids = torch.cat([token_ids, next_token.unsqueeze(0)], dim=1)

    return tokenizer.decode(token_ids[0].tolist())

torch.multinomial is the key function here. Instead of always taking the most probable token, it randomly samples a token, where each token’s chance of being picked is proportional to its probability. A token with 80% probability gets picked most often, but a token with 10% probability still gets picked roughly one time in ten.
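The "proportional to its probability" behavior is easy to verify empirically. The sketch below uses random.choices from the Python standard library as a stand-in for torch.multinomial (an assumption made so the example runs without PyTorch), with a fixed seed and 10,000 draws; the softmax helper is defined locally.

```python
import math
import random
from collections import Counter


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


words = ["sunny", "cloudy", "rainy", "windy", "foggy"]
probs = softmax([5.0, 3.0, 2.0, 1.0, 0.5])  # temperature 1.0

rng = random.Random(0)  # fixed seed so the run is repeatable
num_draws = 10_000
draws = rng.choices(words, weights=probs, k=num_draws)
counts = Counter(draws)

for word, prob in zip(words, probs):
    observed = counts[word] / num_draws
    print(f"{word:8s} expected {prob:.3f}  observed {observed:.3f}")
```

The observed frequencies track the probabilities closely but not exactly, which is the whole point of sampling: the most likely word wins most of the time, and the others still get their turn.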

4. Top-K Sampling

Temperature controls how random the sampling is, but it still considers every single word in the vocabulary. If your vocabulary has 50,000 words, even a very low-probability word (say, “xylophone” after “The weather today is”) has some tiny chance of being selected. Each such chance is tiny, but there are thousands of unlikely words and you sample once per token, so over a long generation an out-of-place word eventually slips in and can derail the text.

Top-k sampling fixes this by only considering the K most probable words, then sampling from those. Every other word’s probability is set to zero.

Think of it this way: you’re at a restaurant with a 200-item menu. You’d never order the “deep-fried shoe.” But if someone randomly pointed at a menu item, there’s a small chance they’d land on it. Top-k is like ripping off the bottom 190 items and only showing you the top 10. Now every option is reasonable.
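One detail worth seeing in numbers: removing words and then renormalizing the survivors gives the same distribution as masking the removed logits with negative infinity before the softmax. The sketch below checks this in plain Python; softmax and top_k_renormalize are hypothetical helper names, and k=3 is an arbitrary choice.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def top_k_renormalize(probs, k):
    """Keep the k largest probabilities, zero the rest, rescale to sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


logits = [5.0, 3.0, 2.0, 1.0, 0.5, -1.0, -2.0, -5.0]
k = 3

# Route 1: softmax over everything, then drop and rescale.
via_probs = top_k_renormalize(softmax(logits), k)

# Route 2: mask logits below the k-th largest, then softmax.
threshold = sorted(logits, reverse=True)[k - 1]
masked = [z if z >= threshold else float("-inf") for z in logits]
via_logits = softmax(masked)

print([round(p, 3) for p in via_probs])
print([round(p, 3) for p in via_logits])
```

Both routes agree, which is why implementations usually work on logits: a single masking step followed by one softmax is enough.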

Implementation

def top_k_sampling(logits, k=50):
    """
    Mask out all logits except the top-k highest, then return probabilities.

    Args:
        logits: raw scores, shape (vocab_size,)
        k: number of top tokens to keep

    Returns:
        probabilities: shape (vocab_size,) with k non-zero entries
            (more if several logits tie at the threshold)
    """
    # Never ask for more tokens than the vocabulary holds
    k = min(k, logits.size(-1))

    # Find the k-th largest value as a threshold
    top_k_values, _ = torch.topk(logits, k)
    threshold = top_k_values[-1]  # The smallest value in the top-k

    # Set everything below the threshold to -infinity
    # (so softmax turns them into 0)
    filtered_logits = logits.clone()
    filtered_logits[logits < threshold] = float('-inf')

    # Convert to probabilities
    probabilities = F.softmax(filtered_logits, dim=-1)
    return probabilities

How K Affects Output

# Same logits, different K values
logits = torch.tensor([5.0, 3.0, 2.0, 1.0, 0.5, -1.0, -2.0, -5.0])
words = ["sunny", "cloudy", "rainy", "windy", "foggy", "snowy", "hazy", "xylophone"]

for k in [1, 3, 5, 8]:
    probs = top_k_sampling(logits, k=k)
    print(f"\nTop-K = {k}")
    for word, prob in zip(words, probs):
        if prob > 0.001:
            bar = "█" * int(prob * 40)
            print(f"  {word:12s}: {prob:.3f} {bar}")

Output:

Top-K = 1
  sunny       : 1.000 ████████████████████████████████████████

Top-K = 3
  sunny       : 0.844 █████████████████████████████████
  cloudy      : 0.114 ████
  rainy       : 0.042 █

Top-K = 5
  sunny       : 0.823 ████████████████████████████████
  cloudy      : 0.111 ████
  rainy       : 0.041 █
  windy       : 0.015 
  foggy       : 0.009 

Top-K = 8
  sunny       : 0.821 ████████████████████████████████
  cloudy      : 0.111 ████
  rainy       : 0.041 █
  windy       : 0.015 
  foggy       : 0.009 
  snowy       : 0.002 

Notice: K=1 is exactly greedy decoding (always pick the top word). K=8 keeps everything, but probability still concentrates on the top words; “hazy” and “xylophone” are kept too, but their probabilities fall below the 0.001 print cutoff. For real vocabularies, values around K=40-50 are a common starting point.

5. Top-P (Nucleus) Sampling

Top-k has a subtle problem: the right value of K depends on the situation. Sometimes the model is very confident — one word has 95% probability and everything else is noise. In that case, even K=10 includes 9 words that are basically garbage. Other times the model is genuinely uncertain — the top 50 words each have about 2% probability. In that case, K=10 is too restrictive and throws away perfectly good options.

Top-p sampling (also called nucleus sampling) solves this by being adaptive. Instead of keeping a fixed number of words, it keeps adding words from most probable to least probable until their cumulative probability exceeds a threshold P.

If the model is very confident:

Word 1 has 92% probability → cumulative: 92% → exceeds P=0.9 → stop. Only 1 word considered.

If the model is uncertain:

Word 1: 15% → cumulative: 15%
Word 2: 12% → cumulative: 27%
Word 3: 10% → cumulative: 37%
…
Word 10: 6% → cumulative: 91% → exceeds P=0.9 → stop. 10 words considered.

The number of candidate words adapts automatically to the model’s confidence. This is why top-p is a popular default in production systems.
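The adaptive behavior can be seen by counting how many words survive for a confident and an uncertain distribution. This is a minimal plain-Python sketch; nucleus_size is a hypothetical helper and both probability lists are invented for illustration.

```python
def nucleus_size(probs, p=0.9):
    """Count the tokens kept: the shortest sorted prefix whose mass reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


# Invented distributions; each list sums to 1.
confident = [0.92, 0.03, 0.02, 0.02, 0.01]
uncertain = [0.15, 0.12, 0.10, 0.09, 0.09, 0.08,
             0.08, 0.07, 0.07, 0.06, 0.05, 0.04]

print("confident:", nucleus_size(confident))
print("uncertain:", nucleus_size(uncertain))
```

With the same threshold of 0.9, the confident distribution keeps a single word while the uncertain one keeps ten. A fixed K could not serve both cases well.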

Implementation

def top_p_sampling(logits, p=0.9):
    """
    Keep the smallest set of tokens whose cumulative probability exceeds p.

    Args:
        logits: raw scores, shape (vocab_size,)
        p: cumulative probability threshold (0 to 1)

    Returns:
        probabilities: shape (vocab_size,) with low-probability tokens zeroed out
    """
    # Sort logits in descending order
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)

    # Convert sorted logits to probabilities
    sorted_probs = F.softmax(sorted_logits, dim=-1)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Find where cumulative probability exceeds p
    # We want to KEEP the token that pushes us over p,
    # so we shift the mask by 1
    sorted_mask = cumulative_probs - sorted_probs >= p

    # Zero out the filtered logits
    sorted_logits[sorted_mask] = float('-inf')

    # Unsort: put logits back in their original positions
    original_logits = torch.zeros_like(logits)
    original_logits.scatter_(0, sorted_indices, sorted_logits)

    # Convert to probabilities
    probabilities = F.softmax(original_logits, dim=-1)
    return probabilities

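Let’s break down the tricky part. After sorting, we compute cumulative probabilities, say [0.58, 0.79, 0.87, 0.92, 0.96, ...]. With p=0.9, we want to keep every token up to and including the fourth one, because that is the token that pushes the total past 0.9. Masking on cumulative_probs >= p alone would throw that boundary token away. Subtracting sorted_probs gives the probability mass that comes before each token, so cumulative_probs - sorted_probs >= p removes a token only when the tokens ahead of it already cover p.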
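The boundary-token behavior can be checked by hand with a short plain-Python sketch. The sorted probabilities below are invented for illustration, and the list comprehensions mirror the tensor expressions in top_p_sampling rather than calling it.

```python
from itertools import accumulate

p = 0.9
sorted_probs = [0.58, 0.21, 0.08, 0.05, 0.04, 0.04]  # already descending
cumulative = list(accumulate(sorted_probs))

# Naive mask: remove a token as soon as the running total reaches p.
naive_mask = [c >= p for c in cumulative]

# Shifted mask: remove a token only if the mass BEFORE it already reaches p.
shifted_mask = [(c - q) >= p for c, q in zip(cumulative, sorted_probs)]

naive_kept = naive_mask.count(False)
shifted_kept = shifted_mask.count(False)
naive_mass = sum(q for q, removed in zip(sorted_probs, naive_mask) if not removed)
shifted_mass = sum(q for q, removed in zip(sorted_probs, shifted_mask) if not removed)

print("naive keeps  ", naive_kept, "tokens, mass", round(naive_mass, 2))
print("shifted keeps", shifted_kept, "tokens, mass", round(shifted_mass, 2))
```

The naive mask keeps three tokens covering only 0.87 of the probability, which falls short of p. The shifted mask keeps four tokens covering 0.92, so the kept set always covers at least p.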

Seeing Top-P in Action

# Confident distribution: one word dominates
confident_logits = torch.tensor([10.0, 2.0, 1.0, 0.5, 0.1])
words = ["sunny", "cloudy", "rainy", "windy", "foggy"]

probs = top_p_sampling(confident_logits, p=0.9)
print("Confident model (p=0.9):")
for word, prob in zip(words, probs):
    if prob > 0.001:
        print(f"  {word}: {prob:.3f}")
# Only "sunny" survives — it alone exceeds 90%

# Uncertain distribution: probabilities are spread out
uncertain_logits = torch.tensor([2.0, 1.8, 1.5, 1.3, 1.0])

probs = top_p_sampling(uncertain_logits, p=0.9)
print("\nUncertain model (p=0.9):")
for word, prob in zip(words, probs):
    if prob > 0.001:
        print(f"  {word}: {prob:.3f}")
# All five words survive: the first four together reach only about 89%

This adaptiveness is why top-p is often preferred over top-k.

6. Combining Strategies

In practice, you don’t use just one strategy. A widely used combination, and one that the APIs for models such as ChatGPT and Claude expose as parameters, is temperature + top-p. Here’s the order of operations:

Get raw logits from the model
Apply temperature scaling (divide logits by T)
Apply top-p filtering (zero out unlikely tokens)
Sample from the remaining distribution

def generate(model, tokenizer, prompt, max_new_tokens=100,
             temperature=0.8, top_k=None, top_p=0.9):
    """
    Generate text with temperature, top-k, and top-p sampling.

    Args:
        model: trained GPT model
        tokenizer: tokenizer with encode/decode
        prompt: starting text
        max_new_tokens: number of tokens to generate
        temperature: controls randomness (0.1-2.0 typical)
        top_k: if set, only consider top-k tokens
        top_p: if set, use nucleus sampling with this threshold
    """
    token_ids = tokenizer.encode(prompt)
    token_ids = torch.tensor([token_ids], dtype=torch.long)

    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(token_ids)
            next_logits = logits[0, -1, :]  # (vocab_size,)

            # Step 1: Apply temperature
            next_logits = next_logits / temperature

            # Step 2: Apply top-k filtering (if specified)
            if top_k is not None:
                top_k_values, _ = torch.topk(next_logits, top_k)
                threshold = top_k_values[-1]
                next_logits[next_logits < threshold] = float('-inf')

            # Step 3: Apply top-p filtering (if specified)
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(
                    next_logits, descending=True
                )
                sorted_probs = F.softmax(sorted_logits, dim=-1)
                cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

                # Remove tokens with cumulative probability above p
                sorted_mask = (cumulative_probs - sorted_probs) >= top_p
                sorted_logits[sorted_mask] = float('-inf')

                # Unsort
                next_logits = torch.zeros_like(next_logits)
                next_logits.scatter_(0, sorted_indices, sorted_logits)

            # Step 4: Sample
            probs = F.softmax(next_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            token_ids = torch.cat([token_ids, next_token.unsqueeze(0)], dim=1)

    return tokenizer.decode(token_ids[0].tolist())

Example Outputs

Different parameter combinations produce dramatically different text:

prompt = "Once upon a time"

# Conservative: predictable, coherent
print(generate(model, tokenizer, prompt, temperature=0.3, top_p=0.9))
# "Once upon a time there was a little girl who lived in a small village..."

# Balanced: natural, varied
print(generate(model, tokenizer, prompt, temperature=0.8, top_p=0.9))
# "Once upon a time the old merchant crossed the bridge with his peculiar cat..."

# Creative: surprising, occasionally odd
print(generate(model, tokenizer, prompt, temperature=1.2, top_p=0.95))
# "Once upon a time rain splattered across forgotten rooftops while dreams..."

A temperature of 0.7-0.9 combined with top_p of 0.9-0.95 is a good starting point for most text generation tasks. Use a lower temperature for tasks that need precision (like coding or summarization) and a higher one for creative tasks (like story writing).

7. KV-Caching for Efficient Generation

Now let’s talk about speed. There’s a massive inefficiency hiding in our generation loop. Look at what happens:

Step 1: Process tokens [A, B, C]           → predict next token D
Step 2: Process tokens [A, B, C, D]        → predict next token E
Step 3: Process tokens [A, B, C, D, E]     → predict next token F
Step 4: Process tokens [A, B, C, D, E, F]  → predict next token G

At step 4, the model recomputes attention for tokens A, B, C, D, and E — even though it already computed those exact results in step 3! Only token F is new. We’re doing the same work over and over.

This is like re-reading an entire book every time you turn to a new page. By chapter 30, you’re spending about 97% of your time re-reading chapters 1-29 and only about 3% on the new material.

KV-caching solves this by saving the key and value tensors from attention. Remember from Chapter 5 that attention computes softmax(Q @ K^T / sqrt(d_k)) @ V. The keys (K) and values (V) for old tokens don’t change when we add a new token, because causal masking means earlier positions never see later ones; only the new token generates new keys and values. So we cache the old ones and compute only the new ones.

How KV-Cache Works

Without cache:

Step 1: Compute K,V for [A, B, C]        → 3 key-value pairs
Step 2: Compute K,V for [A, B, C, D]     → 4 key-value pairs (3 redundant!)
Step 3: Compute K,V for [A, B, C, D, E]  → 5 key-value pairs (4 redundant!)

With cache:

Step 1: Compute K,V for [A, B, C]            → cache 3 pairs
Step 2: Compute K,V for [D] only, append     → cache now has 4 pairs
Step 3: Compute K,V for [E] only, append     → cache now has 5 pairs

At step 3, instead of computing 5 key-value pairs, we compute only 1 and concatenate it with the 4 cached pairs. This turns generation from $O(n^2)$ per token to $O(n)$ per token — a huge speedup for long sequences.
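The savings can be tallied with a small counting sketch. It counts only how many token positions have their keys and values computed (per layer) and ignores all other costs; the two function names are hypothetical.

```python
def kv_computed_without_cache(prompt_len, steps):
    """Each step recomputes K and V for the whole sequence seen so far."""
    return sum(prompt_len + step for step in range(steps))


def kv_computed_with_cache(prompt_len, steps):
    """The prompt is processed once; every later step adds one new position."""
    return prompt_len + (steps - 1)


# The three-step example above: prompt [A, B, C], then D and E are appended.
print(kv_computed_without_cache(3, 3), "vs", kv_computed_with_cache(3, 3))

# A longer run: start from one token and take 1000 steps.
print(kv_computed_without_cache(1, 1000), "vs", kv_computed_with_cache(1, 1000))
```

For the three-step example this gives 12 computations without the cache (3 + 4 + 5) and 5 with it (3 + 1 + 1). The gap grows quadratically with sequence length.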

Basic Implementation

Let’s modify our attention layer to support KV-caching:

import torch.nn as nn


class CachedSelfAttention(nn.Module):
    """Self-attention with optional KV-caching for efficient generation."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None, kv_cache=None):
        """
        Args:
            x: input tensor (batch, seq_len, d_model)
            mask: causal mask
            kv_cache: tuple of (cached_keys, cached_values) or None

        Returns:
            output: (batch, seq_len, d_model)
            new_kv_cache: updated (keys, values) tuple
        """
        batch_size, seq_len, _ = x.shape

        # Compute Q, K, V for the NEW tokens only
        Q = self.W_q(x)  # (batch, seq_len, d_model)
        K = self.W_k(x)
        V = self.W_v(x)

        # Reshape for multi-head attention
        Q = Q.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        # Shape: (batch, n_heads, seq_len, head_dim)

        # If we have cached keys/values, concatenate
        if kv_cache is not None:
            cached_K, cached_V = kv_cache
            K = torch.cat([cached_K, K], dim=2)  # Append new keys
            V = torch.cat([cached_V, V], dim=2)  # Append new values

        # Save updated cache for next step
        new_kv_cache = (K, V)

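        # Scaled dot-product attention: softmax(Q @ K^T / sqrt(head_dim)) @ V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)

        # The mask must broadcast to (seq_len, total_len), where total_len
        # counts cached plus new tokens. A single new token may attend to
        # everything in the cache, so mask=None is fine for that step.
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))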

        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)

        # Reshape back
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        output = self.W_o(output)

        return output, new_kv_cache

Using the Cache in Generation

def generate_with_cache(model, tokenizer, prompt, max_new_tokens=100,
                        temperature=0.8, top_p=0.9):
    """
    Generate text using KV-caching for efficiency.

    First pass: process the entire prompt and fill the cache.
    Subsequent passes: process only the new token, using cached K,V.
    """
    token_ids = tokenizer.encode(prompt)
    token_ids = torch.tensor([token_ids], dtype=torch.long)

    model.eval()
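    # Assumes the model exposes config.n_layers and a forward_with_cache()
    # method that passes one (K, V) cache through each attention layer.
    kv_caches = [None] * model.config.n_layers  # One cache per layer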

    with torch.no_grad():
        # First pass: process the full prompt
        logits, kv_caches = model.forward_with_cache(token_ids, kv_caches)
        next_logits = logits[0, -1, :]

        for _ in range(max_new_tokens):
            # Apply temperature + top-p
            next_logits = next_logits / temperature
            probs = top_p_sampling(next_logits, p=top_p)
            next_token = torch.multinomial(probs, num_samples=1)

            token_ids = torch.cat([token_ids, next_token.unsqueeze(0)], dim=1)

            # Subsequent passes: only process the NEW token
            new_input = next_token.unsqueeze(0)  # (1, 1)
            logits, kv_caches = model.forward_with_cache(new_input, kv_caches)
            next_logits = logits[0, -1, :]

    return tokenizer.decode(token_ids[0].tolist())

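The speedup is dramatic. Generating a 1000-token sequence one token at a time without a KV-cache computes keys and values for 1 + 2 + … + 1000 = 500,500 token positions in every layer. With a KV-cache, each position is computed once, for 1000 in total. That’s roughly a 500× reduction in redundant work. (Each new token still has to attend over all cached keys, so the attention step itself stays O(n) per token.)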

Note: Our simplified implementation above shows the core concept. Production implementations (like those in Hugging Face Transformers) handle additional details such as cache memory management and position encoding alignment. The principle is the same: cache keys and values, only compute new ones.

8. Repetition Penalty

Even with temperature and top-p, language models sometimes fall into loops: “The cat sat on the mat. The cat sat on the mat. The cat sat on the mat.” This happens because the model learns statistical patterns, and once it enters a common phrase, the next token is highly predictable.

Repetition penalty directly addresses this by reducing the probability of tokens that have already appeared in the generated text. The idea is simple: if a word has been used before, make it less likely to be chosen again.

Implementation

def apply_repetition_penalty(logits, generated_token_ids, penalty=1.2):
    """
    Reduce the logits of tokens that have already been generated.

    Args:
        logits: raw scores, shape (vocab_size,)
        generated_token_ids: list of token IDs generated so far
        penalty: float > 1.0. Higher = more penalty for repetition.
                 1.0 = no penalty.

    Returns:
        modified logits with repeated tokens penalized
    """
    logits = logits.clone()

    for token_id in set(generated_token_ids):
        # If the logit is positive, divide by penalty (make it smaller)
        # If the logit is negative, multiply by penalty (make it more negative)
        if logits[token_id] > 0:
            logits[token_id] = logits[token_id] / penalty
        else:
            logits[token_id] = logits[token_id] * penalty

    return logits

Why the positive/negative split? We want to always make repeated tokens less likely. For positive logits, dividing by a number > 1 makes them smaller (less probable). For negative logits, multiplying by a number > 1 makes them more negative (also less probable).
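A quick numeric check shows why a single rule would not work. The sketch below compares the sign-aware rule with a naive "always divide" rule on plain floats; penalize and naive_penalize are hypothetical helper names.

```python
def penalize(logit, penalty):
    """Sign-aware rule: always moves the logit downward."""
    return logit / penalty if logit > 0 else logit * penalty


def naive_penalize(logit, penalty):
    """Always divide: wrong for negative logits."""
    return logit / penalty


penalty = 1.5
for logit in [3.0, -1.0]:
    print(
        f"logit {logit:+.2f}: "
        f"sign-aware {penalize(logit, penalty):+.2f}, "
        f"naive {naive_penalize(logit, penalty):+.2f}"
    )
```

Dividing -1.0 by 1.5 gives about -0.67, which is higher than the original score, so the naive rule would make a repeated token more likely. The sign-aware rule gives -1.5 instead.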

# Example
logits = torch.tensor([5.0, 3.0, 2.0, 1.0, -1.0])
words = ["the", "cat", "sat", "on", "a"]
already_generated = [0, 1, 2]  # "the", "cat", "sat" already appeared

penalized = apply_repetition_penalty(logits, already_generated, penalty=1.5)

print("Original logits vs penalized:")
for word, orig, pen in zip(words, logits, penalized):
    marker = " ← penalized" if pen != orig else ""
    print(f"  {word:5s}: {orig:.1f} → {pen:.2f}{marker}")

# Output:
# Original logits vs penalized:
#   the  : 5.0 → 3.33 ← penalized
#   cat  : 3.0 → 2.00 ← penalized
#   sat  : 2.0 → 1.33 ← penalized
#   on   : 1.0 → 1.00
#   a    : -1.0 → -1.00

The previously used words — “the,” “cat,” “sat” — all get lower scores, making room for fresh words like “on” and “a.”

9. Complete Generation Script

Let’s put everything together into a polished, production-style generation function:

import torch
import torch.nn.functional as F


def generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=200,
    temperature=0.8,
    top_k=None,
    top_p=0.9,
    repetition_penalty=1.1,
    stop_tokens=None,
):
    """
    Generate text from a trained language model.

    Combines temperature scaling, top-k filtering, top-p (nucleus) sampling,
    and repetition penalty for high-quality text generation.

    Args:
        model: trained GPT model (nn.Module)
        tokenizer: tokenizer with encode() and decode() methods
        prompt: string to start generation from
        max_new_tokens: maximum number of tokens to generate
        temperature: controls randomness, must be > 0 (default 0.8)
            - 0.1-0.5: very focused, nearly deterministic
            - 0.7-0.9: balanced (recommended for most tasks)
            - 1.0-1.5: creative, more varied
        top_k: if set, only sample from top-k most probable tokens
        top_p: nucleus sampling threshold (default 0.9)
            - 0.1-0.5: very selective
            - 0.8-0.95: balanced (recommended)
            - 1.0: consider all tokens
        repetition_penalty: penalty for repeated tokens (default 1.1)
            - 1.0: no penalty
            - 1.1-1.3: mild to moderate penalty (recommended)
            - 1.5+: strong penalty
        stop_tokens: list of token IDs that end generation early

    Returns:
        generated text as a string
    """
    # Encode prompt
    token_ids = tokenizer.encode(prompt)
    generated = list(token_ids)  # Prompt tokens plus every token generated so far
    # (so the repetition penalty below also applies to tokens from the prompt)
    token_ids = torch.tensor([token_ids], dtype=torch.long)

    if stop_tokens is None:
        stop_tokens = []

    model.eval()
    with torch.no_grad():
        for step in range(max_new_tokens):
            # Forward pass
            logits = model(token_ids)
            next_logits = logits[0, -1, :].clone()  # (vocab_size,)

            # --- Step 1: Repetition penalty ---
            if repetition_penalty != 1.0:
                next_logits = apply_repetition_penalty(
                    next_logits, generated, repetition_penalty
                )

            # --- Step 2: Temperature ---
            if temperature != 1.0:
                next_logits = next_logits / temperature

            # --- Step 3: Top-k filtering ---
            if top_k is not None and top_k > 0:
                top_k_values, _ = torch.topk(next_logits, min(top_k, next_logits.size(-1)))
                threshold = top_k_values[-1]
                next_logits[next_logits < threshold] = float('-inf')

            # --- Step 4: Top-p (nucleus) filtering ---
            if top_p is not None and top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(next_logits, descending=True)
                sorted_probs = F.softmax(sorted_logits, dim=-1)
                cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

                sorted_mask = (cumulative_probs - sorted_probs) >= top_p
                sorted_logits[sorted_mask] = float('-inf')

                next_logits = torch.full_like(next_logits, float('-inf'))
                next_logits.scatter_(0, sorted_indices, sorted_logits)

            # --- Step 5: Sample ---
            probs = F.softmax(next_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)  # (1,)
            next_token_id = next_token.item()

            # Check for stop token
            if next_token_id in stop_tokens:
                break

            # Append to sequence
            generated.append(next_token_id)
            token_ids = torch.cat([token_ids, next_token.unsqueeze(0)], dim=1)

    return tokenizer.decode(generated)
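
The top-p mask in Step 4 is easy to misread, so here is a minimal plain-Python sketch of the same rule on a toy distribution. It assumes the probabilities are already sorted in descending order; the function name nucleus_keep_mask is made up for this illustration and is not part of the chapter's code. A token is dropped only when the tokens ranked above it already cover top_p, which is why the token that crosses the threshold is still kept.

```python
def nucleus_keep_mask(sorted_probs, top_p):
    """For probabilities sorted descending, return which entries top-p keeps.

    Mirrors the rule `(cumulative_probs - sorted_probs) >= top_p` means "drop":
    a token is dropped only if the mass of higher-ranked tokens reaches top_p.
    """
    keep = []
    mass_before = 0.0  # total probability of the tokens ranked above this one
    for prob in sorted_probs:
        keep.append(mass_before < top_p)
        mass_before += prob
    return keep


toy_probs = [0.5, 0.3, 0.15, 0.05]  # toy 4-token distribution, already sorted
mask_07 = nucleus_keep_mask(toy_probs, top_p=0.7)
mask_tiny = nucleus_keep_mask(toy_probs, top_p=0.01)
print(mask_07)
print(mask_tiny)
```

Because the first token always has zero mass ranked above it, any positive top_p keeps at least one token, so the later softmax never sees a vector that is entirely -inf.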

Using the Complete Generator

# Conservative — good for factual content, summaries
output = generate(
    model, tokenizer,
    prompt="The three main components of a transformer are",
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.1,
    max_new_tokens=100,
)
print("=== Conservative ===")
print(output)

# Balanced — good for general text, articles
output = generate(
    model, tokenizer,
    prompt="In the year 2050,",
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.2,
    max_new_tokens=100,
)
print("\n=== Balanced ===")
print(output)

# Creative — good for stories, poetry
output = generate(
    model, tokenizer,
    prompt="The old lighthouse keeper",
    temperature=1.2,
    top_p=0.95,
    repetition_penalty=1.3,
    max_new_tokens=100,
)
print("\n=== Creative ===")
print(output)

Parameter Quick Reference

Parameter	Low Value	High Value	Default
temperature	0.1 (near-deterministic)	1.5 (creative)	0.8
top_k	1 (greedy)	100 (permissive)	None
top_p	0.1 (selective)	1.0 (all tokens)	0.9
repetition_penalty	1.0 (none)	2.0 (strong)	1.1
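
To make the temperature row concrete, the following sketch applies three of the temperatures mentioned in this section to a toy logit vector. It is written in plain Python rather than PyTorch so it runs without a model; the three-token vocabulary and the helper name softmax_with_temperature are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical 3-token vocabulary
top_prob = {}
for t in (0.1, 0.8, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    top_prob[t] = probs[0]
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T=0.1 almost all of the probability lands on the top token, but not quite all of it, which is why low temperature is better described as near-deterministic than deterministic.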

10. Exercises

Exercise 1: Temperature Explorer

Write a function that generates text from the same prompt with five different temperatures [0.2, 0.5, 0.8, 1.0, 1.5] and prints the outputs grouped by temperature so you can compare them. Run it multiple times and observe:

At what temperature does the output start becoming incoherent?
At what temperature is the output most “human-like”?
Does running the same temperature twice always give the same result?

Solution

def temperature_explorer(model, tokenizer, prompt, max_new_tokens=60):
    """Generate text at multiple temperatures for comparison."""
    temperatures = [0.2, 0.5, 0.8, 1.0, 1.5]

    for temp in temperatures:
        print(f"\n{'='*60}")
        print(f"Temperature = {temp}")
        print(f"{'='*60}")

        # Generate 3 samples at each temperature to show variance
        for sample in range(3):
            output = generate(
                model, tokenizer, prompt,
                temperature=temp,
                top_p=0.95,
                max_new_tokens=max_new_tokens,
            )
            # Show only the generated part (not the prompt)
            generated_part = output[len(prompt):]
            print(f"  Sample {sample+1}: ...{generated_part}")

# Usage
temperature_explorer(model, tokenizer, "The future of artificial intelligence")

Observations:

T=0.2: Nearly identical outputs every run. Very predictable.
T=0.5: Minor variations, still coherent and “safe.”
T=0.8: Noticeably different each run. Natural-sounding.
T=1.0: More creative, occasionally surprising word choices.
T=1.5: Very random. Sometimes brilliant, sometimes gibberish. High variance.
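Running the same temperature twice generally gives different results, because each token is sampled at random. At very low T the samples tend to coincide, but only a fixed random seed or greedy decoding guarantees identical output.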
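
The role of the random seed can be shown without a model. The sketch below uses Python's standard random module as a stand-in for the sampler; sample_tokens and the toy distribution are hypothetical names used only here. In PyTorch, the analogous step would be calling torch.manual_seed(seed) before generate(), assuming the rest of the pipeline is deterministic on your hardware.

```python
import random


def sample_tokens(probs, n, seed=None):
    """Draw n token indices from probs; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)


toy_probs = [0.6, 0.3, 0.1]  # toy 3-token distribution

run_a = sample_tokens(toy_probs, 10, seed=0)
run_b = sample_tokens(toy_probs, 10, seed=0)   # same seed as run_a
run_c = sample_tokens(toy_probs, 10)           # unseeded: varies between runs

print("same seed, same output:", run_a == run_b)
print("unseeded sample:       ", run_c)
```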

Exercise 2: Implement Top-K + Top-P Together

Modify the top_k_sampling and top_p_sampling functions to work together: first apply top-k to narrow the candidates, then apply top-p within those candidates. Test with K=50, P=0.9 and compare against using top-p alone.

Solution

def top_k_top_p_sampling(logits, k=50, p=0.9):
    """
    Apply top-k THEN top-p filtering.
    Top-k first narrows to K candidates. Top-p then further filters
    within those K candidates based on cumulative probability.
    """
    logits = logits.clone()

    # Step 1: Top-k — keep only top K logits
    if k > 0:
        top_k_values, _ = torch.topk(logits, min(k, logits.size(-1)))
        threshold = top_k_values[-1]
        logits[logits < threshold] = float('-inf')

    # Step 2: Top-p — within the remaining, filter by cumulative prob
    if p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

        sorted_mask = (cumulative_probs - sorted_probs) >= p
        sorted_logits[sorted_mask] = float('-inf')

        logits = torch.full_like(logits, float('-inf'))
        logits.scatter_(0, sorted_indices, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return probs


# Compare results
logits = torch.randn(1000)  # Simulate 1000-word vocabulary

probs_p_only = top_p_sampling(logits, p=0.9)
probs_k_p = top_k_top_p_sampling(logits, k=50, p=0.9)

non_zero_p = (probs_p_only > 0).sum().item()
non_zero_kp = (probs_k_p > 0).sum().item()

print(f"Top-p only:      {non_zero_p} non-zero tokens")
print(f"Top-k + top-p:   {non_zero_kp} non-zero tokens")
# Top-k + top-p will always have <= min(k, top-p count) non-zero tokens

Key insight: Top-k provides a hard upper bound on vocabulary size, while top-p provides adaptive filtering. Using both gives you the best of both worlds — a guaranteed cap on vocabulary size with adaptive pruning within that cap.
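
The difference between the hard cap and the adaptive filter shows up most clearly when a peaked distribution is compared with a flat one. This is a plain-Python sketch that only counts surviving tokens; the 64-token vocabulary, k=32, and the helper names are assumptions chosen so the arithmetic is exact, not values from the chapter.

```python
def top_p_count(probs, p):
    """Size of the smallest descending-probability prefix whose mass reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


def top_k_then_top_p_count(probs, k, p):
    """Keep the k largest probabilities, renormalize, then apply top-p."""
    kept = sorted(probs, reverse=True)[:k]
    total = sum(kept)
    return top_p_count([x / total for x in kept], p)


peaked = [0.95] + [0.05 / 63] * 63  # the model is confident about one token
flat = [1 / 64] * 64                # the model is unsure: all tokens equally likely

for name, dist in (("peaked", peaked), ("flat", flat)):
    p_only = top_p_count(dist, 0.9)
    k_and_p = top_k_then_top_p_count(dist, k=32, p=0.9)
    print(f"{name}: top-p only keeps {p_only}, top-k + top-p keeps {k_and_p}")
```

On the peaked distribution both filters keep a single token, so top-k changes nothing. On the flat one, top-p alone keeps most of the vocabulary, and the top-k cap is what limits the candidate set.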

Exercise 3: Measure KV-Cache Speedup

Write a benchmark that generates 100 tokens with and without KV-caching and measures the time difference. Use time.time() to measure wall-clock time. How much faster is the cached version? How does the speedup change with longer prompts?

Solution

import time


def benchmark_generation(model, tokenizer, prompt, max_new_tokens=100):
    """Compare generation speed with and without KV-cache."""

    # Without cache (our original generate function)
    start = time.time()
    output_no_cache = generate(
        model, tokenizer, prompt,
        max_new_tokens=max_new_tokens,
        temperature=0.8,
        top_p=0.9,
    )
    time_no_cache = time.time() - start

    # With cache
    start = time.time()
    output_with_cache = generate_with_cache(
        model, tokenizer, prompt,
        max_new_tokens=max_new_tokens,
        temperature=0.8,
        top_p=0.9,
    )
    time_with_cache = time.time() - start

    print(f"Prompt length:    {len(tokenizer.encode(prompt))} tokens")
    print(f"Generated:        {max_new_tokens} tokens")
    print(f"Without cache:    {time_no_cache:.3f}s")
    print(f"With cache:       {time_with_cache:.3f}s")
    print(f"Speedup:          {time_no_cache / time_with_cache:.1f}x")


# Test with different prompt lengths
for prompt in ["Short prompt", "A " * 50, "A " * 200]:
    print(f"\n{'='*50}")
    benchmark_generation(model, tokenizer, prompt, max_new_tokens=100)

Expected observations:

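For short sequences the speedup is modest (roughly 2-3× on a small model), because fixed per-step overhead dominates.
The speedup grows with sequence length. For sequences of 200+ tokens, 5-10× or more is plausible, though the exact numbers depend on your hardware and model size.
The speedup comes from not recomputing keys and values for all previous tokens at each step; with the cache, only the newest token passes through the model.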
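
A back-of-the-envelope count shows where the saving comes from. The sketch below counts how many token positions are pushed through the model during a whole generation. It assumes the cached version processes the prompt once and then a single new token per step, which is the usual design but is stated here as an assumption about generate_with_cache; positions_processed is a hypothetical helper.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions passed through the model across one generation."""
    if use_cache:
        # One pass over the prompt, then one position for each later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-processes the prompt plus the i tokens so far.
    return sum(prompt_len + i for i in range(new_tokens))


for prompt_len in (2, 50, 200):
    no_cache = positions_processed(prompt_len, 100, use_cache=False)
    cached = positions_processed(prompt_len, 100, use_cache=True)
    print(f"prompt={prompt_len:>3}: {no_cache:>6} vs {cached:>4} positions "
          f"({no_cache / cached:.0f}x fewer)")
```

This ratio is an upper bound on the benefit, not a wall-clock prediction. Each cached step still attends over every stored key and value, and on a small model fixed costs such as Python overhead and kernel launches take up much of the time, so measured speedups are usually far smaller than the position counts suggest.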

Exercise 4: Build a Chat Loop

Combine the generation function with user input to create a simple interactive chat loop. The user types a message, the model responds, and the conversation continues. Hint: concatenate the conversation history as the prompt for each new generation.

Solution

def chat(model, tokenizer, max_history_tokens=512):
    """Simple interactive chat loop with a trained language model."""
    print("Chat with your model! Type 'quit' to exit.\n")

    history = ""

    while True:
        user_input = input("You: ")
        if user_input.lower() in ("quit", "exit", "q"):
            print("Goodbye!")
            break

        # Build the prompt from conversation history
        history += f"User: {user_input}\nAssistant:"

        # Truncate history if it's too long
        history_tokens = tokenizer.encode(history)
        if len(history_tokens) > max_history_tokens:
            # Keep only the most recent tokens
            history_tokens = history_tokens[-max_history_tokens:]
            history = tokenizer.decode(history_tokens)

        # Generate response
        full_output = generate(
            model, tokenizer,
            prompt=history,
            max_new_tokens=100,
            temperature=0.8,
            top_p=0.9,
            repetition_penalty=1.2,
        )

        # Extract only the new response
        response = full_output[len(history):].strip()

        # Cut off at the next "User:" if the model generates one
        if "User:" in response:
            response = response[:response.index("User:")].strip()

        print(f"Assistant: {response}\n")

        # Update history
        history += f" {response}\n"


# Run the chat
# chat(model, tokenizer)

Note: This simple chat loop has no special formatting or system prompts; it relies purely on the patterns in the training data. A model trained on dialogue data will tend to pick up the back-and-forth pattern. If your model has a fixed context length, make sure max_history_tokens plus max_new_tokens stays within it. For better results, fine-tuning on instruction-following data (covered in Chapter 10) dramatically improves chat quality.
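
The loop above truncates history at the token level, which can cut a turn in half. One possible refinement, sketched here under assumptions, is to store the conversation as a list of turns and drop whole turns from the front. The helper trim_history and the whitespace-based token counter are hypothetical; in practice you would pass something like lambda s: len(tokenizer.encode(s)).

```python
def trim_history(turns, max_tokens, count_tokens):
    """Keep the most recent whole turns whose combined token count fits max_tokens.

    The newest turn is always kept, even if it alone exceeds the budget.
    """
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if kept and used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))


def count_words(text):
    """Stand-in token counter: whitespace-separated words, not real tokens."""
    return len(text.split())


turns = [
    "User: hi\nAssistant: hello\n",
    "User: how are you?\nAssistant: fine\n",
    "User: tell me a story\nAssistant:",
]

short_history = trim_history(turns, max_tokens=10, count_tokens=count_words)
longer_history = trim_history(turns, max_tokens=12, count_tokens=count_words)
print("".join(longer_history))
```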

Summary

In this chapter, we went from a trained model that produces logits to a system that generates coherent, controllable text:

Autoregressive generation produces text one token at a time, feeding each generated token back as input. The loop is simple: predict → sample → append → repeat.
Greedy decoding always picks the most probable token. It’s simple but produces repetitive, boring output because it gets trapped in high-probability loops.
Temperature scales logits before softmax. Low temperature (< 1.0) sharpens the distribution for more predictable output. High temperature (> 1.0) flattens it for more creative output.
Top-k sampling limits choices to the K most probable tokens, preventing nonsense words from being selected while allowing variety among reasonable options.
Top-p (nucleus) sampling adaptively selects vocabulary size based on the model’s confidence. It keeps the smallest set of tokens whose cumulative probability reaches a threshold P.
Combining strategies: temperature plus top-p is a widely used default. Apply temperature first, then filter with top-p, then sample.
KV-caching eliminates redundant computation by saving key and value tensors from previous tokens. Only the new token’s keys and values need to be computed at each step.
Repetition penalty directly reduces the probability of previously generated tokens, breaking repetitive loops.

The complete generate() function you built in this chapter uses the same core sampling strategies as production LLMs like ChatGPT and Llama. The main difference is scale: your model has ~661K parameters, while production models have billions of parameters or more. The basic generation loop is the same.

In the next chapter, we’ll scale things up. Chapter 9 covers distributed training — how to train models so large they don’t fit on a single GPU, using data parallelism, model parallelism, and mixed-precision training. That’s where we go from a toy model to understanding how the big models are actually built.