Building the Model — From Blocks to a Complete LLM

In the previous chapter, you learned the three core transformer components: multi-head self-attention (how tokens look at each other), feed-forward networks (how each token processes information independently), and layer normalization with residual connections (how gradients flow through deep networks).

You now have all the Lego pieces. This chapter is about snapping them together.

Think of building a skyscraper. Each floor has the same blueprint: a lobby, elevators, offices, and restrooms. You don’t redesign the floor plan for every level — you repeat the same design, stacking floor upon floor. A transformer works exactly the same way. Each “floor” is a transformer block containing attention, feed-forward, and normalization layers. Stack enough of these identical floors and you get GPT-2, GPT-3, or any other large language model.

By the end of this chapter, you will:

Wrap attention + feed-forward + normalization into a clean TransformerBlock module
Stack multiple blocks to build depth
Assemble a complete GPT-style model with embeddings, blocks, and an output head
Understand what the model outputs (logits) and how to turn them into probabilities
Count every parameter in the model
Manage hyperparameters with a clean configuration class
Run a complete forward pass from token IDs to predictions

Let’s build.

1. The Big Picture

Here’s the full architecture we’re going to construct, top to bottom:

Input token IDs: [14, 87, 203, 55]
         │
    ┌────▼─────┐
    │  Token    │  Look up a dense vector for each token
    │ Embedding │
    └────┬─────┘
         │
    ┌────▼──────┐
    │ Position   │  Add position information
    │ Embedding  │
    └────┬──────┘
         │
    ┌────▼──────────────┐
    │ Transformer Block 1│  Norm → Attention → Norm → FFN, each with a residual add
    └────┬──────────────┘
         │
    ┌────▼──────────────┐
    │ Transformer Block 2│  Same structure, different weights
    └────┬──────────────┘
         │
        ...               (repeat N times)
         │
    ┌────▼──────┐
    │ Final     │  One last normalization
    │ LayerNorm │
    └────┬──────┘
         │
    ┌────▼──────────┐
    │ Output Linear  │  Project to vocabulary size
    └────┬──────────┘
         │
    Output logits: scores for every word in vocabulary

Every component here is something you’ve already built. Now we’re wiring them together into a single nn.Module that PyTorch can train.

2. The Transformer Decoder Block

Let’s start with a single “floor” of our skyscraper. A transformer decoder block takes a tensor of shape (batch_size, seq_len, d_model), runs it through attention and a feed-forward network with residual connections and layer normalization, and outputs a tensor of the exact same shape.

This is critical: the input shape equals the output shape. That’s what makes stacking possible. If the first block changed the shape, we couldn’t feed its output into a second identical block.

import torch
import torch.nn as nn
import math


class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention from CH5, packaged as a module."""

    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # Combined projection for Q, K, V (more efficient than three separate ones)
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        batch_size, seq_len, d_model = x.shape

        # Project to Q, K, V all at once, then split
        qkv = self.qkv_proj(x)  # (batch, seq_len, 3 * d_model)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, n_heads, seq_len, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        scale = math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) / scale
        # scores: (batch, n_heads, seq_len, seq_len)

        # Apply causal mask: prevent attending to future tokens
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Weighted sum of values
        attn_output = torch.matmul(attn_weights, v)
        # (batch, n_heads, seq_len, head_dim)

        # Concatenate heads and project
        attn_output = attn_output.transpose(1, 2).reshape(batch_size, seq_len, d_model)
        return self.out_proj(attn_output)


class FeedForward(nn.Module):
    """Position-wise feed-forward network from CH5."""

    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),              # GPT-2 uses GELU, not ReLU
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)  # (batch, seq_len, d_model) → (batch, seq_len, d_model)

Now we wrap these into a single block with residual connections and layer normalization:

class TransformerBlock(nn.Module):
    """A single transformer decoder block.

    Input:  (batch_size, seq_len, d_model)
    Output: (batch_size, seq_len, d_model)  ← same shape!
    """

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()

        # Sub-layers
        self.attention = MultiHeadSelfAttention(d_model, n_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)

        # Layer normalization (applied BEFORE each sub-layer — "Pre-Norm")
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout for residual connections
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm attention with residual connection
        # x → Norm → Attention → Dropout → + x
        normed = self.norm1(x)
        attn_out = self.attention(normed, mask)
        x = x + self.dropout(attn_out)

        # Pre-norm feed-forward with residual connection
        # x → Norm → FFN → + x
        normed = self.norm2(x)
        ff_out = self.feed_forward(normed)
        x = x + ff_out

        return x  # Same shape: (batch_size, seq_len, d_model)

Let’s verify the shape is preserved:

# Quick test
d_model = 128
n_heads = 4
d_ff = 512

block = TransformerBlock(d_model, n_heads, d_ff)

# Fake input: batch of 2 sequences, each 10 tokens, 128-dimensional
x = torch.randn(2, 10, d_model)
output = block(x)

print(f"Input shape:  {x.shape}")      # torch.Size([2, 10, 128])
print(f"Output shape: {output.shape}")  # torch.Size([2, 10, 128])
# ✓ Shapes match — we can stack these!

Notice we’re using Pre-Norm ordering (normalize before each sub-layer) rather than Post-Norm (normalize after). GPT-2 and most modern models use Pre-Norm because it makes training more stable, especially for deep networks.
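
To make the two orderings concrete, here is a minimal sketch that applies each one to the same sub-layer. The sub-layer is a stand-in nn.Linear (an assumption for illustration, not the attention module above), and the function names pre_norm_step and post_norm_step are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the FFN


def pre_norm_step(x):
    # Normalize first, then add the sub-layer output to the untouched input.
    return x + sublayer(norm(x))


def post_norm_step(x):
    # Add first, then normalize the sum (the ordering of the original Transformer).
    return norm(x + sublayer(x))


x = torch.randn(2, 5, d_model) * 10  # deliberately large activations
pre = pre_norm_step(x)
post = post_norm_step(x)

print(pre.shape, post.shape)                  # both keep the input shape
print(post.mean(dim=-1).abs().max().item())   # per-token mean is ~0 after Post-Norm
print(pre.std(dim=-1).mean().item())          # Pre-Norm leaves the residual stream unnormalized
```

In the Pre-Norm version the input x reaches the output through a plain addition, so the residual path is never rescaled. That is also why the full model needs one final LayerNorm after the last block.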

3. Stacking Blocks

Now for the magic trick: stacking. Since every block has the same input and output shape, we can chain as many as we want.

PyTorch provides nn.ModuleList for exactly this purpose. It’s like a regular Python list, but it tells PyTorch “these are all modules with learnable parameters — please track them.”

# Stack 4 transformer blocks
n_layers = 4
blocks = nn.ModuleList([
    TransformerBlock(d_model=128, n_heads=4, d_ff=512)
    for _ in range(n_layers)
])

# Data flows through each block in sequence
x = torch.randn(2, 10, 128)  # (batch=2, seq_len=10, d_model=128)

for i, block in enumerate(blocks):
    x = block(x)
    print(f"After block {i}: {x.shape}")

# After block 0: torch.Size([2, 10, 128])
# After block 1: torch.Size([2, 10, 128])
# After block 2: torch.Size([2, 10, 128])
# After block 3: torch.Size([2, 10, 128])

Every block preserves the shape. The data gets “refined” as it passes through each layer. In a trained model, early layers tend to pick up simpler patterns (like word associations), while deeper layers tend to capture more abstract relationships (like grammar and logic).
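
To see why nn.ModuleList matters, the sketch below compares it with a plain Python list. The two classes and the n_params helper are hypothetical names used only for this comparison.

```python
import torch.nn as nn


class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain list: PyTorch does not register these layers as submodules.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer, so its parameters are tracked.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])


def n_params(module):
    return sum(p.numel() for p in module.parameters())


plain_count = n_params(WithPlainList())
registered_count = n_params(WithModuleList())
print(plain_count)       # 0
print(registered_count)  # 60  (3 layers x (4*4 weights + 4 biases))
```

With the plain list, an optimizer built from model.parameters() would see nothing to update, and calls such as .to(device) would skip those layers.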

Why Start Small?

We’re using d_model=128, n_heads=4, n_layers=4, a tiny model by any standard. GPT-2 Small uses d_model=768, n_heads=12, n_layers=12. Why start small?

Fast training: A tiny model trains in seconds, not hours. You can experiment freely.
Easy debugging: When something goes wrong, there are fewer places to look.
Same architecture: The code doesn’t change when you scale up. Only the numbers change.

Once your tiny model works correctly, scaling up is literally changing four numbers in a config file. We’ll see that shortly.

4. The Full GPT Model Class

Now let’s assemble everything into a complete model. A GPT-style language model has five components:

Token embedding — converts token IDs to vectors
Position embedding — adds position information
N transformer blocks — the core processing layers
Final layer norm — stabilizes the output
Output projection — maps from d_model dimensions to vocab_size scores

class GPTModel(nn.Module):
    """A complete GPT-style language model.

    Takes token IDs as input, produces logits (scores) for every
    token in the vocabulary as output.
    """

    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff,
                 max_seq_len, dropout=0.1):
        super().__init__()

        # --- Embeddings ---
        # Token embedding: vocab_size → d_model
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Position embedding: max_seq_len → d_model (learned, not sinusoidal)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

        self.dropout = nn.Dropout(dropout)

        # --- Transformer blocks ---
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        # --- Output head ---
        self.final_norm = nn.LayerNorm(d_model)
        self.output_proj = nn.Linear(d_model, vocab_size, bias=False)

        # Store config
        self.max_seq_len = max_seq_len
        self.d_model = d_model

    def forward(self, token_ids):
        """
        Args:
            token_ids: (batch_size, seq_len) — integer token IDs

        Returns:
            logits: (batch_size, seq_len, vocab_size) — raw prediction scores
        """
        batch_size, seq_len = token_ids.shape
        assert seq_len <= self.max_seq_len, \
            f"Sequence length {seq_len} exceeds max {self.max_seq_len}"

        # Step 1: Token embeddings
        # (batch_size, seq_len) → (batch_size, seq_len, d_model)
        tok_emb = self.token_embedding(token_ids)

        # Step 2: Position embeddings
        # Create position indices: [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=token_ids.device)
        # (seq_len,) → (seq_len, d_model) → broadcasts to (batch_size, seq_len, d_model)
        pos_emb = self.position_embedding(positions)

        # Step 3: Combine and apply dropout
        # (batch_size, seq_len, d_model)
        x = self.dropout(tok_emb + pos_emb)

        # Step 4: Create causal mask
        # Lower-triangular matrix of ones: 1 = may attend, 0 = future token
        # (the attention module fills the 0 entries with -inf before softmax)
        causal_mask = torch.tril(torch.ones(seq_len, seq_len, device=token_ids.device))
        # causal_mask shape: (seq_len, seq_len)
        # [[1, 0, 0, 0],
        #  [1, 1, 0, 0],
        #  [1, 1, 1, 0],
        #  [1, 1, 1, 1]]

        # Step 5: Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask=causal_mask)
        # Still (batch_size, seq_len, d_model)

        # Step 6: Final layer normalization
        x = self.final_norm(x)
        # Still (batch_size, seq_len, d_model)

        # Step 7: Project to vocabulary size
        logits = self.output_proj(x)
        # (batch_size, seq_len, d_model) → (batch_size, seq_len, vocab_size)

        return logits

Let’s make sure it works:

# Create a tiny model
model = GPTModel(
    vocab_size=1000,
    d_model=128,
    n_heads=4,
    n_layers=2,
    d_ff=512,
    max_seq_len=64,
    dropout=0.1,
)

# Fake input: batch of 2 sequences, each 10 tokens
token_ids = torch.randint(0, 1000, (2, 10))  # Random token IDs

# Forward pass
logits = model(token_ids)
print(f"Input shape:  {token_ids.shape}")  # torch.Size([2, 10])
print(f"Output shape: {logits.shape}")     # torch.Size([2, 10, 1000])
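
The embedding step in forward() relies on broadcasting: a (seq_len, d_model) position tensor is added to a (batch_size, seq_len, d_model) token tensor. Here is a minimal sketch with made-up toy values (zeros for the token embeddings, so the position values stay visible):

```python
import torch

batch_size, seq_len, d_model = 2, 4, 3
tok_emb = torch.zeros(batch_size, seq_len, d_model)
pos_emb = torch.arange(seq_len * d_model, dtype=torch.float32).reshape(seq_len, d_model)

# pos_emb has no batch dimension, so it is broadcast across the batch.
combined = tok_emb + pos_emb

print(combined.shape)                          # torch.Size([2, 4, 3])
print(torch.equal(combined[0], combined[1]))   # True: every sequence gets the same position vectors
```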

The model takes in token IDs and produces a score for every word in the vocabulary at every position. But what do those scores mean? Let’s find out.

5. Understanding the Output

What Are Logits?

The model outputs a tensor of shape (batch_size, seq_len, vocab_size). These raw scores are called logits (pronounced “LOH-jits”). For each position in the sequence, the model produces one score per token in the vocabulary.

Think of it like a panel of judges scoring contestants. At each position, the model “scores” every word in the vocabulary on how likely it is to come next. Higher score = more likely. These scores are not probabilities yet — they can be negative, and they don’t sum to 1.

# Look at the logits for the last position in the first sequence
last_position_logits = logits[0, -1, :]  # (vocab_size,) = (1000,)

print(f"Logits shape: {last_position_logits.shape}")
print(f"Min logit:    {last_position_logits.min().item():.4f}")
print(f"Max logit:    {last_position_logits.max().item():.4f}")
print(f"Sum of logits: {last_position_logits.sum().item():.4f}")
# Sum is NOT 1.0 — these are not probabilities yet!

From Logits to Probabilities

To turn logits into probabilities, we apply the softmax function. Softmax does two things: it makes all values positive, and it makes them sum to 1.

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

# Convert logits to probabilities using softmax
probabilities = torch.softmax(last_position_logits, dim=-1)

print(f"Min probability: {probabilities.min().item():.6f}")
print(f"Max probability: {probabilities.max().item():.6f}")
print(f"Sum:             {probabilities.sum().item():.6f}")
# Sum ≈ 1.0 — now they're real probabilities!

# Which word has the highest probability?
predicted_token_id = probabilities.argmax().item()
print(f"Most likely next token ID: {predicted_token_id}")
print(f"Its probability: {probabilities[predicted_token_id].item():.6f}")

Right now the model is untrained, so these probabilities are essentially random — every word is roughly equally likely. After training (Chapter 7), the model will assign high probability to words that make sense in context and near-zero probability to words that don’t.
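
The softmax formula above can also be written out by hand. The sketch below is one common way to do it, not PyTorch's actual implementation. It shows why subtracting the maximum before exponentiating matters. The function names and the logit values are made up for illustration.

```python
import torch


def naive_softmax(z):
    # Direct translation of the formula; exp() overflows for large logits.
    exp = torch.exp(z)
    return exp / exp.sum()


def stable_softmax(z):
    # Subtracting the max leaves the result unchanged but keeps exp() in range.
    exp = torch.exp(z - z.max())
    return exp / exp.sum()


small = torch.tensor([2.0, 1.0, -1.0])
large = torch.tensor([1000.0, 999.0, 997.0])  # same gaps as `small`, shifted by 998

print(naive_softmax(small))
print(stable_softmax(small))   # matches the line above
print(naive_softmax(large))    # contains nan because exp(1000) overflows to inf
print(stable_softmax(large))   # same values as for `small`
```

Softmax depends only on the differences between logits, which is why shifting all of them by the same constant is safe.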

Why Logits, Not Probabilities?

You might wonder: why doesn’t the model output probabilities directly? Two reasons:

Numerical stability: During training, we compute the loss function using logits. PyTorch’s CrossEntropyLoss takes raw logits and applies log-softmax internally in a numerically stable way. If we applied softmax ourselves first, we’d lose precision, and the loss would end up applying softmax twice.
Flexibility: Sometimes we want to manipulate the scores before converting to probabilities — for example, dividing by a “temperature” value to make the model more or less creative. Working with raw logits makes this easy.
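
Both reasons can be sketched in a few lines. The logits and target below are made up (one position, a 3-token vocabulary), and the temperature values are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])  # (1 position, 3-token vocabulary)
target = torch.tensor([0])                # index of the correct next token

# Reason 1: cross_entropy consumes raw logits directly.
# It equals the negative log-softmax value at the target index.
loss = F.cross_entropy(logits, target)
manual = -F.log_softmax(logits, dim=-1)[0, target[0]]
print(loss.item(), manual.item())  # same value twice

# Reason 2: dividing logits by a temperature reshapes the distribution.
sharp = torch.softmax(logits / 0.5, dim=-1)  # low temperature: more peaked
plain = torch.softmax(logits / 1.0, dim=-1)
flat = torch.softmax(logits / 2.0, dim=-1)   # high temperature: more uniform
print(sharp, plain, flat, sep="\n")
```

Neither manipulation would be as direct if the model returned probabilities, since the original scores would first have to be recovered.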

6. Parameter Counting

How many learnable parameters does our tiny model have? Let’s count them one component at a time.

Manual Calculation

Our model has: vocab_size=1000, d_model=128, n_heads=4, n_layers=2, d_ff=512, max_seq_len=64.

Token embedding: A lookup table of vocab_size × d_model values.

$$1000 \times 128 = 128{,}000$$

Position embedding: A lookup table of max_seq_len × d_model values.

$$64 \times 128 = 8{,}192$$

Per transformer block:

QKV projection: weight (d_model, 3 × d_model) + bias (3 × d_model) = $128 \times 384 + 384 = 49{,}536$
Output projection: weight (d_model, d_model) + bias (d_model) = $128 \times 128 + 128 = 16{,}512$
LayerNorm 1: scale (d_model) + shift (d_model) = $128 + 128 = 256$
FFN linear 1: weight (d_model, d_ff) + bias (d_ff) = $128 \times 512 + 512 = 66{,}048$
FFN linear 2: weight (d_ff, d_model) + bias (d_model) = $512 \times 128 + 128 = 65{,}664$
LayerNorm 2: scale + shift = $256$
Block total: $198{,}272$

2 blocks: $2 \times 198{,}272 = 396{,}544$

Final LayerNorm: $256$

Output projection (no bias): weight (d_model, vocab_size) = $128 \times 1000 = 128{,}000$

Grand total: $128{,}000 + 8{,}192 + 396{,}544 + 256 + 128{,}000 = 660{,}992$

About 661K parameters. Let’s verify with PyTorch:

def count_parameters(model):
    """Count all trainable parameters in a model."""
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total

total_params = count_parameters(model)
print(f"Total trainable parameters: {total_params:,}")
# Total trainable parameters: 660,992

# Break it down by component
print("\nParameter breakdown:")
print(f"  Token embedding:    {model.token_embedding.weight.numel():>10,}")
print(f"  Position embedding: {model.position_embedding.weight.numel():>10,}")

for i, block in enumerate(model.blocks):
    block_params = sum(p.numel() for p in block.parameters())
    print(f"  Block {i}:             {block_params:>10,}")

print(f"  Final LayerNorm:    {sum(p.numel() for p in model.final_norm.parameters()):>10,}")
print(f"  Output projection:  {model.output_proj.weight.numel():>10,}")
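
The manual tally can also be written as a closed-form function of the hyperparameters. This is a sketch for the architecture in this chapter only (biased linear layers in the blocks, a bias-free and untied output projection). The helper name gpt_param_count is hypothetical.

```python
def gpt_param_count(vocab_size, d_model, n_layers, d_ff, max_seq_len):
    """Closed-form parameter count for this chapter's GPTModel layout."""
    embeddings = vocab_size * d_model + max_seq_len * d_model
    # QKV projection + attention output projection (weights and biases)
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two FFN linear layers (weights and biases)
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with scale and shift
    layer_norms = 2 * (2 * d_model)
    block = attention + ffn + layer_norms
    final_norm = 2 * d_model
    output_proj = d_model * vocab_size  # no bias
    return embeddings + n_layers * block + final_norm + output_proj


tiny_total = gpt_param_count(vocab_size=1000, d_model=128, n_layers=2,
                             d_ff=512, max_seq_len=64)
print(tiny_total)  # 660992
```

Note that n_heads does not appear anywhere: splitting d_model across more heads changes how the projections are reshaped, not how many weights they hold.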

How Does This Compare?

Model	Parameters	d_model	n_heads	n_layers	d_ff
Ours	661K	128	4	2	512
GPT-2 Small	124M	768	12	12	3072
GPT-2 Medium	355M	1024	16	24	4096
GPT-2 Large	774M	1280	20	36	5120
GPT-3	175B	12288	96	96	49152

Our model is about 188 times smaller than GPT-2 Small. The difference comes mainly from wider vectors (d_model, and with it d_ff), more layers, and a much larger vocabulary. The number of attention heads changes how d_model is split up, not the parameter count. The architectural design is identical; only the numbers change.

7. Model Configuration

Passing seven arguments to the constructor is clunky and error-prone. Real codebases use a configuration object to bundle hyperparameters together. Python’s dataclass is perfect for this:

from dataclasses import dataclass


@dataclass
class GPTConfig:
    """Configuration for a GPT-style language model."""

    vocab_size: int = 1000
    d_model: int = 128
    n_heads: int = 4
    n_layers: int = 2
    d_ff: int = 512
    max_seq_len: int = 64
    dropout: float = 0.1

    def __post_init__(self):
        """Validate configuration after initialization."""
        assert self.d_model % self.n_heads == 0, \
            f"d_model ({self.d_model}) must be divisible by n_heads ({self.n_heads})"
        assert self.d_ff > self.d_model, \
            f"d_ff ({self.d_ff}) should be larger than d_model ({self.d_model})"

Now let’s update our model to accept a config:

class GPTModel(nn.Module):
    """A complete GPT-style language model, configured via GPTConfig."""

    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        # Embeddings
        self.token_embedding = nn.Embedding(config.vocab_size, config.d_model)
        self.position_embedding = nn.Embedding(config.max_seq_len, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config.d_model, config.n_heads,
                             config.d_ff, config.dropout)
            for _ in range(config.n_layers)
        ])

        # Output head
        self.final_norm = nn.LayerNorm(config.d_model)
        self.output_proj = nn.Linear(config.d_model, config.vocab_size, bias=False)

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        assert seq_len <= self.config.max_seq_len

        # Embeddings
        tok_emb = self.token_embedding(token_ids)
        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.position_embedding(positions)
        x = self.dropout(tok_emb + pos_emb)

        # Causal mask
        causal_mask = torch.tril(
            torch.ones(seq_len, seq_len, device=token_ids.device)
        )

        # Transformer blocks
        for block in self.blocks:
            x = block(x, mask=causal_mask)

        # Output
        x = self.final_norm(x)
        logits = self.output_proj(x)
        return logits

Creating models is now clean and readable:

# Tiny model for experimentation
tiny_config = GPTConfig(
    vocab_size=1000, d_model=128, n_heads=4,
    n_layers=2, d_ff=512, max_seq_len=64,
)
tiny_model = GPTModel(tiny_config)
print(f"Tiny model: {count_parameters(tiny_model):,} params")
# Tiny model: 660,992 params

# Larger model — same code, different numbers
medium_config = GPTConfig(
    vocab_size=50257, d_model=768, n_heads=12,
    n_layers=12, d_ff=3072, max_seq_len=1024,
)
medium_model = GPTModel(medium_config)
print(f"Medium model: {count_parameters(medium_model):,} params")
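# Medium model: 163,037,184 params (GPT-2 Small dimensions).
# GPT-2's published ~124M figure counts the output projection as
# shared with the token embedding, which this model does not do.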

The config also serves as documentation. Anyone reading your code can immediately see the model’s size and structure by looking at the config object.
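
Because the config is a dataclass, it can also be copied with one field changed, or saved alongside a checkpoint, using only the standard library. The sketch below uses a hypothetical stand-in class, DemoConfig, with the same fields as GPTConfig so that it runs on its own.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass
class DemoConfig:  # hypothetical stand-in with the same fields as GPTConfig
    vocab_size: int = 1000
    d_model: int = 128
    n_heads: int = 4
    n_layers: int = 2
    d_ff: int = 512
    max_seq_len: int = 64
    dropout: float = 0.1


base = DemoConfig()
deeper = replace(base, n_layers=4)  # copy with a single field changed

saved = json.dumps(asdict(deeper))           # plain JSON string, e.g. to write to disk
restored = DemoConfig(**json.loads(saved))   # rebuild the config from that string

print(saved)
print(restored == deeper)  # True: dataclasses compare field by field
```

Rebuilding through the constructor means any validation in __post_init__ runs again on the restored values.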

Predefined Configurations

You might define common configurations as class methods:

@dataclass
class GPTConfig:
    vocab_size: int = 1000
    d_model: int = 128
    n_heads: int = 4
    n_layers: int = 2
    d_ff: int = 512
    max_seq_len: int = 64
    dropout: float = 0.1

    def __post_init__(self):
        assert self.d_model % self.n_heads == 0

    @classmethod
    def tiny(cls):
        """Tiny model for testing and debugging."""
        return cls(vocab_size=1000, d_model=128, n_heads=4,
                   n_layers=2, d_ff=512, max_seq_len=64)

    @classmethod
    def gpt2_small(cls):
        """GPT-2 Small dimensions (124M params when output weights are tied)."""
        return cls(vocab_size=50257, d_model=768, n_heads=12,
                   n_layers=12, d_ff=3072, max_seq_len=1024)

    @classmethod
    def gpt2_medium(cls):
        """GPT-2 Medium dimensions (355M params when output weights are tied)."""
        return cls(vocab_size=50257, d_model=1024, n_heads=16,
                   n_layers=24, d_ff=4096, max_seq_len=1024)

8. A Complete Forward Pass

Let’s trace data through the entire model, printing shapes at every stage. This is the most important section of the chapter — it connects every concept from Chapters 3 through 6 into one pipeline.

import torch
import torch.nn as nn
import math
from dataclasses import dataclass


# ────────────────────────────────────────────────
# Step 0: Configuration
# ────────────────────────────────────────────────
@dataclass
class GPTConfig:
    vocab_size: int = 1000
    d_model: int = 128
    n_heads: int = 4
    n_layers: int = 2
    d_ff: int = 512
    max_seq_len: int = 64
    dropout: float = 0.0  # No dropout for this demo


# ────────────────────────────────────────────────
# Build the model
# ────────────────────────────────────────────────
config = GPTConfig()
model = GPTModel(config)
model.eval()  # Disable dropout for deterministic output

print(f"Model created with {count_parameters(model):,} parameters\n")


# ────────────────────────────────────────────────
# Step 1: Start with a "sentence" (token IDs)
# ────────────────────────────────────────────────
# Imagine our tokenizer encoded "The cat sat on" as these IDs:
token_ids = torch.tensor([[42, 15, 87, 203]])  # (1, 4) — batch of 1, 4 tokens
print(f"1. Input token IDs:        {token_ids.shape}")
print(f"   Values: {token_ids.tolist()}\n")


# ────────────────────────────────────────────────
# Step 2: Token embedding
# ────────────────────────────────────────────────
tok_emb = model.token_embedding(token_ids)
print(f"2. After token embedding:  {tok_emb.shape}")
print(f"   Each token ID → {config.d_model}-dimensional vector\n")


# ────────────────────────────────────────────────
# Step 3: Position embedding
# ────────────────────────────────────────────────
positions = torch.arange(token_ids.size(1))
pos_emb = model.position_embedding(positions)
print(f"3. Position embedding:     {pos_emb.shape}")
print(f"   Positions [0, 1, 2, 3] → {config.d_model}-dim vectors\n")


# ────────────────────────────────────────────────
# Step 4: Combine embeddings
# ────────────────────────────────────────────────
x = tok_emb + pos_emb
print(f"4. Combined (tok + pos):   {x.shape}\n")


# ────────────────────────────────────────────────
# Step 5: Pass through transformer blocks
# ────────────────────────────────────────────────
seq_len = token_ids.size(1)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

for i, block in enumerate(model.blocks):
    x = block(x, mask=causal_mask)
    print(f"5.{i} After transformer block {i}: {x.shape}")

print()


# ────────────────────────────────────────────────
# Step 6: Final layer norm
# ────────────────────────────────────────────────
x = model.final_norm(x)
print(f"6. After final LayerNorm:  {x.shape}\n")


# ────────────────────────────────────────────────
# Step 7: Output projection → logits
# ────────────────────────────────────────────────
logits = model.output_proj(x)
print(f"7. Output logits:          {logits.shape}")
print(f"   = scores for {config.vocab_size} vocab words at each of {seq_len} positions\n")


# ────────────────────────────────────────────────
# Step 8: Get the predicted next word
# ────────────────────────────────────────────────
# We care about the LAST position — it predicts what comes after "on"
last_logits = logits[0, -1, :]  # (vocab_size,)
probabilities = torch.softmax(last_logits, dim=-1)
predicted_id = probabilities.argmax().item()
confidence = probabilities[predicted_id].item()

print(f"8. Prediction for next token:")
print(f"   Most likely token ID: {predicted_id}")
print(f"   Confidence: {confidence:.4f}")
print(f"   (Model is untrained, so this is essentially random)")

Expected output:

Model created with 660,992 parameters

1. Input token IDs:        torch.Size([1, 4])
   Values: [[42, 15, 87, 203]]

2. After token embedding:  torch.Size([1, 4, 128])
   Each token ID → 128-dimensional vector

3. Position embedding:     torch.Size([4, 128])
   Positions [0, 1, 2, 3] → 128-dim vectors

4. Combined (tok + pos):   torch.Size([1, 4, 128])

5.0 After transformer block 0: torch.Size([1, 4, 128])
5.1 After transformer block 1: torch.Size([1, 4, 128])

6. After final LayerNorm:  torch.Size([1, 4, 128])

7. Output logits:          torch.Size([1, 4, 1000])
   = scores for 1000 vocab words at each of 4 positions

8. Prediction for next token:
   Most likely token ID: 547
   Confidence: 0.0032
   (Model is untrained, so this is essentially random)

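The confidence is on the order of $1/1000 = 0.001$: the model is spreading its guess almost evenly across 1000 tokens. Your token ID and confidence will differ from run to run because the weights are randomly initialized. After training, the model will concentrate far more probability on tokens that fit the context.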
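
A related sanity check: if the model were exactly uniform over the vocabulary, each token would get probability 1/vocab_size, and the cross-entropy loss would be the natural log of vocab_size. The sketch below computes that baseline. It assumes cross-entropy as the training loss, as mentioned earlier. An untrained model is only approximately uniform, so its starting loss should land near this value rather than exactly on it.

```python
import math

vocab_size = 1000
uniform_prob = 1 / vocab_size
baseline_loss = -math.log(uniform_prob)  # cross-entropy of a uniform guess

print(f"{uniform_prob:.4f}")   # 0.0010
print(f"{baseline_loss:.3f}")  # 6.908
```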

What Each Position Predicts

An important detail: the model makes a prediction at every position, not just the last one. Position 0 predicts what comes after the first token, position 1 predicts what comes after the first two tokens, and so on:

Position	Sees tokens	Predicts
0	“The”	What comes after “The”
1	“The cat”	What comes after “The cat”
2	“The cat sat”	What comes after “The cat sat”
3	“The cat sat on”	What comes after “The cat sat on”

The causal mask ensures that each position can only attend to itself and earlier positions — never to future tokens. This is what makes the model autoregressive: it generates text one token at a time, left to right.
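
The effect of the causal mask can be checked directly on a small attention matrix. This is a minimal single-head sketch with random queries and keys. The sizes are arbitrary, and it leaves out the batch and head dimensions used in the real module.

```python
import torch

torch.manual_seed(0)
seq_len, head_dim = 4, 8
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)

scores = q @ k.T / head_dim ** 0.5                # (seq_len, seq_len)
mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = allowed, 0 = future
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)
# Every entry above the diagonal is exactly 0, and each row still sums to 1.
# Row 0 can only attend to itself, so its single weight is 1.
```

Because exp(-inf) is 0, masked positions contribute nothing to the softmax denominator, so the remaining weights are renormalized over the visible tokens only.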

9. Exercises

Exercise 1: Scale Up the Model

Create a GPTConfig with GPT-2 Small dimensions (d_model=768, n_heads=12, n_layers=12, d_ff=3072, vocab_size=50257, max_seq_len=1024). Count the parameters. Does it match ~124M?

Solution

config = GPTConfig(
    vocab_size=50257,
    d_model=768,
    n_heads=12,
    n_layers=12,
    d_ff=3072,
    max_seq_len=1024,
    dropout=0.1,
)

model = GPTModel(config)
total = count_parameters(model)
print(f"Total parameters: {total:,}")

# Break it down
print(f"\nToken embedding:    {config.vocab_size * config.d_model:,}")
print(f"Position embedding: {config.max_seq_len * config.d_model:,}")

# Per block
qkv = config.d_model * 3 * config.d_model + 3 * config.d_model
out = config.d_model * config.d_model + config.d_model
ln = 2 * (2 * config.d_model)
ffn1 = config.d_model * config.d_ff + config.d_ff
ffn2 = config.d_ff * config.d_model + config.d_model
block_total = qkv + out + ln + ffn1 + ffn2
print(f"Per block:          {block_total:,}")
print(f"All {config.n_layers} blocks:       {block_total * config.n_layers:,}")
print(f"Final norm:         {2 * config.d_model:,}")
print(f"Output proj:        {config.d_model * config.vocab_size:,}")

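# With this model the total is 163,037,184, not ~124M.
# GPT-2's published ~124M figure counts the output projection as shared
# with the token embedding (weight tying, Exercise 4), which removes
# 50257 * 768 = 38,597,376 parameters and leaves 124,439,808.

You should get about 163 million parameters. The dimensions match GPT-2 Small, but this model keeps a separate output projection; GPT-2 reuses the token-embedding matrix there, which is where its ~124M figure comes from. The only difference from our tiny model is the numbers in the config.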

Exercise 2: Visualize Logit Distribution

After a forward pass, take the logits at the last position and create a histogram. Then apply softmax and create another histogram. Compare the two distributions.

Solution

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Create and run the model
config = GPTConfig.tiny() if hasattr(GPTConfig, 'tiny') else GPTConfig()
model = GPTModel(config)
model.eval()

token_ids = torch.tensor([[42, 15, 87, 203, 11, 55]])
with torch.no_grad():
    logits = model(token_ids)

# Get logits and probabilities for the last position
last_logits = logits[0, -1, :].numpy()
last_probs = torch.softmax(logits[0, -1, :], dim=-1).numpy()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logits histogram
axes[0].hist(last_logits, bins=50, color='steelblue', edgecolor='black')
axes[0].set_title("Raw Logits (before softmax)", fontsize=14)
axes[0].set_xlabel("Logit value")
axes[0].set_ylabel("Count")

# Probability histogram
axes[1].hist(last_probs, bins=50, color='coral', edgecolor='black')
axes[1].set_title("Probabilities (after softmax)", fontsize=14)
axes[1].set_xlabel("Probability")
axes[1].set_ylabel("Count")

plt.tight_layout()
plt.savefig("logits_vs_probs.png", dpi=150)
print("Plot saved to logits_vs_probs.png")
plt.close()

print(f"Logits range: [{last_logits.min():.3f}, {last_logits.max():.3f}]")
print(f"Probabilities range: [{last_probs.min():.6f}, {last_probs.max():.6f}]")
print(f"Probabilities sum: {last_probs.sum():.6f}")

What you’ll observe: The logits form a roughly bell-shaped distribution centered near 0 (since the model is untrained with randomly initialized weights). After softmax, the probabilities are all clustered near $1/\text{vocab_size}$ — a nearly uniform distribution. After training, the probability histogram would show a few tall spikes (high-probability words) and many values near zero.

Exercise 3: Add a Method to Print Shape at Each Layer

Add a forward_with_shapes() method to the GPTModel class that prints the tensor shape after every layer automatically. Then use it to trace a forward pass.

Solution

class GPTModelWithTracing(GPTModel):
    """GPTModel subclass that prints shapes at every layer."""

    def forward_with_shapes(self, token_ids):
        batch_size, seq_len = token_ids.shape
        print(f"Input:                 {token_ids.shape}")

        tok_emb = self.token_embedding(token_ids)
        print(f"Token embedding:       {tok_emb.shape}")

        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.position_embedding(positions)
        print(f"Position embedding:    {pos_emb.shape}")

        x = self.dropout(tok_emb + pos_emb)
        print(f"Combined + dropout:    {x.shape}")

        causal_mask = torch.tril(
            torch.ones(seq_len, seq_len, device=token_ids.device)
        )

        for i, block in enumerate(self.blocks):
            x = block(x, mask=causal_mask)
            print(f"After block {i}:         {x.shape}")

        x = self.final_norm(x)
        print(f"After final norm:      {x.shape}")

        logits = self.output_proj(x)
        print(f"Output logits:         {logits.shape}")

        return logits


# Test it
config = GPTConfig()
tracing_model = GPTModelWithTracing(config)
tracing_model.eval()

token_ids = torch.tensor([[1, 2, 3, 4, 5]])
with torch.no_grad():
    logits = tracing_model.forward_with_shapes(token_ids)

Output:

Input:                 torch.Size([1, 5])
Token embedding:       torch.Size([1, 5, 128])
Position embedding:    torch.Size([5, 128])
Combined + dropout:    torch.Size([1, 5, 128])
After block 0:         torch.Size([1, 5, 128])
After block 1:         torch.Size([1, 5, 128])
After final norm:      torch.Size([1, 5, 128])
Output logits:         torch.Size([1, 5, 1000])

This is a useful debugging tool. When you modify the architecture and something breaks, forward_with_shapes() will immediately show you where the tensor shape went wrong.
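
If you would rather not duplicate the forward logic, PyTorch's forward hooks offer another route to the same information. The sketch below is an alternative, not the chapter's method: trace_shapes and toy_model are hypothetical names, and the toy model is a small nn.Sequential stand-in so the example runs without GPTModel. It relies on nn.Module.register_forward_hook, which calls hook(module, inputs, output) after each module's forward pass.

```python
import torch
import torch.nn as nn


def trace_shapes(model, example_input):
    """Run one forward pass and return (child name, output shape) pairs."""
    records = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Record only plain tensor outputs; skip tuples and other types.
            if isinstance(output, torch.Tensor):
                records.append((name, tuple(output.shape)))
        return hook

    # named_children() yields only the direct submodules of the model.
    for name, child in model.named_children():
        handles.append(child.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(example_input)
    finally:
        # Always detach the hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return records


# Hypothetical stand-in model (not the chapter's GPTModel).
toy_model = nn.Sequential(
    nn.Embedding(1000, 128),
    nn.Linear(128, 128),
    nn.LayerNorm(128),
    nn.Linear(128, 1000, bias=False),
)
toy_model.eval()

shape_records = trace_shapes(toy_model, torch.tensor([[1, 2, 3, 4, 5]]))
for name, shape in shape_records:
    print(f"{name:>3}: {shape}")
```

One caveat if you try this on the chapter's model: an nn.ModuleList is a container that is never called itself, so a hook registered on self.blocks would not fire. You would register hooks on each block inside the list instead.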

Exercise 4: Weight Tying

In many language models, the token embedding matrix and the output projection matrix share the same weights — this is called weight tying. Modify the GPTModel so that output_proj.weight points to the same tensor as token_embedding.weight. How does this change the parameter count?

Solution

class GPTModelTied(nn.Module):
    """GPT model with weight tying between input and output embeddings."""

    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        self.token_embedding = nn.Embedding(config.vocab_size, config.d_model)
        self.position_embedding = nn.Embedding(config.max_seq_len, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

        self.blocks = nn.ModuleList([
            TransformerBlock(config.d_model, config.n_heads,
                             config.d_ff, config.dropout)
            for _ in range(config.n_layers)
        ])

        self.final_norm = nn.LayerNorm(config.d_model)
        self.output_proj = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Weight tying: share embedding weights with output projection
        self.output_proj.weight = self.token_embedding.weight

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        tok_emb = self.token_embedding(token_ids)
        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.position_embedding(positions)
        x = self.dropout(tok_emb + pos_emb)

        causal_mask = torch.tril(
            torch.ones(seq_len, seq_len, device=token_ids.device)
        )

        for block in self.blocks:
            x = block(x, mask=causal_mask)

        x = self.final_norm(x)
        logits = self.output_proj(x)
        return logits


# Compare parameter counts
config = GPTConfig()

model_no_tying = GPTModel(config)
model_with_tying = GPTModelTied(config)

params_no_tying = count_parameters(model_no_tying)
params_with_tying = count_parameters(model_with_tying)

print(f"Without weight tying: {params_no_tying:,} parameters")
print(f"With weight tying:    {params_with_tying:,} parameters")
print(f"Saved:                {params_no_tying - params_with_tying:,} parameters")
# Saved: vocab_size × d_model = 1000 × 128 = 128,000 parameters

Weight tying saves vocab_size × d_model parameters. For our tiny model that’s 128K out of 661K (~19%). For GPT-2 with vocab_size=50257 and d_model=768, that’s 38.6M parameters saved — significant! The intuition is that the same “understanding” of words should be used both when reading input tokens and when predicting output tokens.
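
Two details make the tying line work, and both are easy to check in isolation. First, nn.Linear(d_model, vocab_size) stores its weight with shape (vocab_size, d_model), the same shape as the embedding table, so no transpose is needed. Second, model.parameters() yields a shared parameter only once, which is why the count drops (assuming count_parameters, defined earlier in the chapter, sums over model.parameters()). The sketch below uses a hypothetical stand-in module, TinyTiedHead, containing only the two tied layers.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 128  # same sizes as the chapter's tiny config


class TinyTiedHead(nn.Module):
    """Hypothetical stand-in: an embedding plus a tied output projection."""

    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
        # nn.Linear stores weight as (out_features, in_features), which is
        # (vocab_size, d_model): the same shape as the embedding table.
        self.output_proj.weight = self.token_embedding.weight

    def forward(self, token_ids):
        return self.output_proj(self.token_embedding(token_ids))


tied = TinyTiedHead(vocab_size, d_model)

same_object = tied.output_proj.weight is tied.token_embedding.weight
same_storage = (tied.output_proj.weight.data_ptr()
                == tied.token_embedding.weight.data_ptr())

# parameters() yields each shared Parameter once, so nothing is double-counted.
n_params = sum(p.numel() for p in tied.parameters())

print(f"Same Parameter object: {same_object}")
print(f"Same storage:          {same_storage}")
print(f"Parameters counted:    {n_params:,}")

# An in-place change made through one layer is visible through the other.
with torch.no_grad():
    tied.token_embedding.weight[0, 0] = 3.5
print(f"output_proj.weight[0, 0] = {tied.output_proj.weight[0, 0].item()}")

logits = tied(torch.tensor([[1, 2, 3]]))
print(f"Logits shape: {tuple(logits.shape)}")
```

Because the two layers hold one tensor, gradients from both the embedding lookup and the output projection accumulate into the same weights during training.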

Summary

In this chapter, we assembled every component from the previous chapters into a complete GPT-style language model:

TransformerBlock wraps attention + feed-forward + layer normalization into a single module with matching input/output shapes (batch_size, seq_len, d_model).
Stacking is possible because every block preserves the tensor shape. We use nn.ModuleList to stack N identical blocks, each with its own learned weights.
The full GPT model combines token embeddings, learned positional embeddings, N transformer blocks, a final LayerNorm, and an output projection into one nn.Module.
Logits are the model’s raw output: unnormalized scores for every token in the vocabulary at every position. Apply softmax over the vocabulary dimension to convert them to probabilities.
Parameter counting reveals that our tiny 661K-parameter model has the exact same architecture as GPT-2’s 124M-parameter model — only the dimensions differ.
Configuration dataclasses keep hyperparameters organized and make scaling the model trivial.
A complete forward pass transforms token IDs (batch_size, seq_len) into logits (batch_size, seq_len, vocab_size) through seven clearly defined steps.

The model is built. It can take in tokens and produce predictions. But right now, those predictions are random nonsense — the weights are initialized randomly and the model has never seen real text.

In the next chapter, we’ll fix that. Chapter 7 introduces the training loop: how to show the model millions of sentences, measure how wrong its predictions are (the loss function), and adjust its 661K parameters to make better predictions. That’s where the model goes from random noise to something that actually understands language.