Building Large Language Models from Scratch: A Beginner's Guide with Python and PyTorch

Embeddings — Giving Words Meaning in Numbers

33 min read Chapter 4 of 11
Summary

This chapter explains why raw token IDs are insufficient for neural networks and introduces dense embeddings as learned numerical representations that capture word meaning. Starting with one-hot encoding and its limitations, it progresses to dense embedding vectors, embedding lookup tables implemented from scratch and with PyTorch nn.Embedding, and positional encodings that preserve word order information. Sinusoidal positional encodings are derived and implemented. The chapter concludes with embedding visualization showing semantic clustering.

In the previous chapter, you learned how to turn raw text into token IDs — integers that represent each word or subword in a vocabulary. You might have a vocabulary where “cat” is token 42, “dog” is token 87, and “elephant” is token 1534.

But here’s the problem: those numbers are completely arbitrary. The number 42 doesn’t mean anything about cats. The number 87 doesn’t encode anything about dogs. And the fact that 87 is closer to 42 than to 1534 tells us nothing about whether dogs are more similar to cats than to elephants.

Neural networks operate on numbers. If we feed them meaningless numbers, they’ll struggle to learn meaningful patterns. This chapter is about solving that problem — giving every word a numerical representation that actually means something.

By the end of this chapter, you will:

  • Understand why raw token IDs are insufficient for neural networks
  • Implement one-hot encoding and understand why it fails at scale
  • Build dense embedding lookup tables from scratch and with PyTorch
  • Understand how embeddings learn meaning through training
  • Implement positional encodings that capture word order
  • Visualize embeddings to see semantic clustering in action

Let’s begin.


1. Why Embeddings?

The Map Analogy

Imagine you’re given a list of cities and asked to say which ones are “similar.” Someone hands you this list:

City      ID Number
Paris     7
Tokyo     23
Lyon      142
Osaka     89
Berlin    51

Can you tell which cities are similar from those ID numbers? Of course not. The numbers are just arbitrary labels.

Now imagine instead you’re given each city’s coordinates — latitude and longitude:

City      Latitude   Longitude
Paris     48.86      2.35
Tokyo     35.68      139.69
Lyon      45.76      4.83
Osaka     34.69      135.50
Berlin    52.52      13.40

Now similarity is obvious. Paris and Lyon have nearby coordinates — they’re both in France. Tokyo and Osaka are close — both in Japan. The numbers themselves encode meaningful relationships.

That’s exactly what embeddings do for words. Instead of a single arbitrary ID, each word gets a list of numbers (a vector) that places it on a “map” where similar words sit near each other. Words like “king” and “queen” end up close together. Words like “cat” and “dog” end up in the same neighborhood. Words like “cat” and “democracy” end up far apart.
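To make the map analogy concrete, here is a minimal sketch that finds each city's nearest neighbor from the coordinates in the table above. It treats latitude and longitude as plain 2D coordinates, which is a simplifying assumption for illustration (real geographic distance needs a great-circle formula), and the `nearest` helper is a hypothetical name, not part of any library.

```python
import math

# Coordinates from the table above: (latitude, longitude)
cities = {
    "Paris": (48.86, 2.35),
    "Tokyo": (35.68, 139.69),
    "Lyon": (45.76, 4.83),
    "Osaka": (34.69, 135.50),
    "Berlin": (52.52, 13.40),
}

def nearest(name):
    """Return the other city with the smallest straight-line distance in coordinate space."""
    here = cities[name]
    others = [c for c in cities if c != name]
    return min(others, key=lambda c: math.dist(here, cities[c]))

for city in cities:
    print(f"{city:>6} -> nearest: {nearest(city)}")
```

Paris and Lyon pick each other, as do Tokyo and Osaka. Finding a word's nearest neighbors in embedding space works the same way, just with more dimensions.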

Why Can’t Neural Networks Just Use Token IDs?

You might wonder: “The neural network is going to learn patterns anyway. Why can’t it just figure out that token 42 means cat?”

There are three problems:

Problem 1: Magnitude is meaningless. If “cat” is 42 and “dog” is 87, a neural network performing math on these numbers would treat “dog” as roughly twice as important or large as “cat.” That’s nonsensical. The numbers were assigned in alphabetical or frequency order — there’s no meaning to their relative sizes.

Problem 2: Distance is meaningless. The “distance” between 42 and 87 is 45, while the distance between 42 and 1534 is 1492. But cats are arguably more similar to elephants (both are animals) than those distances suggest, and the closeness of 42 to 87 says nothing about cat–dog similarity.

Problem 3: No room for nuance. A single number can only encode one dimension of meaning. But words are rich — “bank” relates to “river” and to “money.” A single number can’t capture both relationships. We need multiple numbers per word.


2. One-Hot Encoding — The Naive Approach

The simplest way to turn token IDs into vectors is one-hot encoding: create a vector as long as your vocabulary, put a 1 at the position corresponding to the token ID, and fill the rest with 0s.

Implementing One-Hot Encoding

import torch

# Our tiny vocabulary
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5, "ran": 6}
vocab_size = len(vocab)  # 7

def one_hot_encode(token_id, vocab_size):
    """Create a one-hot vector for a single token."""
    vector = torch.zeros(vocab_size)
    vector[token_id] = 1.0
    return vector

# Encode some words
cat_vector = one_hot_encode(vocab["cat"], vocab_size)
dog_vector = one_hot_encode(vocab["dog"], vocab_size)
mat_vector = one_hot_encode(vocab["mat"], vocab_size)

print(f"'cat' one-hot: {cat_vector}")
# 'cat' one-hot: tensor([0., 1., 0., 0., 0., 0., 0.])

print(f"'dog' one-hot: {dog_vector}")
# 'dog' one-hot: tensor([0., 0., 0., 0., 0., 1., 0.])

print(f"'mat' one-hot: {mat_vector}")
# 'mat' one-hot: tensor([0., 0., 0., 0., 1., 0., 0.])

Let’s encode an entire sentence:

def encode_sentence(sentence, vocab):
    """One-hot encode every word in a sentence."""
    words = sentence.lower().split()
    vocab_size = len(vocab)
    encoded = torch.zeros(len(words), vocab_size)
    for i, word in enumerate(words):
        if word in vocab:
            encoded[i][vocab[word]] = 1.0
    return encoded

sentence = "the cat sat on the mat"
encoded = encode_sentence(sentence, vocab)
print(f"Sentence: '{sentence}'")
print(f"Shape: {encoded.shape}")  # torch.Size([6, 7])
print(f"Encoded:\n{encoded}")

Output:

Sentence: 'the cat sat on the mat'
Shape: torch.Size([6, 7])
Encoded:
tensor([[1., 0., 0., 0., 0., 0., 0.],   # the
        [0., 1., 0., 0., 0., 0., 0.],   # cat
        [0., 0., 1., 0., 0., 0., 0.],   # sat
        [0., 0., 0., 1., 0., 0., 0.],   # on
        [1., 0., 0., 0., 0., 0., 0.],   # the (same as first)
        [0., 0., 0., 0., 1., 0., 0.]])  # mat
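PyTorch also provides `torch.nn.functional.one_hot`, which builds the same matrix without an explicit loop. The sketch below assumes the same toy vocabulary as above and that every word in the sentence is in it. Note that `one_hot` returns an integer tensor, so it is cast to float here.

```python
import torch
import torch.nn.functional as F

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5, "ran": 6}

# Token IDs for "the cat sat on the mat"
token_ids = torch.tensor([vocab[w] for w in "the cat sat on the mat".split()])

# F.one_hot returns an integer (int64) tensor, so cast to float before
# using it in floating-point math.
encoded = F.one_hot(token_ids, num_classes=len(vocab)).float()

print(encoded.shape)        # torch.Size([6, 7])
print(encoded.sum(dim=1))   # every row contains exactly one 1
```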

Why One-Hot Encoding Fails

One-hot encoding seems clean, but it has fatal flaws:

Flaw 1: Astronomical size. Real vocabularies have 30,000–100,000 tokens. Each one-hot vector would be 50,000+ numbers long, with just a single 1 and the rest 0s. That’s incredibly wasteful. A sentence of 512 tokens would become a matrix of shape (512, 50000) — 25.6 million numbers, almost all zeros.
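The arithmetic behind that claim is easy to check. The sketch below assumes 32-bit floats (4 bytes each) and, for comparison, a hypothetical dense representation with 256 numbers per token, which is the kind of representation the next section introduces.

```python
seq_len = 512
vocab_size = 50_000
embed_dim = 256            # assumed dense vector size, for comparison only
bytes_per_float32 = 4

one_hot_values = seq_len * vocab_size
dense_values = seq_len * embed_dim

print(f"One-hot values: {one_hot_values:,}")                                 # 25,600,000
print(f"One-hot memory: {one_hot_values * bytes_per_float32 / 1e6:.1f} MB")  # 102.4 MB
print(f"Dense values:   {dense_values:,}")                                   # 131,072
print(f"Dense memory:   {dense_values * bytes_per_float32 / 1e6:.2f} MB")    # 0.52 MB
```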

Flaw 2: No similarity information. Let’s check the similarity between “cat” and “dog” versus “cat” and “mat”:

# Cosine similarity: measures how similar two vectors are
# 1.0 = identical direction, 0.0 = completely unrelated

def cosine_similarity(a, b):
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

cat_dog_sim = cosine_similarity(cat_vector, dog_vector)
cat_mat_sim = cosine_similarity(cat_vector, mat_vector)

print(f"Similarity(cat, dog) = {cat_dog_sim.item():.4f}")  # 0.0000
print(f"Similarity(cat, mat) = {cat_mat_sim.item():.4f}")  # 0.0000

Both similarities are exactly zero. In one-hot encoding, every word is equally different from every other word. “Cat” is just as different from “dog” as it is from “democracy” or “photosynthesis.” That’s clearly wrong — we want the representation to know that cats and dogs are both animals.

Flaw 3: No learning possible from the encoding itself. One-hot vectors are fixed. They don’t change during training. The network has to learn from scratch that position 1 (cat) and position 5 (dog) are related, with no help from the representation itself.

We need something better.


3. Dense Embeddings — The Solution

Instead of a sparse vector of 50,000 numbers (with one 1 and the rest 0s), what if we represented each word with a short, dense vector — say, 4 numbers — where every number carries meaning?

"cat"  → [0.23, -0.87,  0.45,  0.12]
"dog"  → [0.25, -0.82,  0.41,  0.15]
"mat"  → [-0.56, 0.33, -0.21,  0.78]

Now “cat” and “dog” have similar numbers — their vectors are close together in 4-dimensional space. “Mat” has very different numbers — it’s far away. The representation itself encodes meaning.
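We can verify this with the same cosine similarity measure used for the one-hot vectors. The three vectors below are the hand-picked illustrative values from the text, not learned embeddings.

```python
import torch
import torch.nn.functional as F

# Illustrative (hand-picked, not learned) vectors from the text above
cat = torch.tensor([0.23, -0.87, 0.45, 0.12])
dog = torch.tensor([0.25, -0.82, 0.41, 0.15])
mat = torch.tensor([-0.56, 0.33, -0.21, 0.78])

cat_dog = F.cosine_similarity(cat, dog, dim=0).item()
cat_mat = F.cosine_similarity(cat, mat, dim=0).item()

print(f"Similarity(cat, dog) = {cat_dog:.4f}")  # very close to 1
print(f"Similarity(cat, mat) = {cat_mat:.4f}")  # negative
```

Unlike one-hot vectors, where every pair scored exactly zero, dense vectors can express degrees of similarity.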

These short, dense vectors are called embeddings, and the number of dimensions (4 in this example) is called the embedding dimension or d_model. Small and mid-sized transformers commonly use values such as 256, 512, or 768, and the largest models use several thousand.

Building an Embedding Table from Scratch

At its simplest, an embedding is just a lookup table: a big matrix where each row is the embedding vector for one token ID.

import numpy as np

# Vocabulary
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5, "ran": 6}
vocab_size = len(vocab)    # 7 tokens
embed_dim = 4              # 4 numbers per word

# Create the embedding table: random numbers to start
# Shape: (vocab_size, embed_dim) = (7, 4)
np.random.seed(42)
embedding_table = np.random.randn(vocab_size, embed_dim).astype(np.float32)

print("Embedding table (our lookup dictionary):")
print(f"Shape: ({vocab_size}, {embed_dim})")
print()
for word, idx in vocab.items():
    print(f"  '{word}' (ID {idx}) → {embedding_table[idx]}")

Output (values rounded to four decimals for readability):

Embedding table (our lookup dictionary):
Shape: (7, 4)

  'the' (ID 0) → [ 0.4967 -0.1383  0.6477  1.5230]
  'cat' (ID 1) → [-0.2342 -0.2341  1.5792  0.7674]
  'sat' (ID 2) → [-0.4695  0.5426 -0.4634 -0.4657]
  'on'  (ID 3) → [ 0.2420 -1.9133 -1.7249 -0.5623]
  'mat' (ID 4) → [-1.0128  0.3142 -0.9080 -1.4123]
  'dog' (ID 5) → [ 1.4656 -0.2258  0.0675 -1.4247]
  'ran' (ID 6) → [-0.5444  0.1109 -1.1510  0.3757]

To look up a word’s embedding, we just grab its row:

def lookup_embedding(token_id, embedding_table):
    """Look up the embedding for a single token."""
    return embedding_table[token_id]

def embed_sentence(sentence, vocab, embedding_table):
    """Look up embeddings for every word in a sentence."""
    words = sentence.lower().split()
    ids = [vocab[w] for w in words if w in vocab]
    return np.array([embedding_table[i] for i in ids])

# Embed a sentence
sentence = "the cat sat on the mat"
embedded = embed_sentence(sentence, vocab, embedding_table)
print(f"\nSentence: '{sentence}'")
print(f"Shape: {embedded.shape}")  # (6, 4) — 6 words, 4 dimensions each
print(f"\nEmbedded representations:")
for word, vec in zip(sentence.split(), embedded):
    print(f"  '{word}' → {vec}")

That’s all an embedding layer does — it’s a lookup table that converts integer IDs into dense vectors. The magic is that these vectors are learnable: during training, the network adjusts them so that useful patterns emerge.

PyTorch nn.Embedding

PyTorch provides a built-in embedding layer that does exactly what we just built — but faster, on GPU, and with automatic gradient tracking:

import torch
import torch.nn as nn

vocab_size = 7
embed_dim = 4

# Create an embedding layer
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

print(f"Embedding layer: {embedding_layer}")
print(f"Weight matrix shape: {embedding_layer.weight.shape}")  # torch.Size([7, 4])
print(f"\nThe weight matrix IS the embedding table:")
print(embedding_layer.weight.data)

Using it is simple — feed in token IDs, get back embedding vectors:

# Single token
token_id = torch.tensor(1)  # "cat"
cat_embedding = embedding_layer(token_id)
print(f"\n'cat' embedding: {cat_embedding}")
print(f"Shape: {cat_embedding.shape}")  # torch.Size([4])

# A sequence of token IDs (one sentence)
token_ids = torch.tensor([0, 1, 2, 3, 0, 4])  # "the cat sat on the mat"
sentence_embeddings = embedding_layer(token_ids)
print(f"\nSentence embeddings shape: {sentence_embeddings.shape}")
# torch.Size([6, 4]) — 6 tokens, each with 4-dimensional embedding
print(f"\nSentence embeddings:\n{sentence_embeddings}")

Notice: embedding_layer.weight is a (vocab_size, embed_dim) matrix, and embedding_layer(token_id) simply grabs row token_id from that matrix. It’s literally a lookup, not a computation.
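One way to convince yourself of this is to compare the lookup with the "long way": multiplying one-hot vectors by the weight matrix. The sketch below uses a fixed seed only so the run is repeatable. Both routes should give the same result, but the lookup never builds the mostly-zero one-hot matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 7, 4
embedding_layer = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([0, 1, 2, 3, 0, 4])  # "the cat sat on the mat"

with torch.no_grad():
    looked_up = embedding_layer(token_ids)                           # (6, 4)
    one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()   # (6, 7)
    multiplied = one_hot @ embedding_layer.weight                    # (6, 7) @ (7, 4) -> (6, 4)

print(torch.allclose(looked_up, multiplied))  # True
```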

Shape Tracking

Let’s be precise about shapes at every step, because shape tracking will save you hours of debugging later:

vocab_size = 10000   # realistic vocabulary size
embed_dim = 256      # realistic embedding dimension
seq_len = 32         # sentence length (number of tokens)
batch_size = 8       # processing 8 sentences at once

embedding = nn.Embedding(vocab_size, embed_dim)

# Input: batch of token ID sequences
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
print(f"Input shape:  {input_ids.shape}")    # torch.Size([8, 32])
print(f"  → {batch_size} sentences, each {seq_len} tokens long")
print(f"  → Each value is an integer in [0, {vocab_size})")

# Output: batch of embedding sequences
output = embedding(input_ids)
print(f"\nOutput shape: {output.shape}")     # torch.Size([8, 32, 256])
print(f"  → {batch_size} sentences")
print(f"  → Each sentence has {seq_len} tokens")
print(f"  → Each token is now a {embed_dim}-dimensional vector")

The key transformation: integers → dense vectors. The input is (batch_size, seq_len) integers. The output is (batch_size, seq_len, embed_dim) floats. Each integer got “expanded” into a vector of embed_dim numbers.
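It is also worth tracking how many learnable numbers the table itself holds. A quick count, assuming the same sizes as above and 4 bytes per float32 value:

```python
import torch.nn as nn

vocab_size = 10000
embed_dim = 256
embedding = nn.Embedding(vocab_size, embed_dim)

# The only parameter is the (vocab_size, embed_dim) weight matrix
num_params = sum(p.numel() for p in embedding.parameters())
print(f"{num_params:,} learnable values")            # 2,560,000
print(f"{num_params * 4 / 1e6:.2f} MB as float32")   # 10.24 MB
```

The table grows linearly with both vocabulary size and embedding dimension, which is why the embedding layer can account for a noticeable share of a small model's parameters.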


4. How Embeddings Learn

You might be wondering: “Those embedding vectors started as random numbers. How do they end up encoding meaning?”

The answer: through training. Just like the weights in a neural network (which we covered in Chapter 2), embedding vectors are parameters that get adjusted by backpropagation. As the model trains on text data, it discovers that certain patterns work better than others, and the embeddings shift accordingly.
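A small experiment shows what "adjusted by backpropagation" means for a lookup table: only the rows that were actually looked up receive a gradient. The loss below is a toy stand-in chosen purely to produce gradients, not a real training objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=7, embedding_dim=4)

token_ids = torch.tensor([1, 5])       # pretend these are "cat" and "dog"
loss = embedding(token_ids).sum()      # toy loss, only here to produce gradients
loss.backward()

# Rows for tokens that were looked up have nonzero gradient; all others stay zero.
rows_with_grad = embedding.weight.grad.abs().sum(dim=1) > 0
print(rows_with_grad.tolist())
# [False, True, False, False, False, True, False]
```

So each training example nudges only the vectors of the tokens it contains.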

The Intuition

Imagine you’re training a model to predict the next word. If the model sees:

  • “The king sat on the ___” → “throne”
  • “The queen sat on the ___” → “throne”

Both “king” and “queen” predict “throne” in this context. The gradient updates will push their embedding vectors in similar directions — toward whatever configuration helps predict “throne.” Over millions of such examples, “king” and “queen” end up with similar embeddings because they appear in similar contexts.

This is the distributional hypothesis in linguistics: words that appear in similar contexts tend to have similar meanings. Embeddings learn this automatically.

A Tiny Training Example

Let’s see this in action with a minimal example. We’ll create a tiny dataset where “king” and “queen” appear in similar contexts, and watch their embeddings converge:

import torch
import torch.nn as nn
import torch.optim as optim

# Tiny vocabulary
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3,
         "throne": 4, "crown": 5, "sword": 6, "castle": 7}
vocab_size = len(vocab)
embed_dim = 8

# Training data: (context_word, target_word) pairs
# "king" and "queen" share similar context words
training_pairs = [
    # king appears near throne, crown, castle
    (vocab["king"], vocab["throne"]),
    (vocab["king"], vocab["crown"]),
    (vocab["king"], vocab["castle"]),
    # queen appears near throne, crown, castle (same contexts!)
    (vocab["queen"], vocab["throne"]),
    (vocab["queen"], vocab["crown"]),
    (vocab["queen"], vocab["castle"]),
    # man appears near sword, castle
    (vocab["man"], vocab["sword"]),
    (vocab["man"], vocab["castle"]),
    # woman appears near crown, castle
    (vocab["woman"], vocab["crown"]),
    (vocab["woman"], vocab["castle"]),
]

# Simple model: embedding → predict context word
class SimpleEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        embeds = self.embeddings(x)           # (batch, embed_dim)
        logits = self.linear(embeds)          # (batch, vocab_size)
        return logits

# Create model and optimizer
model = SimpleEmbeddingModel(vocab_size, embed_dim)
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Check embeddings BEFORE training
with torch.no_grad():
    king_emb_before = model.embeddings.weight[vocab["king"]].clone()
    queen_emb_before = model.embeddings.weight[vocab["queen"]].clone()
    cos_sim_before = torch.nn.functional.cosine_similarity(
        king_emb_before.unsqueeze(0),
        queen_emb_before.unsqueeze(0)
    )
    print(f"Before training:")
    print(f"  king  embedding: {king_emb_before[:4].tolist()}...")
    print(f"  queen embedding: {queen_emb_before[:4].tolist()}...")
    print(f"  Cosine similarity(king, queen): {cos_sim_before.item():.4f}")

# Train
inputs = torch.tensor([pair[0] for pair in training_pairs])
targets = torch.tensor([pair[1] for pair in training_pairs])

for epoch in range(500):
    logits = model(inputs)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"  Epoch {epoch+1}, Loss: {loss.item():.4f}")

# Check embeddings AFTER training
with torch.no_grad():
    king_emb_after = model.embeddings.weight[vocab["king"]]
    queen_emb_after = model.embeddings.weight[vocab["queen"]]
    man_emb_after = model.embeddings.weight[vocab["man"]]

    sim_kq = torch.nn.functional.cosine_similarity(
        king_emb_after.unsqueeze(0), queen_emb_after.unsqueeze(0)
    )
    sim_km = torch.nn.functional.cosine_similarity(
        king_emb_after.unsqueeze(0), man_emb_after.unsqueeze(0)
    )

    print(f"\nAfter training:")
    print(f"  king  embedding: {king_emb_after[:4].tolist()}...")
    print(f"  queen embedding: {queen_emb_after[:4].tolist()}...")
    print(f"  Cosine similarity(king, queen): {sim_kq.item():.4f}")
    print(f"  Cosine similarity(king, man):   {sim_km.item():.4f}")

After training, you should typically see that “king” and “queen” are far more similar to each other than either is to “man”, because they share the same context words (throne, crown, castle). Exact numbers vary from run to run because the embeddings start from random values; call torch.manual_seed before creating the model if you want reproducible output.

Key insight: Nobody told the model that kings and queens are related. It discovered this on its own, just from seeing them in similar contexts. Modern language models learn word meaning in essentially this way, at a vastly larger scale.


5. Positional Encodings — Why Word Order Matters

We now have beautiful embedding vectors for each word. But there’s a critical piece missing: word order.

Consider these two sentences:

  • “The dog bites the man”
  • “The man bites the dog”

Both sentences contain exactly the same words with exactly the same embeddings. If we just add up or average the embeddings, we get identical representations for two sentences with completely different meanings. The model would have no idea who is doing the biting.
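The sketch below shows the summing case concretely. The four-word vocabulary and the `sum_of_embeddings` helper are made up for this demonstration, and the embeddings are random rather than trained; the result would be the same either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "dog": 1, "bites": 2, "man": 3}
embedding = nn.Embedding(len(vocab), 4)

def sum_of_embeddings(sentence):
    """Embed each word, then collapse the sentence into one vector by summing."""
    ids = torch.tensor([vocab[w] for w in sentence.lower().split()])
    return embedding(ids).sum(dim=0)

with torch.no_grad():
    a = sum_of_embeddings("The dog bites the man")
    b = sum_of_embeddings("The man bites the dog")

print(torch.allclose(a, b))  # True: the order information is gone
```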

In a recurrent neural network (RNN), word order is baked in because words are processed one at a time, in sequence. But transformers — the architecture we’re building toward — process all words simultaneously. They see all the token embeddings at once, in parallel. Without some additional signal, the transformer has no way of knowing that “dog” came first and “man” came last, or vice versa.

The solution: add a positional encoding to each token’s embedding that tells the model where in the sentence that token sits.

The Coordinate Analogy

Think of it this way. You have a row of students in a classroom, and you want to tell someone not just who is in the class but where each student is sitting.

The student’s name is like the token embedding — it tells you who they are. The seat number is like the positional encoding — it tells you where they are. You need both pieces of information.

In a transformer, we literally add the position information to the token embedding:

final_embedding = token_embedding + positional_encoding

The result is a single vector that encodes both what the word means and where it appears in the sequence.

Sinusoidal Positional Encoding

The original “Attention Is All You Need” paper (Vaswani et al., 2017) proposed a clever scheme using sine and cosine waves at different frequencies. The idea is elegant:

  • Each position gets a unique pattern of numbers.
  • Nearby positions have similar patterns (so the model can learn “this word is near that word”).
  • The patterns are deterministic — no learning required.

The formula uses alternating sine and cosine functions:

For position $pos$ and dimension-pair index $i$ (so $2i$ and $2i+1$ are the actual dimension indices):

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

Don’t panic about the math. Here’s the plain-English version:

  • Each dimension of the positional encoding oscillates like a wave.
  • Even dimensions (0, 2, 4, …) use sine waves.
  • Odd dimensions (1, 3, 5, …) use cosine waves.
  • Each pair of dimensions oscillates at a different frequency — dimension 0–1 oscillates rapidly, dimension 2–3 more slowly, and so on.

It’s like a clock: the seconds hand moves fast, the minutes hand moves slowly, and the hours hand moves very slowly. Together, they give a unique “time signature” for every moment. Similarly, the combination of fast and slow oscillations gives a unique pattern for every position.
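Before the full implementation, it can help to evaluate the formula by hand for a tiny case. The sketch below assumes d_model = 4 so that there are only two sine/cosine pairs; `pe_value` is a hypothetical helper that follows the formula directly.

```python
import math

def pe_value(pos, dim, d_model):
    """PE(pos, dim) from the formula above; `dim` is the actual dimension index."""
    pair = dim // 2                                   # the i in the formula
    angle = pos / (10000 ** (2 * pair / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 4
for pos in range(3):
    row = [round(pe_value(pos, dim, d_model), 4) for dim in range(d_model)]
    print(pos, row)
# Position 0 is always [0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1
```

At position 1, the first pair is already at sin(1) and cos(1), while the second pair has barely moved away from (0, 1). That is the fast hand and the slow hand of the clock.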

Implementing Positional Encoding from Scratch

import torch
import numpy as np

def positional_encoding_from_scratch(seq_len, d_model):
    """
    Create sinusoidal positional encodings.

    Args:
        seq_len: number of positions (length of sequence)
        d_model: dimension of each encoding (must match embedding dimension)

    Returns:
        Tensor of shape (seq_len, d_model)
    """
    pe = np.zeros((seq_len, d_model))

    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Here i is the even dimension index (the formula's 2i),
            # so the "wavelength" gets longer as i increases
            denominator = 10000 ** (i / d_model)

            pe[pos, i]     = np.sin(pos / denominator)   # even dimensions
            pe[pos, i + 1] = np.cos(pos / denominator)   # odd dimensions

    return torch.tensor(pe, dtype=torch.float32)

# Generate positional encodings for 10 positions, 8 dimensions
seq_len = 10
d_model = 8
pe = positional_encoding_from_scratch(seq_len, d_model)

print(f"Positional encoding shape: {pe.shape}")  # torch.Size([10, 8])
print(f"\nPosition 0: {pe[0].tolist()}")
print(f"Position 1: {pe[1].tolist()}")
print(f"Position 2: {pe[2].tolist()}")

Let’s see the alternating sin/cos pattern more clearly:

print("\nThe alternating sin/cos pattern:")
print(f"{'Pos':<5}", end="")
for i in range(d_model):
    func = "sin" if i % 2 == 0 else "cos"
    print(f"  dim{i}({func})", end="")
print()

for pos in range(4):
    print(f"{pos:<5}", end="")
    for i in range(d_model):
        print(f"  {pe[pos, i]:>9.4f}", end="")
    print()

Notice the pattern:

  • Dimensions 0–1 (fastest frequency): values change dramatically between adjacent positions.
  • Dimensions 6–7 (slowest frequency): values change very slowly between positions.
  • This multi-scale pattern means the model can detect both fine-grained (nearby) and coarse (far apart) positional relationships.
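To put numbers on "fast" and "slow", we can compute how many positions each sine/cosine pair needs to complete one full cycle. This is a small sketch derived from the formula, assuming d_model = 8 as in the example above.

```python
import math

d_model = 8
wavelengths = []
for pair in range(d_model // 2):
    # The angle for this pair is pos / 10000^(2*pair/d_model), so one full
    # cycle (2*pi) takes 2*pi * 10000^(2*pair/d_model) positions.
    wavelength = 2 * math.pi * 10000 ** (2 * pair / d_model)
    wavelengths.append(wavelength)
    print(f"dims {2 * pair}-{2 * pair + 1}: one cycle every {wavelength:,.1f} positions")
```

With these settings the four pairs repeat roughly every 6, 63, 628, and 6,283 positions, so the slowest pair does not come close to completing a cycle within a 10-token sequence.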

Efficient PyTorch Implementation

The nested loop above is clear but slow. Here’s the vectorized version:

import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()

        # Create the positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)

        # Position indices: shape (max_seq_len, 1)
        position = torch.arange(0, max_seq_len, dtype=torch.float32).unsqueeze(1)

        # Inverse of the denominator term: shape (d_model/2,)
        # Computed in log space: 1 / 10000^(2i/d) = exp(-2i * log(10000) / d)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * -(math.log(10000.0) / d_model)
        )

        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims

        # Add batch dimension: (1, max_seq_len, d_model)
        pe = pe.unsqueeze(0)

        # Register as buffer (not a parameter — doesn't get trained)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        Returns:
            x + positional encoding, same shape
        """
        seq_len = x.size(1)
        # Add positional encoding (broadcasts over batch dimension)
        return x + self.pe[:, :seq_len, :]

# Test it
d_model = 8
pos_encoder = PositionalEncoding(d_model, max_seq_len=100)

# Print a few positional encoding values
pe_values = pos_encoder.pe[0, :5, :]  # first 5 positions
print(f"Positional encoding shape: {pos_encoder.pe.shape}")
# torch.Size([1, 100, 8])

print(f"\nFirst 5 positions:")
for pos in range(5):
    values = [f"{v:.4f}" for v in pe_values[pos].tolist()]
    print(f"  Position {pos}: [{', '.join(values)}]")

Key design decision: the positional encoding is registered as a buffer, not a parameter. This means:

  • It gets saved with the model.
  • It moves to GPU when the model does.
  • But it does NOT get updated during training — the patterns are fixed.
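The difference between a buffer and a parameter can be checked directly. `BufferDemo` below is a throwaway module written only for this illustration, with one parameter and one buffer.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):   # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(3))    # trained by the optimizer
        self.register_buffer("pe", torch.ones(3))     # fixed, but saved with the model

demo = BufferDemo()
print([name for name, _ in demo.named_parameters()])  # ['weight']
print([name for name, _ in demo.named_buffers()])     # ['pe']
print(list(demo.state_dict().keys()))                 # ['weight', 'pe']
print(demo.pe.requires_grad)                          # False
```

Because an optimizer is built from `model.parameters()`, it never sees the buffer, yet `state_dict()` still includes it when the model is saved.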

Why Sinusoidal?

You might ask: “Why not just use [0, 1, 2, 3, …] as position numbers?”

The sine/cosine approach has several advantages:

  1. Bounded values: The outputs are always between -1 and 1, regardless of position. Simple integers would grow without bound, dominating the embedding values.

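  2. Relative positions: The sinusoidal pattern makes relative offsets easy to express. For any fixed offset $k$, the encoding at position $pos + k$ is a linear function of the encoding at position $pos$, and that function depends only on $k$. Moving from position 5 to position 8 therefore applies the same transformation as moving from position 100 to position 103. This follows from the angle-addition identities: $\sin(a + b)$ and $\cos(a + b)$ are linear combinations of $\sin(a)$ and $\cos(a)$, with coefficients that depend only on $b$.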

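  3. Generalization: In principle, the model can be given positions beyond any seen during training, because the sine/cosine functions are defined for every position. How well a trained model actually handles such longer sequences is a separate, empirical question.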
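The relative-position property can be checked numerically for a single sine/cosine pair. In the sketch below, `pe_pair` and `shift` are hypothetical helpers; `shift` uses only the offset k and the frequency, never the absolute position, and the frequency is the one the second pair would have with d_model = 8.

```python
import math

def pe_pair(pos, freq):
    """One (sin, cos) pair of the positional encoding at angular frequency `freq`."""
    return (math.sin(pos * freq), math.cos(pos * freq))

def shift(pair, k, freq):
    """Move a (sin, cos) pair forward by k positions using only k and freq."""
    s, c = pair
    return (s * math.cos(k * freq) + c * math.sin(k * freq),   # sin(a + b)
            c * math.cos(k * freq) - s * math.sin(k * freq))   # cos(a + b)

freq = 1 / 10000 ** (2 / 8)   # second pair when d_model = 8 (assumed for the demo)
k = 3
results = []
for pos in (5, 100):
    shifted = shift(pe_pair(pos, freq), k, freq)
    direct = pe_pair(pos + k, freq)
    ok = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(shifted, direct))
    results.append(ok)
    print(f"pos {pos} -> {pos + k}: matches direct computation? {ok}")
```

The same `shift` works from position 5 and from position 100, which is the sense in which the encoding supports relative positions.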


6. Combining Token + Position Embeddings

Now let’s put it all together. In a real transformer, the input pipeline looks like this:

tokens → Token Embedding → + Positional Encoding → Transformer Input

Let’s implement the complete flow:

import torch
import torch.nn as nn
import math

class TransformerEmbedding(nn.Module):
    """
    Combines token embeddings with positional encodings.
    This is what gets fed into the transformer.
    """
    def __init__(self, vocab_size, d_model, max_seq_len=5000):
        super().__init__()

        # Token embedding: converts token IDs to dense vectors
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding: adds position information
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_seq_len, d_model)
        self.register_buffer('pe', pe)

        # Stored so forward() can scale embeddings by sqrt(d_model), as in the original paper
        self.d_model = d_model

    def forward(self, token_ids):
        """
        Args:
            token_ids: (batch_size, seq_len) integer tensor
        Returns:
            (batch_size, seq_len, d_model) float tensor
        """
        seq_len = token_ids.size(1)

        # Step 1: Look up token embeddings
        tok_emb = self.token_embedding(token_ids)  # (batch, seq_len, d_model)
        print(f"  After token embedding: {tok_emb.shape}")

        # Step 2: Scale embeddings (helps with training stability)
        tok_emb = tok_emb * math.sqrt(self.d_model)
        print(f"  After scaling:         {tok_emb.shape}")

        # Step 3: Add positional encoding
        pos_enc = self.pe[:, :seq_len, :]           # (1, seq_len, d_model)
        print(f"  Positional encoding:   {pos_enc.shape}")

        output = tok_emb + pos_enc                   # broadcasting adds to each batch
        print(f"  After addition:        {output.shape}")

        return output

# Set up dimensions
vocab_size = 10000
d_model = 64
max_seq_len = 512
batch_size = 2
seq_len = 8

# Create the combined embedding layer
embedding = TransformerEmbedding(vocab_size, d_model, max_seq_len)

# Simulate a batch of tokenized sentences
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
print(f"Input token IDs shape: {token_ids.shape}")
print(f"Input token IDs (first sentence): {token_ids[0].tolist()}")
print()

# Forward pass
output = embedding(token_ids)
print(f"\nFinal output shape: {output.shape}")
print(f"  → {batch_size} sentences")
print(f"  → {seq_len} tokens per sentence")
print(f"  → {d_model} dimensions per token")

Output (token IDs are sampled randomly, so yours will differ):

Input token IDs shape: torch.Size([2, 8])
Input token IDs (first sentence): [4721, 331, 8847, 1023, 5567, 209, 7743, 6118]

  After token embedding: torch.Size([2, 8, 64])
  After scaling:         torch.Size([2, 8, 64])
  Positional encoding:   torch.Size([1, 8, 64])
  After addition:        torch.Size([2, 8, 64])

Final output shape: torch.Size([2, 8, 64])
  → 2 sentences
  → 8 tokens per sentence
  → 64 dimensions per token

Let’s trace what happened to a single token:

with torch.no_grad():
    # Pick the first token in the first sentence
    token_id = token_ids[0, 0].item()
    print(f"Token ID: {token_id}")

    # Its raw embedding (before scaling)
    raw_emb = embedding.token_embedding.weight[token_id]
    print(f"Raw embedding (first 8 dims): {raw_emb[:8].tolist()}")

    # Its positional encoding (position 0)
    pos_enc = embedding.pe[0, 0, :]
    print(f"Positional enc (first 8 dims): {pos_enc[:8].tolist()}")

    # The combined result
    combined = output[0, 0, :]
    print(f"Combined output (first 8 dims): {combined[:8].tolist()}")
    print(f"\nVerification: raw * sqrt(d_model) + pos_enc ≈ combined")
    manual = raw_emb[:8] * math.sqrt(d_model) + pos_enc[:8]
    print(f"Manual calculation:             {manual.tolist()}")

The combined output is what flows into the rest of the transformer. Each token now carries two pieces of information encoded in a single vector:

  1. What the token is (from the token embedding)
  2. Where the token is (from the positional encoding)

Why Addition Instead of Concatenation?

You might wonder: why add the position to the token embedding? Why not concatenate them into a longer vector?

Addition works because of how neural networks process information. The subsequent layers (attention, feed-forward networks) apply linear transformations, and a linear transformation of a sum is the sum of the linear transformations. The network can learn to “separate” the positional and semantic information internally.

Concatenation would also work, but it makes every vector longer (twice as long if the positional part is as wide as the token embedding), and with it the model size and computation. Addition keeps the dimension the same and works well in practice.
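The linearity argument can be checked in a few lines. The vectors below are random stand-ins for a token embedding and a positional encoding, and the bias-free linear layer is an assumption made to keep the identity exact; this sketch illustrates the property and is not a piece of a real transformer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
tok = torch.randn(d_model)   # stand-in token embedding
pos = torch.randn(d_model)   # stand-in positional encoding

proj = nn.Linear(d_model, d_model, bias=False)

with torch.no_grad():
    together = proj(tok + pos)          # project the summed vector
    separate = proj(tok) + proj(pos)    # project each part, then add

print(torch.allclose(together, separate, atol=1e-6))  # True
print(torch.cat([tok, pos]).shape)                    # torch.Size([16]) if we concatenated instead
```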


7. Visualizing Embeddings

To really drive home how embeddings capture meaning, let’s visualize them. We’ll train embeddings on a small dataset and plot them in 2D.

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# A slightly larger vocabulary with semantic groups
vocab = {
    # Animals
    "cat": 0, "dog": 1, "fish": 2, "bird": 3,
    # Colors
    "red": 4, "blue": 5, "green": 6,
    # Actions
    "run": 7, "jump": 8, "swim": 9,
    # Royalty
    "king": 10, "queen": 11, "prince": 12,
}
vocab_size = len(vocab)
embed_dim = 2  # 2D so we can plot directly!

# Training pairs: (word, context_word)
# Words in the same category share contexts
training_pairs = [
    # Animals appear together
    (0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2),
    # Colors appear together
    (4, 5), (5, 4), (4, 6), (6, 4), (5, 6), (6, 5),
    # Actions appear together
    (7, 8), (8, 7), (7, 9), (9, 7), (8, 9), (9, 8),
    # Royalty appears together
    (10, 11), (11, 10), (10, 12), (12, 10), (11, 12), (12, 11),
    # Animals + actions (some cross-group connections)
    (0, 7), (1, 7), (1, 8), (2, 9), (3, 8),
]

# Model
class EmbeddingViz(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        return self.linear(self.embeddings(x))

model = EmbeddingViz()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.tensor([p[0] for p in training_pairs])
targets = torch.tensor([p[1] for p in training_pairs])

# Train
for epoch in range(1000):
    logits = model(inputs)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

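# Extract learned embeddings
# (.detach() is needed: .numpy() raises an error on a tensor that requires grad,
#  and torch.no_grad() does not change the parameter's requires_grad flag)
embeddings = model.embeddings.weight.detach().numpy()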

# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 8))

# Color-code by category
categories = {
    "Animals": ([0, 1, 2, 3], "tab:blue"),
    "Colors":  ([4, 5, 6], "tab:red"),
    "Actions": ([7, 8, 9], "tab:green"),
    "Royalty": ([10, 11, 12], "tab:purple"),
}

id_to_word = {v: k for k, v in vocab.items()}

for cat_name, (indices, color) in categories.items():
    xs = embeddings[indices, 0]
    ys = embeddings[indices, 1]
    ax.scatter(xs, ys, c=color, s=100, label=cat_name, zorder=5)

    for idx in indices:
        ax.annotate(
            id_to_word[idx],
            (embeddings[idx, 0], embeddings[idx, 1]),
            textcoords="offset points",
            xytext=(8, 8),
            fontsize=12,
            fontweight='bold',
        )

ax.set_title("Learned Word Embeddings (2D)", fontsize=16)
ax.set_xlabel("Dimension 0", fontsize=12)
ax.set_ylabel("Dimension 1", fontsize=12)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("embedding_visualization.png", dpi=150, bbox_inches='tight')
print("Plot saved to embedding_visualization.png")
plt.close()

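When you run this, you should see the words cluster by category: animals group together, colors group together, royalty groups together, and actions group together, even though nobody told the model about these categories. The model discovered the groupings purely from co-occurrence patterns. Because the embeddings start from random values and the animal and action groups share a few training pairs, the exact layout changes from run to run, and those two clusters may sit close to each other.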
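
If you would rather check the clustering with numbers than by eye, one option is to compare the average distance between words in the same category with the average distance between words in different categories. The helper below is a hypothetical addition (within_between_distance is not part of the chapter's code), and the demo uses hand-made points rather than trained embeddings.

```python
import numpy as np

def within_between_distance(embeddings, groups):
    """Return (mean within-group distance, mean between-group distance).

    embeddings: array of shape (num_words, dim)
    groups: dict mapping a group name to a list of row indices
    """
    label_of = {i: name for name, indices in groups.items() for i in indices}
    ids = sorted(label_of)
    within, between = [], []
    for pos_a, a in enumerate(ids):
        for b in ids[pos_a + 1:]:
            dist = float(np.linalg.norm(embeddings[a] - embeddings[b]))
            if label_of[a] == label_of[b]:
                within.append(dist)
            else:
                between.append(dist)
    return float(np.mean(within)), float(np.mean(between))

# Hand-made 2D points standing in for learned embeddings (illustrative only)
toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
toy_groups = {"A": [0, 1], "B": [2, 3]}

w, b = within_between_distance(toy, toy_groups)
print(f"mean within-group distance:  {w:.3f}")
print(f"mean between-group distance: {b:.3f}")
```

With the variables from the script above, you could call it as within_between_distance(embeddings, {name: idx for name, (idx, _) in categories.items()}). A within-group distance clearly smaller than the between-group distance indicates that the categories have separated.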

What This Means for LLMs

In a real language model trained on billions of words, relationships like these tend to emerge:

  • “king” – “man” + “woman” ≈ “queen” (the famous word analogy)
  • “Paris” – “France” + “Germany” ≈ “Berlin”
  • Programming terms like “function”, “method”, “procedure” cluster together
  • Emotional words like “happy”, “joyful”, “elated” sit in the same neighborhood

The embedding space becomes a rich map of human language, organized by meaning, all learned automatically from text.
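
To make the analogy arithmetic concrete, here is a minimal sketch using hand-built 2D vectors. These values are assumptions chosen so that one axis stands for "royalty" and the other for "gender"; they are not learned, and real embeddings only approximate this structure.

```python
import torch
import torch.nn.functional as F

# Hand-built toy vectors (NOT learned): dim 0 = "royalty", dim 1 = "gender"
toy_vectors = {
    "king":  torch.tensor([1.0,  1.0]),
    "queen": torch.tensor([1.0, -1.0]),
    "man":   torch.tensor([0.0,  1.0]),
    "woman": torch.tensor([0.0, -1.0]),
}

def nearest_word(query):
    """Return the word whose vector has the highest cosine similarity to query."""
    words = list(toy_vectors)
    matrix = torch.stack([toy_vectors[w] for w in words])           # (4, 2)
    sims = F.cosine_similarity(query.unsqueeze(0), matrix, dim=1)   # (4,)
    return words[int(sims.argmax())]

# king - man + woman: remove the "male" direction, add the "female" one
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
result = nearest_word(query)
print(f"king - man + woman is closest to: {result}")
```

In this toy space the result lands exactly on "queen". With learned embeddings the match is only approximate, and the input words are usually excluded from the nearest-neighbor search.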


8. Putting It All Together — The Complete Embedding Pipeline

Let’s consolidate everything into a single, complete example that traces data through the full embedding pipeline:

import torch
import torch.nn as nn
import math

# ============================================================
# Step 1: Define vocabulary and tokenize
# ============================================================
vocab = {"[PAD]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4,
         "mat": 5, "dog": 6, "ran": 7, "fast": 8}
vocab_size = len(vocab)
d_model = 16

print("=" * 60)
print("COMPLETE EMBEDDING PIPELINE")
print("=" * 60)

# Simulate tokenized input (2 sentences, padded to length 5)
sentences = [
    "the cat sat on mat",
    "the dog ran fast [PAD]",
]
token_ids = torch.tensor([
    [vocab[w] for w in sentences[0].split()],
    [vocab[w] for w in sentences[1].split()],  # no .lower(): "[PAD]" must match the vocab key exactly
])
print(f"\n1. Token IDs:")
print(f"   Shape: {token_ids.shape}")  # (2, 5)
for i, sent in enumerate(sentences):
    print(f"   Sentence {i}: '{sent}' → {token_ids[i].tolist()}")

# ============================================================
# Step 2: Token embedding
# ============================================================
token_emb_layer = nn.Embedding(vocab_size, d_model)
tok_emb = token_emb_layer(token_ids)
print(f"\n2. Token Embeddings:")
print(f"   Shape: {tok_emb.shape}")  # (2, 5, 16)
print(f"   Each of {token_ids.shape[1]} tokens is now a {d_model}-dim vector")

# ============================================================
# Step 3: Scale
# ============================================================
tok_emb_scaled = tok_emb * math.sqrt(d_model)
print(f"\n3. Scaled Token Embeddings:")
print(f"   Shape: {tok_emb_scaled.shape}")  # (2, 5, 16)
print(f"   Multiplied by sqrt({d_model}) = {math.sqrt(d_model):.2f}")

# ============================================================
# Step 4: Positional encoding
# ============================================================
pe = torch.zeros(token_ids.size(1), d_model)
position = torch.arange(0, token_ids.size(1), dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)  # (1, 5, 16)

print(f"\n4. Positional Encoding:")
print(f"   Shape: {pe.shape}")  # (1, 5, 16)
print(f"   Position 0, first 4 dims: {pe[0, 0, :4].tolist()}")
print(f"   Position 1, first 4 dims: {pe[0, 1, :4].tolist()}")

# ============================================================
# Step 5: Add them together
# ============================================================
final = tok_emb_scaled + pe  # (2, 5, 16) + (1, 5, 16) → (2, 5, 16)
print(f"\n5. Final Embedding (token + position):")
print(f"   Shape: {final.shape}")  # (2, 5, 16)
print(f"   This is what enters the transformer!")

# Verify shapes at every step
print(f"\n{'='*60}")
print(f"Shape Summary:")
print(f"  Input token IDs:         {token_ids.shape}")
print(f"  After token embedding:   {tok_emb.shape}")
print(f"  After scaling:           {tok_emb_scaled.shape}")
print(f"  Positional encoding:     {pe.shape}")
print(f"  Final (token + pos):     {final.shape}")
print(f"{'='*60}")

9. Key Takeaways

Before we move on, let’s summarize the critical ideas from this chapter:

  1. Token IDs are arbitrary — the numbers assigned to words carry no semantic meaning. The model needs a richer representation.

  2. One-hot encoding is wasteful — it creates sparse, high-dimensional vectors where every word is equally different from every other word.

  3. Dense embeddings are learned — each word is represented by a short, dense vector. These vectors start random and adjust during training so that similar words end up with similar vectors.

  4. Embeddings are just lookup tables: nn.Embedding is a matrix of shape (vocab_size, d_model). Looking up a token is grabbing a row from this matrix.

  5. Position matters — transformers process all tokens in parallel, so they need explicit positional information. Sinusoidal positional encodings provide this using alternating sine and cosine waves at multiple frequencies.

  6. Token + position are added — the final input to the transformer is the sum of the token embedding and the positional encoding. This single vector carries both what and where information.

  7. Shapes to remember:

    • Input: (batch_size, seq_len) — integers
    • After embedding: (batch_size, seq_len, d_model) — floats
    • This shape is maintained through the entire transformer
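
As a quick check of takeaways 2, 4, and 7, the sketch below confirms that calling an nn.Embedding layer, indexing its weight matrix directly, and multiplying one-hot vectors by that matrix all give the same result. The sizes (6 tokens, 4 dimensions) are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 6, 4          # arbitrary small sizes for illustration
emb = nn.Embedding(vocab_size, d_model)

ids = torch.tensor([[2, 5, 2]])     # (batch_size=1, seq_len=3), integers

looked_up = emb(ids)                # (1, 3, 4), floats
rows = emb.weight[ids]              # plain row indexing into the matrix
one_hot = F.one_hot(ids, num_classes=vocab_size).float()  # (1, 3, 6)
via_matmul = one_hot @ emb.weight   # one-hot times matrix selects the same rows

print("Input shape: ", tuple(ids.shape))
print("Output shape:", tuple(looked_up.shape))
print("Lookup == row indexing:   ", torch.allclose(looked_up, rows))
print("Lookup == one-hot @ matrix:", torch.allclose(looked_up, via_matmul))
```

The one-hot product gives the same rows but multiplies by mostly zeros, which is why embedding layers use direct indexing instead.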

10. Exercises

Exercise 1: Embedding Similarity

Create a vocabulary of 10 words from two categories (5 animals and 5 vehicles). Initialize an nn.Embedding with embed_dim=8. Without any training, compute the cosine similarity between all pairs. Then manually set the embedding weights so that words in the same category are similar and words in different categories are different. Verify with cosine similarity.

Solution
import torch
import torch.nn as nn

vocab = {
    "cat": 0, "dog": 1, "fish": 2, "bird": 3, "horse": 4,
    "car": 5, "truck": 6, "bus": 7, "bike": 8, "plane": 9
}
vocab_size = 10
embed_dim = 8

embedding = nn.Embedding(vocab_size, embed_dim)

# Before: embeddings are random, so similarities are random
print("Before (random embeddings):")
with torch.no_grad():
    cat_emb = embedding.weight[0]
    dog_emb = embedding.weight[1]
    car_emb = embedding.weight[5]
    sim_cat_dog = nn.functional.cosine_similarity(
        cat_emb.unsqueeze(0), dog_emb.unsqueeze(0)
    )
    sim_cat_car = nn.functional.cosine_similarity(
        cat_emb.unsqueeze(0), car_emb.unsqueeze(0)
    )
    print(f"  cosine(cat, dog) = {sim_cat_dog.item():.4f}")
    print(f"  cosine(cat, car) = {sim_cat_car.item():.4f}")

# Manually set embeddings: animals share one pattern, vehicles another
with torch.no_grad():
    # Animals: positive first 4 dims, near-zero last 4
    for i in range(5):
        embedding.weight[i] = torch.tensor(
            [1.0, 0.8, 0.9, 0.7, 0.1, -0.1, 0.0, 0.05]
        ) + torch.randn(8) * 0.1  # small noise for variety

    # Vehicles: near-zero first 4, positive last 4
    for i in range(5, 10):
        embedding.weight[i] = torch.tensor(
            [0.1, -0.1, 0.0, 0.05, 1.0, 0.8, 0.9, 0.7]
        ) + torch.randn(8) * 0.1

print("\nAfter (manually set embeddings):")
with torch.no_grad():
    cat_emb = embedding.weight[0]
    dog_emb = embedding.weight[1]
    car_emb = embedding.weight[5]
    truck_emb = embedding.weight[6]

    print(f"  cosine(cat, dog)   = {nn.functional.cosine_similarity(cat_emb.unsqueeze(0), dog_emb.unsqueeze(0)).item():.4f}")
    print(f"  cosine(car, truck) = {nn.functional.cosine_similarity(car_emb.unsqueeze(0), truck_emb.unsqueeze(0)).item():.4f}")
    print(f"  cosine(cat, car)   = {nn.functional.cosine_similarity(cat_emb.unsqueeze(0), car_emb.unsqueeze(0)).item():.4f}")

Animals have high similarity with each other (~0.9+), vehicles have high similarity with each other (~0.9+), and cross-category similarity is close to zero (roughly -0.1 to 0.2, depending on the random noise).
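
The exercise asks for the similarity between all pairs, while the solution above spot-checks a few. One way to get the full matrix in a single step, sketched here on a freshly initialized (untrained) embedding, is to normalize each row to unit length and multiply the matrix by its own transpose.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding = nn.Embedding(10, 8)  # same sizes as the exercise, random weights

with torch.no_grad():
    # Rows of unit length: their dot products are cosine similarities
    normed = F.normalize(embedding.weight, dim=1)
    sim_matrix = normed @ normed.T   # (10, 10); entry [i, j] = cosine(word i, word j)

print("Similarity matrix shape:", tuple(sim_matrix.shape))
print("cosine(cat, dog) =", round(sim_matrix[0, 1].item(), 4))
print("cosine(cat, car) =", round(sim_matrix[0, 5].item(), 4))
```

After setting the weights manually as in the solution, rerunning the same two lines should show high values in the two 5x5 diagonal blocks (animals with animals, vehicles with vehicles) and values near zero in the off-diagonal blocks.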

Exercise 2: Positional Encoding Analysis

Generate sinusoidal positional encodings with d_model=64 and seq_len=100. Compute the cosine similarity between position 0 and every other position. Plot the result. What pattern do you observe? Why does this matter?

Solution
import torch
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import math

def make_pe(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = make_pe(100, 64)

# Cosine similarity of position 0 with all other positions
pos0 = pe[0].unsqueeze(0)  # (1, 64)
similarities = torch.nn.functional.cosine_similarity(pos0, pe, dim=1)  # (100,)

plt.figure(figsize=(12, 5))
plt.plot(similarities.numpy())
plt.xlabel("Position", fontsize=12)
plt.ylabel("Cosine Similarity with Position 0", fontsize=12)
plt.title("Positional Encoding Similarity (d_model=64)", fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("pe_similarity.png", dpi=150)
print("Plot saved to pe_similarity.png")
plt.close()

print(f"Similarity with position 1: {similarities[1].item():.4f}")
print(f"Similarity with position 5: {similarities[5].item():.4f}")
print(f"Similarity with position 50: {similarities[50].item():.4f}")
print(f"Similarity with position 99: {similarities[99].item():.4f}")

What you’ll observe: Nearby positions have high similarity, and similarity decreases with distance — but it oscillates rather than declining monotonically. This oscillatory pattern is a natural consequence of the sinusoidal functions and gives the model a rich set of relative-position signals to learn from.
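
One reason the pattern matters: for sinusoidal encodings, the similarity between two positions depends only on how far apart they are, not on where they sit in the sequence. Working through the dot product gives cosine(pe[p], pe[p + k]) = mean over frequencies of cos(k · ω_i). The sketch below checks this numerically; the offset of 5 and the starting positions are arbitrary choices.

```python
import math
import torch
import torch.nn.functional as F

def make_pe(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe, div_term

pe, div_term = make_pe(100, 64)
offset = 5  # arbitrary distance between the two positions

# Similarity between positions p and p + offset for several starting points
sims = [
    F.cosine_similarity(pe[p].unsqueeze(0), pe[p + offset].unsqueeze(0)).item()
    for p in (0, 10, 40, 90)
]

# Closed form: average of cos(offset * frequency) over all sin/cos pairs
closed_form = torch.cos(offset * div_term).mean().item()

for p, s in zip((0, 10, 40, 90), sims):
    print(f"cosine(pos {p}, pos {p + offset}) = {s:.4f}")
print(f"closed form for offset {offset}    = {closed_form:.4f}")
```

All four values should agree with the closed form up to floating-point error, so the curve you plotted for position 0 is the same curve you would get from any other reference position.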

Exercise 3: Complete Embedding Module

Build a complete TextEmbedding class that:

  1. Takes raw text strings as input
  2. Tokenizes them using a simple word-level vocabulary
  3. Pads sequences to the same length
  4. Applies token embeddings + positional encodings
  5. Returns the final tensor with shape (batch_size, max_seq_len, d_model)

Test it with a batch of 3 sentences of different lengths.

Solution
import torch
import torch.nn as nn
import math

class TextEmbedding(nn.Module):
    def __init__(self, vocab, d_model, max_seq_len=128):
        super().__init__()
        self.vocab = vocab
        self.pad_id = vocab.get("[PAD]", 0)
        self.unk_id = vocab.get("[UNK]", 1)
        self.d_model = d_model
        self.max_seq_len = max_seq_len

        vocab_size = len(vocab)
        self.token_embedding = nn.Embedding(vocab_size, d_model, padding_idx=self.pad_id)

        # Positional encoding
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def tokenize(self, text):
        """Simple whitespace tokenizer."""
        return [self.vocab.get(w, self.unk_id) for w in text.lower().split()]

    def collate(self, texts):
        """Tokenize and pad a batch of texts."""
        token_lists = [self.tokenize(t) for t in texts]
        max_len = min(max(len(t) for t in token_lists), self.max_seq_len)

        padded = []
        for tokens in token_lists:
            tokens = tokens[:max_len]  # truncate if needed
            tokens = tokens + [self.pad_id] * (max_len - len(tokens))  # pad
            padded.append(tokens)

        return torch.tensor(padded)

    def forward(self, texts):
        """
        Args:
            texts: list of strings
        Returns:
            (batch_size, max_seq_len, d_model) tensor
        """
        # Step 1: Tokenize and pad
        token_ids = self.collate(texts)
        print(f"Token IDs shape: {token_ids.shape}")

        # Step 2: Token embedding
        tok_emb = self.token_embedding(token_ids) * math.sqrt(self.d_model)
        print(f"Token embeddings shape: {tok_emb.shape}")

        # Step 3: Add positional encoding
        seq_len = token_ids.size(1)
        output = tok_emb + self.pe[:, :seq_len, :]
        print(f"Final output shape: {output.shape}")

        return output

# Build vocabulary
vocab = {
    "[PAD]": 0, "[UNK]": 1,
    "the": 2, "cat": 3, "sat": 4, "on": 5, "a": 6,
    "mat": 7, "dog": 8, "ran": 9, "fast": 10,
    "big": 11, "small": 12, "red": 13, "blue": 14,
}

# Create module
embed_module = TextEmbedding(vocab, d_model=32)

# Test with 3 sentences of different lengths
sentences = [
    "the cat sat on a mat",     # 6 tokens
    "the big dog ran fast",     # 5 tokens
    "a small red cat",          # 4 tokens
]

output = embed_module(sentences)
print(f"\nBatch size: {len(sentences)}")
print(f"Output: {output.shape}")
# → torch.Size([3, 6, 32])
# All padded to length 6 (longest sentence), each token = 32-dim vector

Exercise 4: Visualize Positional Encoding Heatmap

Generate positional encodings for seq_len=50 and d_model=32. Create a heatmap showing the encoding values. Observe how the wavelength increases across dimensions.

Solution
import torch
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import math

seq_len = 50
d_model = 32

pe = torch.zeros(seq_len, d_model)
position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

plt.figure(figsize=(14, 6))
plt.imshow(pe.numpy().T, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
plt.colorbar(label='Encoding Value')
plt.xlabel("Position in Sequence", fontsize=12)
plt.ylabel("Embedding Dimension", fontsize=12)
plt.title("Sinusoidal Positional Encoding Heatmap", fontsize=14)
plt.tight_layout()
plt.savefig("pe_heatmap.png", dpi=150)
print("Plot saved to pe_heatmap.png")
plt.close()

print("Observation: Lower dimensions (top rows) oscillate rapidly,")
print("while higher dimensions (bottom rows) oscillate slowly.")
print("This gives each position a unique 'fingerprint' of values.")

What you’ll observe: The heatmap shows a beautiful pattern. The first few dimensions have rapid oscillations (short wavelength), while later dimensions oscillate very slowly (long wavelength). This multi-frequency encoding ensures that every position gets a unique pattern, and the model can detect both local and global positional relationships.
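
The wavelengths visible in the heatmap can also be computed directly. Each sin/cos pair uses the frequency stored in div_term, so its wavelength in positions is 2π divided by that frequency. This is a small sketch using the same d_model=32 as the exercise; the pair indices printed are arbitrary picks.

```python
import math
import torch

d_model = 32
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
)

# Number of positions per full cycle, one value per sin/cos pair
wavelengths = 2 * math.pi / div_term

for pair_idx in (0, 1, 8, 15):
    dims = f"{2 * pair_idx}-{2 * pair_idx + 1}"
    print(f"dims {dims:>5}: wavelength ≈ {wavelengths[pair_idx].item():.1f} positions")
```

The first pair completes a cycle roughly every 2π ≈ 6.3 positions, while the last pair's wavelength is far longer than the 50 positions plotted, which is why the bottom rows of the heatmap look almost constant.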


What’s Next?

You now have all the pieces for the input side of a transformer: raw text goes in, dense vectors encoding both meaning and position come out. Each token is now represented as a rich numerical vector that the model can operate on.

But here’s the question that Chapter 5 will answer: how does the model figure out which tokens should pay attention to which other tokens? In the sentence “The cat sat on the mat because it was tired,” what does “it” refer to: the cat or the mat? The answer lies in the attention mechanism, and that’s where we’re headed next.