Adapting Rotary Position Embeddings (RoPE) for Long Context Lengths


These articles are AI-generated summaries. Please check the original sources for full details.

RoPE for Long Context Length

Rotary Position Embedding (RoPE) is a widely used technique for encoding token positions in Transformer models. It works well at the context length a model was trained on, but extending a model beyond that length (for example, past 8K tokens) requires modifying RoPE to maintain quality. The Llama 3 family, for example, reaches a context length of 131,072 tokens (128K) by rescaling RoPE frequencies.

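Traditional learned position embeddings struggle with long sequences because they do not generalize to positions beyond those seen in training. RoPE helps because its effect on attention scores depends only on the relative offset between tokens, but naive extrapolation to very long sequences still exposes the model to rotation angles it never saw in training, which can introduce instability and weaken local context. Scaling the RoPE frequencies addresses this: the high-frequency components, which resolve short-range order, are left unchanged, while the low-frequency components are slowed down so that long-range positions stay within the range the model already handles.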
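To make the split between short-range and long-range components concrete, the following minimal sketch counts how many RoPE frequencies fall into each band. It assumes the parameters used in this article's working example (base 10,000, base length 8192, low and high factors of 1 and 4) and a head dimension of 128, which is an illustrative choice. The function name `classify_rope_bands` is hypothetical.

```python
import math

def classify_rope_bands(dim=128, base=10_000.0, base_length=8192,
                        low_factor=1.0, high_factor=4.0):
    """Count RoPE frequencies in each scaling band, judged by wavelength."""
    max_wavelen = base_length / low_factor   # longer than this: fully scaled
    min_wavelen = base_length / high_factor  # shorter than this: unchanged
    counts = {"unchanged": 0, "interpolated": 0, "scaled": 0}
    for i in range(0, dim, 2):
        inv_freq = 1.0 / (base ** (i / dim))
        wavelen = 2 * math.pi / inv_freq  # positions per full rotation
        if wavelen < min_wavelen:
            counts["unchanged"] += 1
        elif wavelen > max_wavelen:
            counts["scaled"] += 1
        else:
            counts["interpolated"] += 1
    return counts

print(classify_rope_bands())
# {'unchanged': 41, 'interpolated': 9, 'scaled': 14}
```

Under these assumptions most frequencies are untouched, which is why short-range behavior is largely preserved; only the slowest-rotating components are stretched to cover the longer context.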

Key Insights

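  • RoPE Formula: RoPE encodes position by rotating pairs of dimensions. For a vector $X_n$ of dimension $d$ at position $n$, with $\theta_i = N^{-2i/d}$ and $0 \le i < \frac{d}{2}$, the rotated vector $\hat{X}_n$ is defined by $\hat{X}_{n,i} = X_{n,i} \cos(n\theta_i) - X_{n,\frac{d}{2}+i} \sin(n\theta_i)$ and $\hat{X}_{n,\frac{d}{2}+i} = X_{n,\frac{d}{2}+i} \cos(n\theta_i) + X_{n,i} \sin(n\theta_i)$.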
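  • Frequency Scaling: Models like Llama 3 rescale RoPE frequencies relative to a base length of 8192 tokens. Each frequency's wavelength $2\pi/\theta_i$ is compared with this length to decide how much that frequency is scaled.
  • Llama 3 Implementation: Llama 3 uses a scale factor of 8. Frequencies whose wavelength exceeds the base length are divided by 8, frequencies whose wavelength is shorter than a quarter of the base length are left unchanged, and those in between are smoothly interpolated between the two, balancing short- and long-range dependencies.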
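The rotation formula can be checked on a small vector. The sketch below uses only the standard library and assumes the half-split pairing (dimension $i$ paired with dimension $\frac{d}{2}+i$) and a base of 10,000; the helper names `rope_rotate` and `dot` are illustrative. It shows that the dot product of two rotated vectors depends only on the offset between their positions, not on the absolute positions.

```python
import math

def rope_rotate(x, n, base=10_000.0):
    """Apply the half-split RoPE rotation to vector x at position n."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = base ** (-2 * i / d)
        c, s = math.cos(n * theta), math.sin(n * theta)
        out[i] = x[i] * c - x[half + i] * s
        out[half + i] = x[half + i] * c + x[i] * s
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 1005), rope_rotate(k, 1002))
print(abs(score_near - score_far) < 1e-9)  # True
```

Because each pair of dimensions is rotated by a plain 2D rotation, the vector's norm is also preserved, so position is carried entirely by the angles.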

Working Example

import torch
import torch.nn as nn
import math

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotates half the hidden dims of the input."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""
    def __init__(self, dim: int, max_position_embeddings: int, base_length: int = 8192):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
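        N = 10_000.0  # RoPE base; Llama 3 itself uses a larger base (500,000)
        scale_factor = 8.0
        low_factor, high_factor = 1.0, 4.0

        # Standard RoPE inverse frequencies, one per pair of dimensions.
        # Built on the CPU; the registered buffers move with the module via .to().
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2).float() / dim))
        # Wavelength: number of positions per full rotation at each frequency.
        wavelen = 2 * math.pi / inv_freq
        max_wavelen = base_length / low_factor
        min_wavelen = base_length / high_factor
        # Blend weight for the middle band: 0 at max_wavelen, 1 at min_wavelen.
        smooth_factor = (base_length / wavelen - low_factor) / (high_factor - low_factor)
        smoothed = (1 - smooth_factor) * inv_freq / scale_factor + smooth_factor * inv_freq
        # Long wavelengths are scaled down, short ones are kept, the rest are blended.
        inv_freq = torch.where(wavelen > max_wavelen, inv_freq / scale_factor, torch.where(wavelen < min_wavelen, inv_freq, smoothed))
        # Duplicate so the frequencies line up with the two halves used by rotate_half.
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)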

        position = torch.arange(max_position_embeddings).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, num_heads, head_dim = x.shape
        dtype = x.dtype
        cos = self.cos.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        output = (x * cos) + (rotate_half(x) * sin)
        return output

Practical Applications

  • Large Language Models: Llama 3 utilizes scaled RoPE to process extremely long documents and conversations.
  • Pitfall: Applying standard, unscaled RoPE to sequences far longer than the training context can degrade positional information and hurt performance, because the model encounters rotation angles it never saw during training.
