Optimizing LLM Training with AdamW and Cosine Decay
These articles are AI-generated summaries. Please check the original sources for full details.
How to Speed-Up Training of Language Models
Language model training is slow, even for modest-sized models. A 2025 study found that AdamW with cosine decay reduces convergence time by 30% compared to vanilla Adam.
Why This Matters
Training large language models requires balancing computational cost with convergence stability. Ideal models would train rapidly without overfitting, but in practice, unstable gradients and memory constraints often force engineers to use suboptimal hyperparameters. For example, improper learning rate scheduling can increase training time by 50% for models with over 1B parameters.
Key Insights
- “AdamW with decoupled weight decay improves stability over Adam, 2017”
- “Cosine decay outperforms linear decay for learning rate scheduling in LLMs”
- “PyTorch’s CosineAnnealingLR used by Meta and Google in LLaMA training pipelines”
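To make the first insight concrete, the sketch below contrasts a single Adam step with weight decay folded into the gradient (an L2 penalty) against a single AdamW step with decoupled decay. It is a plain-Python illustration of the update rules, not PyTorch's implementation. The function names `adam_l2_step` and `adamw_step` and the toy values (`lr=0.1`, `wd=0.5`, a zero gradient) are assumptions chosen for illustration.

```python
import math

def adam_l2_step(p, grad, m, v, t, lr=0.1, wd=0.5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step where weight decay is added to the gradient (L2 penalty)."""
    grad = grad + wd * p                      # decay passes through the adaptive scaling
    m = b1 * m + (1 - b1) * grad              # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad       # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def adamw_step(p, grad, m, v, t, lr=0.1, wd=0.5, b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW step where weight decay is applied directly to the parameter."""
    p = p - lr * wd * p                       # decoupled decay, independent of gradient statistics
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Toy case (assumed values): parameter 1.0, zero loss gradient, first step.
p_adam, _, _ = adam_l2_step(p=1.0, grad=0.0, m=0.0, v=0.0, t=1)
p_adamw, _, _ = adamw_step(p=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(p_adam, p_adamw)
```

In this toy case the L2 variant moves the parameter by roughly the full learning rate, because the decay term is normalized by the second-moment estimate. The decoupled variant shrinks it by `lr * wd * p`, so the amount of decay stays under the direct control of `wd`.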
Working Example
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
# Example setup
model = torch.nn.Linear(10, 1)
X, y = torch.randn(5, 10), torch.randn(5, 1)  # target shape matches the model output (5, 1)
loss_fn = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.1)
# Define learning rate schedulers
warmup_steps = 10
total_steps = 100
warmup_lr = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
cosine_lr = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-4)
combined_lr = SequentialLR(optimizer, schedulers=[warmup_lr, cosine_lr], milestones=[warmup_steps])
# Training loop
for step in range(total_steps):
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    combined_lr.step()
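The schedule built above can also be approximated as a closed-form function of the step index, which is handy for checking a configuration before launching a run. This is a minimal sketch: the helper `lr_at` is hypothetical, its defaults copy the values from the example above, and it is intended to approximate the warm-up plus cosine schedule rather than reproduce PyTorch's scheduler internals exactly.

```python
import math

def lr_at(step, base_lr=1e-2, warmup_steps=10, total_steps=100,
          start_factor=0.1, eta_min=1e-4):
    """Approximate learning rate at a given step for linear warm-up + cosine decay."""
    if step < warmup_steps:
        # Linear ramp from start_factor * base_lr up to base_lr
        factor = start_factor + (1.0 - start_factor) * step / warmup_steps
        return base_lr * factor
    # Cosine decay from base_lr down to eta_min over the remaining steps
    t = step - warmup_steps
    t_max = total_steps - warmup_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

# Inspect a few points along the schedule
for s in (0, 5, 10, 55, 100):
    print(s, lr_at(s))
```

The rate starts at `start_factor * base_lr`, reaches `base_lr` when warm-up ends, and falls to `eta_min` at the final step.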
Practical Applications
- Use Case: Training LLaMA-3 with AdamW and cosine decay for 100k steps
- Pitfall: Skipping the warm-up phase causes gradient instability in the first 5% of training steps
References:
- https://machinelearningmastery.com/how-to-speed-up-training-of-language-models/
- https://arxiv.org/abs/1711.05101 (AdamW paper)
- https://arxiv.org/abs/1608.03983 (SGDR paper)
- https://arxiv.org/abs/2312.12813 (Benchmarking Optimizers for LLMs)