Nous Research Token Superposition Training: Accelerating LLM Pre-training by 2.5x

These articles are AI-generated summaries. Please check the original sources for full details.

Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models

Nous Research has introduced Token Superposition Training (TST), a two-phase pre-training method that cuts the wall-clock time needed to reach a target loss by up to 2.5x. At the 10B-A1B mixture-of-experts scale, TST reached the target loss in 4,768 B200-GPU-hours, compared to 12,311 B200-GPU-hours for the baseline.
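
As a quick sanity check on the GPU-hour figures quoted above, the ratio can be computed directly. This is only arithmetic on the two numbers in this summary; the variable names are illustrative and not taken from the original source.

```python
# GPU-hour figures quoted in this summary for the 10B-A1B MoE run.
tst_gpu_hours = 4_768
baseline_gpu_hours = 12_311

# Speedup: how many times fewer GPU-hours TST needed to hit the same target loss.
speedup = baseline_gpu_hours / tst_gpu_hours

# Fraction of the baseline's GPU-hours that TST saved.
savings = 1 - tst_gpu_hours / baseline_gpu_hours

print(f"speedup: {speedup:.2f}x")
print(f"GPU-hours saved: {savings:.1%}")
```

The ratio comes out slightly above 2.5, which is in line with the headline figure.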

Why This Matters

Modern LLM pre-training is constrained by the cost of processing trillions of tokens, and models are often trained far beyond their compute-optimal point. TST addresses this by increasing the amount of text processed per FLOP during the initial training phase, so a model ingests more text per unit of compute without any permanent change to the inference-time architecture or tokenizer.

Key Insights

  • In Phase 1 (superposition), TST collapses input sequences into bags of s tokens, increasing text ingestion per FLOP by a factor of s (Nous Research, 2026).
  • The method replaces standard cross-entropy with a multi-hot cross-entropy (MCE) loss that can be implemented using existing fused CE kernels without new auxiliary heads.
  • Experiments at the 3B scale showed TST achieving a loss of 2.676 in 20,000 steps, nearly matching a 36,000-step baseline while using 44 percent less GPU time.
  • Ablation studies confirm that shared representations are critical: re-initializing the embeddings at the transition into the recovery phase caused the loss to spike to 2.938, worse than the baseline.
  • TST is effective across scales from 270M to 10B parameters, showing consistent gains in downstream benchmarks like HellaSwag, ARC, and MMLU.
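
The first two insights can be illustrated with a minimal pure-Python sketch. It rests on assumptions this summary does not confirm: that a bag's input is the mean of its token embeddings, and that the MCE target is a uniform distribution over the tokens of the target bag. Under the second assumption, MCE equals the average of s ordinary cross-entropy terms on the same logits, which is one plausible reading of why existing CE kernels could be reused. All function names here are hypothetical.

```python
import math

def make_bags(token_ids, s):
    """Group a token sequence into consecutive bags of s tokens (remainder dropped)."""
    n_bags = len(token_ids) // s
    return [token_ids[i * s:(i + 1) * s] for i in range(n_bags)]

def superpose(bag, embedding):
    """ASSUMPTION: a bag's input vector is the mean of its token embeddings."""
    dim = len(embedding[0])
    return [sum(embedding[t][d] for t in bag) / len(bag) for d in range(dim)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy(logits, target):
    """Standard single-label cross-entropy."""
    return -log_softmax(logits)[target]

def multi_hot_cross_entropy(logits, target_bag):
    """ASSUMPTION: CE against a uniform target over the tokens in target_bag."""
    logp = log_softmax(logits)
    return -sum(logp[t] for t in target_bag) / len(target_bag)

# Toy example: 8 tokens, bag size s = 2, so the model sees 4 positions, not 8.
tokens = [3, 1, 4, 1, 5, 2, 0, 3]
bags = make_bags(tokens, s=2)

# Toy 6-token vocabulary with 2-dimensional embeddings (illustrative values).
embedding = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [2.0, 2.0], [1.0, 3.0], [3.0, 1.0]]
first_input = superpose(bags[0], embedding)

# Logits the model might emit at position 0 when predicting the next bag.
logits = [0.2, 1.5, -0.3, 0.9, 0.0, -1.1]
mce = multi_hot_cross_entropy(logits, bags[1])
mean_ce = sum(cross_entropy(logits, t) for t in bags[1]) / len(bags[1])
```

In this sketch `mce` and `mean_ce` agree up to floating-point error, so a single output head and an ordinary CE routine are enough; no auxiliary heads are introduced. The actual formulation used by Nous Research may differ in how bags are combined and how targets are weighted.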

Practical Applications

  • Compute-bound pre-training (NVIDIA B200 clusters + TST Phase 1): Dramatically reduces wall-clock time for large-scale MoE models. Pitfall: Using TST in data-bound settings where raw token consumption is the bottleneck rather than compute.
  • Future-signal auxiliary objectives (Single output head + MCE loss): Regularizes embedding geometry without the parameter overhead of Multi-Token Prediction (MTP). Pitfall: Failing to maintain representation continuity between Phase 1 and Phase 2, which destroys training efficiency.
