Nous Research Token Superposition Training: Accelerating LLM Pre-training by 2.5x
These articles are AI-generated summaries. Please check the original sources for full details.
Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models
Nous Research has introduced Token Superposition Training (TST), a two-phase pre-training method that cuts the wall-clock time needed to reach a target loss by up to 2.5x. At the 10B-A1B mixture-of-experts scale, TST reached the target loss in 4,768 B200-GPU-hours, compared with 12,311 B200-GPU-hours for the baseline.
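As a quick check on the headline figure, the short sketch below recomputes the speedup from the two GPU-hour numbers quoted above. The variable names are illustrative and are not taken from the original release.

```python
# GPU-hour figures quoted in the summary for the 10B-A1B MoE run.
baseline_gpu_hours = 12_311
tst_gpu_hours = 4_768

# Speedup is the ratio of baseline cost to TST cost.
speedup = baseline_gpu_hours / tst_gpu_hours

# Fraction of GPU time saved relative to the baseline.
saving = 1 - tst_gpu_hours / baseline_gpu_hours

print(f"speedup: {speedup:.2f}x, GPU time saved: {saving:.1%}")
```

By these two numbers the ratio comes out at roughly 2.58x, in line with the rounded "up to 2.5x" headline.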
Why This Matters
Modern LLM pre-training is constrained by the cost of processing trillions of tokens, and models are often trained far beyond their compute-optimal point. TST addresses this by increasing data throughput per FLOP during the initial training phase, so a model ingests more text per unit of compute without any permanent change to the inference-time architecture or tokenizer.
Key Insights
- In Phase 1 (superposition), input sequences are collapsed into bags of s tokens, which increases the text ingested per FLOP by a factor of s (Nous Research, 2026).
- The method replaces standard cross-entropy with a multi-hot cross-entropy (MCE) loss that can be implemented using existing fused CE kernels without new auxiliary heads.
- Experiments at the 3B scale showed TST achieving a loss of 2.676 in 20,000 steps, nearly matching a 36,000-step baseline while using 44 percent less GPU time.
- Ablation studies indicate that shared representations are critical: re-initializing the embeddings at the transition into the recovery phase caused the loss to spike to 2.938, worse than the baseline.
- TST is effective across scales from 270M to 10B parameters, showing consistent gains in downstream benchmarks like HellaSwag, ARC, and MMLU.
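The bagging step and the multi-hot loss from the list above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the Nous Research implementation: the summary does not say how a bag is collapsed into one input, so averaging the token embeddings is an assumption, and MCE is written here as the mean of ordinary cross-entropy terms over the tokens in the target bag, which is one reading consistent with the claim that existing CE kernels can be reused. All function names are hypothetical.

```python
import math

def make_bags(token_ids, s):
    """Group consecutive token ids into bags of s (a trailing remainder is dropped)."""
    usable = len(token_ids) - len(token_ids) % s
    return [token_ids[i:i + s] for i in range(0, usable, s)]

def superpose(bag, embedding_table):
    """Collapse one bag into a single vector by averaging (averaging is an assumption)."""
    dim = len(embedding_table[0])
    return [sum(embedding_table[t][d] for t in bag) / len(bag) for d in range(dim)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def multi_hot_cross_entropy(logits, target_bag):
    """Mean of the standard CE terms, one per token in the target bag (assumed form)."""
    logp = log_softmax(logits)
    return -sum(logp[t] for t in target_bag) / len(target_bag)

# Toy setup: vocabulary of 10 tokens with 2-dimensional embeddings.
vocab_size = 10
embedding_table = [[float(i), float(-i)] for i in range(vocab_size)]

tokens = [3, 1, 4, 1, 5, 9, 2, 6]
bags = make_bags(tokens, s=4)                       # two bags of four tokens
first_input = superpose(bags[0], embedding_table)   # one input vector for the first bag

# With uniform logits, the loss against the next bag equals log(vocab_size).
uniform_logits = [0.0] * vocab_size
loss = multi_hot_cross_entropy(uniform_logits, bags[1])
```

With s = 1 each bag holds a single token and `multi_hot_cross_entropy` reduces to ordinary next-token cross-entropy, which is why no extra output head is needed in this reading.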
Practical Applications
- Compute-bound pre-training (NVIDIA B200 clusters + TST Phase 1): Reduces wall-clock time for large-scale MoE models. Pitfall: Applying TST in data-bound settings, where the supply of raw tokens rather than compute is the bottleneck.
- Future-signal auxiliary objectives (Single output head + MCE loss): Regularizes embedding geometry without the parameter overhead of Multi-Token Prediction (MTP). Pitfall: Failing to maintain representation continuity between Phase 1 and Phase 2, which destroys training efficiency.
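The data-bound pitfall comes down to simple token accounting: if each Phase 1 position stands for s tokens, a TST run consumes raw text faster than a baseline run of the same length. The sketch below makes that concrete. The function name, the phase split, and every number in it are hypothetical and chosen only for illustration; the summary does not give the actual schedule.

```python
def tokens_ingested(total_steps, phase1_steps, positions_per_step, s):
    """Raw tokens consumed by a two-phase run (hypothetical accounting).

    Phase 1 steps see s tokens per position; the remaining steps see one.
    """
    phase1 = phase1_steps * positions_per_step * s
    phase2 = (total_steps - phase1_steps) * positions_per_step
    return phase1 + phase2

# Illustrative numbers only.
total_steps = 10
positions_per_step = 100

baseline_tokens = tokens_ingested(total_steps, 0, positions_per_step, s=1)
tst_tokens = tokens_ingested(total_steps, 4, positions_per_step, s=4)

# Extra raw text the TST run needs for the same number of steps.
extra_ratio = tst_tokens / baseline_tokens
```

When the corpus is the limiting resource, this extra consumption is a cost rather than a benefit, which is the situation the first pitfall warns about.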