Sakana AI and NVIDIA Introduce TwELL: 20.5% Faster LLM Inference via Unstructured Sparsity
These articles are AI-generated summaries. Please check the original sources for full details.
Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs
Sakana AI and NVIDIA researchers have developed TwELL, a new sparse data format, together with custom CUDA kernels for gated feedforward layers. By skipping zero-valued activations, the system achieves a 20.5% increase in inference throughput and a 21.9% training speedup.
Why This Matters
Feedforward layers account for over two-thirds of LLM parameters and 80% of total FLOPs, yet activation sparsity (many hidden neurons output exactly zero after the activation function) has historically gone unexploited on GPUs. NVIDIA GPUs are optimized for dense matrix multiplications on Tensor Cores, and earlier sparse formats such as ELLPACK introduced conversion overhead that negated the theoretical gains. The difficulty is greatest in the compute-bound GEMM regime of batched training and high-throughput inference. With Tile-wise ELLPACK (TwELL), the sparse representation is produced inside the existing matmul kernel's epilogue, which removes the extra memory reads and global synchronization that previously made sparse operations slower than dense baselines.
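To make "skipping zero-valued activations" concrete, here is a minimal pure-Python sketch of a ReLU-gated feedforward layer at toy sizes. The function names, shapes, and weights are illustrative assumptions, not the researchers' code, and the real kernels run on the GPU in CUDA. It only shows that a down projection which skips zero hidden units gives the same result as the dense one.

```python
# Toy ReLU-gated feedforward layer. All names, shapes, and weights are
# hypothetical and chosen only for illustration.

def matvec(x, w):
    """Return x @ w for a vector x (length d) and a matrix w (d rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def gated_hidden(x, w_gate, w_up):
    """Hidden state h = relu(x @ w_gate) * (x @ w_up), elementwise."""
    gate = [max(0.0, g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    return [g * u for g, u in zip(gate, up)]

def down_dense(h, w_down):
    """Dense down projection: every hidden unit is multiplied, zero or not."""
    return matvec(h, w_down)

def down_sparse(h, w_down):
    """Same projection, but rows of w_down for zero hidden units are never read."""
    out = [0.0] * len(w_down[0])
    for k, hk in enumerate(h):
        if hk == 0.0:
            continue  # skipped work: this hidden unit contributes nothing
        for j, w in enumerate(w_down[k]):
            out[j] += hk * w
    return out

x = [1.0, -2.0]
w_gate = [[1.0, -1.0, 2.0, 0.5], [1.0, 1.0, -1.0, 0.5]]
w_up = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [2.0, 2.0]]

h = gated_hidden(x, w_gate, w_up)
print(h)                       # three of the four hidden units are zero
print(down_dense(h, w_down))
print(down_sparse(h, w_down))  # same values, computed from one row of w_down
```

In this toy case only one of four rows of w_down is touched. The saving grows with the fraction of zeros, which is why the sparsity level matters so much for the reported speedups.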
Key Insights
- TwELL (Tile-wise ELLPACK) partitions columns into horizontal tiles that match the T_n dimension of the matmul kernel, so the sparse format can be produced in the kernel epilogue without global synchronization.
- Replacing SiLU with ReLU and applying L1 regularization with a coefficient of 2×10⁻⁵ enables 99.5% sparsity in 1.5B models with no measurable downstream performance loss.
- A fused inference kernel executes up and down projections simultaneously, cutting DRAM traffic by never materializing the intermediate hidden state in global memory.
- A hybrid sparse format for training dynamically routes overflow rows to a dense backup, handling non-uniform sparsity patterns that typically break standard ELL layouts.
- Energy per token in 2B-parameter models fell from 7.85 mJ to 6.51 mJ, a reduction of about 17%.
- Sparsity is highest in early and middle layers, and per-layer non-zero counts correlate almost perfectly and negatively with each layer's contribution to the inference speedup (Pearson correlation of -0.996).
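The tile-wise layout and the dense backup for overflow rows can be sketched in pure Python. This summary does not specify the actual TwELL memory layout, so everything below is an assumption made for illustration: the PAD sentinel, the fixed number of slots per tile, the function names, and the choice to move a whole row to the backup. The point is that each column tile is packed using only its own entries, which is the property that lets a kernel write its tile without coordinating with other tiles.

```python
PAD = -1  # sentinel column index for unused slots (an assumption of this sketch)

def pack_tilewise_ell(matrix, tile_n, slots):
    """Pack each row tile by tile; rows that overflow any tile go to a dense backup.

    Returns (values, cols, dense_backup). values[r][t] and cols[r][t] each hold
    `slots` entries for row r and column tile t. dense_backup maps the index of
    every overflowing row to a copy of its dense row.
    """
    n_cols = len(matrix[0])
    n_tiles = (n_cols + tile_n - 1) // tile_n
    values, cols, dense_backup = [], [], {}
    for r, row in enumerate(matrix):
        row_vals, row_cols, overflow = [], [], False
        for t in range(n_tiles):
            start = t * tile_n
            # Non-zeros of this tile only, with their global column indices.
            nz = [(c, v) for c, v in enumerate(row[start:start + tile_n], start) if v != 0]
            if len(nz) > slots:
                overflow = True
                break
            pad = slots - len(nz)
            row_vals.append([v for _, v in nz] + [0.0] * pad)
            row_cols.append([c for c, _ in nz] + [PAD] * pad)
        if overflow:
            # Hybrid fallback: keep the row dense and leave its sparse slots empty.
            dense_backup[r] = list(row)
            row_vals = [[0.0] * slots for _ in range(n_tiles)]
            row_cols = [[PAD] * slots for _ in range(n_tiles)]
        values.append(row_vals)
        cols.append(row_cols)
    return values, cols, dense_backup

def unpack(values, cols, dense_backup, n_cols):
    """Rebuild the dense matrix, to check that packing loses nothing."""
    out = []
    for r, (row_vals, row_cols) in enumerate(zip(values, cols)):
        if r in dense_backup:
            out.append(list(dense_backup[r]))
            continue
        row = [0.0] * n_cols
        for tile_vals, tile_cols in zip(row_vals, row_cols):
            for v, c in zip(tile_vals, tile_cols):
                if c != PAD:
                    row[c] = v
        out.append(row)
    return out

hidden = [
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5],
    [0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # three non-zeros in tile 0
]
values, cols, dense_backup = pack_tilewise_ell(hidden, tile_n=4, slots=2)
print(cols[0])               # column indices stored for row 0, one list per tile
print(sorted(dense_backup))  # rows routed to the dense backup
```

With two slots per tile, the third row exceeds the capacity of its first tile and is routed to the backup, while the other two rows fit in the sparse slots. A plain ELL layout would instead have to widen every row to fit the densest one.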
Working Examples
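The snippet below adds an L1 penalty on the feedforward activations to the standard cross-entropy loss to induce activation sparsity:
import torch
# ffn_activations and cross_entropy_loss are assumed to come from the model's forward pass
l1_loss_term = 2e-5 * torch.mean(torch.abs(ffn_activations))
total_loss = cross_entropy_loss + l1_loss_term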
Practical Applications
- Scaling LLM training on H100 clusters: Use the hybrid sparse format to reduce peak activation memory by 28.1% for 1.5B models. Pitfall: Without L1 regularization, activations stay dense and the kernels have little to skip, so the speedup is lost.
- High-throughput batch inference: Implement TwELL fused kernels to achieve 20.5% faster forward execution. Pitfall: Running these kernels on pre-trained SiLU models without fine-tuning for ReLU sparsity, since SiLU rarely outputs exact zeros.
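The SiLU pitfall can be checked with a short, self-contained sketch. It feeds random Gaussian pre-activations, which are an assumption standing in for a real model's values, through both functions and counts exact zeros. The roughly 50% it measures for ReLU comes from the symmetric random input and is unrelated to the 99.5% figure above, which depends on training with L1 regularization.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def silu(x):
    # SiLU(x) = x * sigmoid(x); mathematically zero only at x == 0.
    return x / (1.0 + math.exp(-x))

def zero_fraction(values):
    """Fraction of entries that are exactly zero, i.e. skippable by a sparse kernel."""
    return sum(1 for v in values if v == 0.0) / len(values)

random.seed(0)
# Hypothetical pre-activations: standard normal samples, not taken from any model.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_zeros = zero_fraction([relu(x) for x in pre_activations])
silu_zeros = zero_fraction([silu(x) for x in pre_activations])
print(f"ReLU exact zeros: {relu_zeros:.1%}")
print(f"SiLU exact zeros: {silu_zeros:.1%}")
```

SiLU maps negative inputs to small negative outputs rather than to zero, so a kernel that skips exact zeros finds nothing to skip until the model has been adapted to ReLU.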