Sakana AI and NVIDIA Introduce TwELL: 20.5% Faster LLM Inference via Unstructured Sparsity

Sakana AI and NVIDIA researchers have developed TwELL, a new sparse data format and custom CUDA kernels for gated feedforward layers. This system achieves a 20.5% increase in inference throughput and a 21.9% training speedup by skipping zero-value activations.

Why This Matters

Feedforward layers account for over two-thirds of LLM parameters and 80% of total FLOPs. Yet activation sparsity, where many hidden neurons output exactly zero after the activation function, has historically gone unexploited on GPUs. NVIDIA GPUs are optimized for dense matrix multiplication on Tensor Cores, and previous sparse formats such as ELLPACK introduced conversion overhead that negated any theoretical performance gains. The challenge is sharpest in the compute-bound GEMM regime of batched training and high-throughput inference. With Tile-wise ELLPACK (TwELL), the sparse format is produced inside the existing matmul kernel's epilogue, eliminating the extra memory reads and global synchronization that previously made sparse operations slower than dense baselines.

Key Insights

TwELL (Tile-wise ELLPACK) partitions columns into horizontal tiles that match the T_n tile dimension of the matmul kernel, so the sparse format can be produced in the kernel epilogue without global synchronization.
Replacing SiLU with ReLU and applying an L1 regularization term of 2×10⁻⁵ enables 99.5% sparsity in 1.5B models with no measurable downstream performance loss.
A fused inference kernel executes up and down projections simultaneously, cutting DRAM traffic by never materializing the intermediate hidden state in global memory.
A hybrid sparse format for training dynamically routes overflow rows to a dense backup, handling non-uniform sparsity patterns that typically break standard ELL layouts.
Energy efficiency improved by 17% in 2B parameter models, reducing energy per token from 7.85 mJ to 6.51 mJ.
Sparsity is highest in early and middle layers, showing a Pearson correlation of -0.996 between non-zero counts and inference speedup contributions.
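
To make the tile-wise layout concrete, here is a minimal pure-Python sketch. It is not the authors' CUDA implementation: the function names `pack_twell` and `sparse_down_proj`, the tile width `tile_n` (standing in for T_n), and the fixed per-tile capacity `cap` are all assumptions chosen for illustration. Each row of each column tile stores its non-zero values and column indices, padded to `cap`, so a tile can be packed without looking at any other tile.

```python
def pack_twell(h, tile_n, cap):
    """Pack a dense row-major matrix into per-tile ELL-style storage.

    Returns a list of tiles. Each tile holds, per row, `cap` slots of
    (column, value); unused slots are padded with (-1, 0.0).
    Assumption: no row has more than `cap` non-zeros inside one tile.
    """
    n_rows, n_cols = len(h), len(h[0])
    tiles = []
    for start in range(0, n_cols, tile_n):
        stop = min(start + tile_n, n_cols)
        tile = []
        for r in range(n_rows):
            slots = [(c, h[r][c]) for c in range(start, stop) if h[r][c] != 0.0]
            if len(slots) > cap:
                raise ValueError("row overflows tile capacity")
            slots += [(-1, 0.0)] * (cap - len(slots))
            tile.append(slots)
        tiles.append(tile)
    return tiles


def sparse_down_proj(tiles, w_down, n_rows):
    """Compute h @ w_down while touching only the stored non-zeros."""
    d_out = len(w_down[0])
    out = [[0.0] * d_out for _ in range(n_rows)]
    for tile in tiles:
        for r, slots in enumerate(tile):
            for c, v in slots:
                if c < 0:  # padding slot, nothing to accumulate
                    continue
                for j in range(d_out):
                    out[r][j] += v * w_down[c][j]
    return out


# Toy hidden activations: 2 tokens, 8 hidden units, mostly zero.
h = [
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
]
# Down-projection weights: 8 hidden units -> 2 outputs.
w_down = [[float(i), float(i + 1)] for i in range(8)]

tiles = pack_twell(h, tile_n=4, cap=2)
result = sparse_down_proj(tiles, w_down, n_rows=len(h))
print(result)
```

In this toy case only 3 of the 16 activations contribute multiply-adds to the down projection. A real kernel would additionally need a policy for rows that exceed the tile capacity, which this sketch simply rejects.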

Working Examples

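Add an L1 regularization term to the standard cross-entropy loss to induce activation sparsity:

import torch

# ffn_activations and cross_entropy_loss are assumed to be tensors computed earlier in the training step
l1_loss_term = 2e-5 * torch.mean(torch.abs(ffn_activations))
total_loss = cross_entropy_loss + l1_loss_term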
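
The snippet above relies on tensors from a surrounding training loop. As a self-contained illustration of why the ReLU swap matters for this penalty, the following pure-Python sketch computes the same mean-absolute-value term on a toy input and compares the fraction of exact zeros under ReLU and SiLU. The helper names (`l1_penalty`, `zero_fraction`) and the toy values are assumptions for illustration; only the 2e-5 coefficient comes from the article.

```python
import math


def relu(x):
    return x if x > 0.0 else 0.0


def silu(x):
    # SiLU(x) = x * sigmoid(x); non-zero for every non-zero x used here.
    return x / (1.0 + math.exp(-x))


def l1_penalty(activations, coeff=2e-5):
    """coeff * mean(|a|) over all entries, mirroring the torch snippet."""
    flat = [abs(a) for row in activations for a in row]
    return coeff * sum(flat) / len(flat)


def zero_fraction(activations):
    """Fraction of entries that are exactly zero and could be skipped."""
    flat = [a for row in activations for a in row]
    return sum(1 for a in flat if a == 0.0) / len(flat)


# Toy pre-activation values: 2 tokens, 4 hidden units.
pre = [[-1.0, 0.5, -2.0, 4.0], [-0.5, -3.0, 1.5, -1.0]]

post_relu = [[relu(x) for x in row] for row in pre]
post_silu = [[silu(x) for x in row] for row in pre]

penalty = l1_penalty(post_relu)
relu_zeros = zero_fraction(post_relu)
silu_zeros = zero_fraction(post_silu)
print(penalty, relu_zeros, silu_zeros)
```

ReLU maps every negative pre-activation to an exact zero, which a sparse kernel can skip. SiLU leaves small non-zero values behind, so the same input offers nothing to skip.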

Practical Applications

Scaling LLM training on H100 clusters: Use the hybrid sparse format to reduce peak activation memory by 28.1% for 1.5B models. Pitfall: Without L1 regularization, activations stay too dense for the sparse kernels to deliver the reported speedup.
High-throughput batch inference: Implement TwELL fused kernels to achieve 20.5% faster forward execution. Pitfall: Applying these kernels to pre-trained SiLU models without first fine-tuning for ReLU sparsity leaves too few zero activations to skip.
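
The hybrid training format is described only at a high level: overflow rows are routed to a dense backup. One way to picture that routing is the pure-Python sketch below. The per-row capacity rule, the function names `split_hybrid` and `hybrid_matvec`, and the dictionary-based storage are assumptions for illustration, not the authors' implementation.

```python
def split_hybrid(h, cap):
    """Route each row to ELL-style storage or to a dense backup.

    Rows with at most `cap` non-zeros are stored as padded (column, value)
    slots; rows with more non-zeros are kept dense.
    """
    ell_rows = {}    # row index -> list of (col, value), padded to cap
    dense_rows = {}  # row index -> full dense row
    for r, row in enumerate(h):
        nz = [(c, v) for c, v in enumerate(row) if v != 0.0]
        if len(nz) <= cap:
            ell_rows[r] = nz + [(-1, 0.0)] * (cap - len(nz))
        else:
            dense_rows[r] = list(row)
    return ell_rows, dense_rows


def hybrid_matvec(ell_rows, dense_rows, w, n_rows):
    """Compute h @ w for a weight vector w using both storage paths."""
    out = [0.0] * n_rows
    for r, slots in ell_rows.items():
        out[r] = sum(v * w[c] for c, v in slots if c >= 0)
    for r, row in dense_rows.items():
        out[r] = sum(v * x for v, x in zip(row, w))
    return out


# Toy activations: rows 0 and 2 are sparse, row 1 is unusually dense.
h = [
    [0.0, 0.0, 5.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 4.0, 0.0, 6.0],
    [0.0, 7.0, 0.0, 0.0, 0.0, 8.0],
]
w = [1.0, 1.0, 2.0, 1.0, 1.0, 0.5]

ell_rows, dense_rows = split_hybrid(h, cap=2)
out = hybrid_matvec(ell_rows, dense_rows, w, n_rows=len(h))
print(sorted(ell_rows), sorted(dense_rows), out)
```

The design point is that a single dense row no longer forces every row to be padded to its width. The sparse rows keep a small fixed capacity and the outlier takes the dense path.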

References:

https://www.marktechpost.com/2026/05/11/sakana-ai-and-nvidia-introduce-twell-with-cuda-kernels-for-20-5-inference-and-21-9-training-speedup-in-llms/
