NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers near-Lossless 2x-4x Compression
These articles are AI-generated summaries. Please check the original sources for full details.
KVzap, a surrogate model on hidden states
NVIDIA’s KVzap introduces a novel approach to KV cache pruning, addressing the memory limitations of long-context language models. The KV cache, storing keys and values for every transformer layer, can reach 335 GB for a Llama1-65B model at 128k tokens, directly impacting batch size and inference speed.
Why This Matters
Existing cache compression techniques primarily focus on axes other than the sequence length, failing to tackle the core memory constraint at scale. Without sequence-level pruning, serving long-context models becomes prohibitively expensive, hindering the practical application of increasingly capable LLMs – particularly as context windows expand to hundreds of thousands of tokens.
Key Insights
- KVpress Leaderboard, 2026: KVzap currently ranks as the strongest cache pruning baseline on this public leaderboard.
- Surrogate Model Approach: KVzap utilizes a small, per-layer surrogate model to approximate the computationally expensive KVzip+ oracle scoring method.
- Nemotron Dataset: KVzap is trained on filtered prompts from the Nemotron Pretraining Dataset, utilizing approximately 1.2 million training pairs per head.
Working Example
# Example usage of KVzap (conceptual - actual implementation in KVpress)
import torch
def apply_kvzap(hidden_states, kvzap_model, threshold):
"""
Applies KVzap pruning to the hidden states.
Args:
hidden_states: Tensor of hidden states.
kvzap_model: Trained KVzap model.
threshold: Pruning threshold.
Returns:
Compressed key and value tensors.
"""
scores = kvzap_model(hidden_states)
mask = scores > threshold
compressed_keys = keys[mask]
compressed_values = values[mask]
return compressed_keys, compressed_values
Practical Applications
- Long-Context LLM Serving: KVzap enables more efficient deployment of long-context models like Qwen3 and Llama-3 by reducing memory footprint.
- Resource-Constrained Environments: KVzap offers a pathway for running large LLMs on hardware with limited VRAM.
References:
Continue reading
Next article
Open Responses: A New Standard for AI Agent Inference
Related Content
Tilde Research Aurora: Solving the Neuron Death Crisis in Muon Optimizers
Tilde Research introduces Aurora, a leverage-aware optimizer that fixes Muon's neuron death flaw, achieving 100x data efficiency and a new SoTA on modded-nanoGPT.
Meta Releases TRIBE v2: A Tri-Modal Foundation Model for High-Resolution fMRI Prediction
Meta’s FAIR team introduces TRIBE v2, a tri-modal foundation model that predicts fMRI responses across video, audio, and text stimuli, achieving a group correlation near 0.4 on the HCP 7T dataset.
Google DeepMind's AlphaEvolve: LLM-Driven Semantic Evolution for MARL Algorithms
DeepMind's AlphaEvolve uses LLMs to discover VAD-CFR, an algorithm that surpassed state-of-the-art performance in 10 out of 11 games through semantic evolution.