NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers near-Lossless 2x-4x Compression

KVzap, a surrogate model on hidden states

KVzap is NVIDIA's method for pruning the KV cache, aimed at the memory cost of long-context language models. The KV cache stores keys and values for every token at every transformer layer and can reach 335 GB for a Llama1-65B model at 128k tokens, which directly limits batch size and inference speed.
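The 335 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes values not stated in this article: 80 layers, 64 key/value heads, a head dimension of 128 (the commonly cited configuration of the original 65B Llama model), 16-bit cache entries, and "128k" read as 128,000 tokens. The function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size of a dense KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Assumed configuration (not given in the article): 80 layers,
# 64 KV heads, head_dim 128, 16-bit entries, 128,000 tokens.
size = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128,
                      seq_len=128_000)
print(f"{size / 1e9:.1f} GB")  # 335.5 GB
```

Because the sequence length enters the product linearly, dropping half of the cached positions halves the cache, which is why pruning along the sequence axis matters for long contexts.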

Why This Matters

Existing cache compression techniques mostly target axes other than sequence length, so the cache still grows linearly with context. Without pruning along the sequence axis, serving long-context models becomes prohibitively expensive, which hinders practical use of increasingly capable LLMs, particularly as context windows expand to hundreds of thousands of tokens.

Key Insights

KVpress Leaderboard, 2026: KVzap currently ranks as the strongest cache pruning baseline on the public KVpress leaderboard.
Surrogate Model Approach: KVzap uses a small, per-layer surrogate model to approximate the computationally expensive KVzip+ oracle scoring method.
Nemotron Dataset: KVzap is trained on filtered prompts from the Nemotron Pretraining Dataset, using approximately 1.2 million training pairs per head.
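To make the surrogate idea concrete, here is a minimal PyTorch sketch of fitting a small model that maps hidden states to one importance score per KV head. Everything specific in it is an assumption for illustration: the single linear layer, the mean-squared-error loss, the tensor sizes, and the synthetic "oracle" targets, which stand in for scores that would really come from the expensive oracle pass. It is not the KVpress implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small.
hidden_dim, num_kv_heads, num_tokens = 64, 4, 512

# Stand-in data: random hidden states and synthetic "oracle" scores
# generated by a hidden linear map plus noise. Real targets would be
# produced by the oracle scoring method, not by this toy generator.
hidden_states = torch.randn(num_tokens, hidden_dim)
true_map = torch.randn(hidden_dim, num_kv_heads) / hidden_dim ** 0.5
oracle_scores = (hidden_states @ true_map
                 + 0.01 * torch.randn(num_tokens, num_kv_heads))

# One small surrogate for one layer: hidden state -> score per KV head.
surrogate = nn.Linear(hidden_dim, num_kv_heads)
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(surrogate(hidden_states), oracle_scores)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At inference time the surrogate needs only the hidden states.
with torch.no_grad():
    predicted = surrogate(hidden_states)  # (num_tokens, num_kv_heads)
```

The appeal of this design is that scoring costs one small forward pass per layer on tensors the model already computes, rather than a separate oracle pass over the prompt.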

Working Example

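# Example usage of KVzap (conceptual - actual implementation in KVpress)
import torch

def apply_kvzap(hidden_states, keys, values, kvzap_model, threshold):
  """
  Applies KVzap-style pruning to one head's KV cache (conceptual sketch).

  Args:
    hidden_states: Tensor of hidden states, shape (seq_len, hidden_dim).
    keys: Key tensor for one head, shape (seq_len, head_dim).
    values: Value tensor for one head, shape (seq_len, head_dim).
    kvzap_model: Trained KVzap model returning one importance score
      per token, shape (seq_len,).
    threshold: Pruning threshold; pairs scoring at or below it are dropped.

  Returns:
    Compressed key and value tensors.
  """
  # Scoring is inference-only, so gradients are not needed.
  with torch.no_grad():
    scores = kvzap_model(hidden_states)
  # Boolean keep-mask over token positions.
  mask = scores > threshold
  compressed_keys = keys[mask]
  compressed_values = values[mask]
  return compressed_keys, compressed_values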

Practical Applications

Long-Context LLM Serving: KVzap enables more efficient deployment of long-context models like Qwen3 and Llama-3 by reducing memory footprint.
Resource-Constrained Environments: KVzap offers a pathway for running large LLMs on hardware with limited VRAM.

References:

https://www.marktechpost.com/2026/01/15/nvidia-ai-open-sourced-kvzap-a-sota-kv-cache-pruning-method-that-delivers-near-lossless-2x-4x-compression/
