NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers near-Lossless 2x-4x Compression

These articles are AI-generated summaries. Please check the original sources for full details.

KVzap, a surrogate model on hidden states

NVIDIA’s KVzap is a KV cache pruning method aimed at the memory limits of long-context language models. The KV cache stores keys and values for every token at every transformer layer and can reach 335 GB for a Llama1-65B model at 128k tokens, which directly limits batch size and inference speed.
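
The 335 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the published Llama-65B shape (80 layers, 64 attention heads of dimension 128, no grouped-query attention), 2 bytes per element (bfloat16 or float16), and reads "128k" as 128,000 tokens; none of these values are stated in the summary itself, and the function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache for a single sequence, in bytes."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-65B shape: 80 layers, 64 heads, head_dim 128, 2-byte elements.
size = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=128_000)
print(f"uncompressed: {size / 1e9:.1f} GB")        # uncompressed: 335.5 GB
print(f"2x compression: {size / 2 / 1e9:.1f} GB")  # 2x compression: 167.8 GB
print(f"4x compression: {size / 4 / 1e9:.1f} GB")  # 4x compression: 83.9 GB
```

Under these assumptions, each token costs about 2.6 MB of cache, which is why pruning along the sequence axis translates almost directly into memory savings.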

Why This Matters

Existing cache compression techniques primarily focus on axes other than the sequence length, failing to tackle the core memory constraint at scale. Without sequence-level pruning, serving long-context models becomes prohibitively expensive, hindering the practical application of increasingly capable LLMs – particularly as context windows expand to hundreds of thousands of tokens.

Key Insights

  • KVpress Leaderboard, 2026: KVzap currently ranks as the strongest cache pruning baseline on this public leaderboard.
  • Surrogate Model Approach: KVzap uses a small, per-layer surrogate model to approximate the scores produced by the computationally expensive KVzip+ oracle.
  • Nemotron Dataset: KVzap is trained on filtered prompts from the Nemotron Pretraining Dataset, using approximately 1.2 million training pairs per head.
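
The surrogate idea in the list above can be sketched as a small regression model: for each layer, a lightweight module reads the hidden state of every token and predicts one importance score per KV head, and it is trained to match scores from the expensive oracle. This is a minimal illustration under stated assumptions, not the KVpress implementation: the class name `SurrogateScorer`, the single linear layer, the mean-squared-error loss, and the random tensors standing in for hidden states and oracle scores are all hypothetical.

```python
import torch
from torch import nn

class SurrogateScorer(nn.Module):
    """Hypothetical per-layer surrogate: hidden state -> one score per KV head."""

    def __init__(self, hidden_dim: int, num_kv_heads: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_kv_heads)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_dim) -> (batch, seq_len, num_kv_heads)
        return self.proj(hidden_states)

torch.manual_seed(0)
hidden_dim, num_kv_heads = 64, 4
scorer = SurrogateScorer(hidden_dim, num_kv_heads)
optimizer = torch.optim.SGD(scorer.parameters(), lr=1e-2)

# Placeholder data: in practice these would be hidden states from the LLM and
# the matching oracle scores; random tensors are used here for illustration.
hidden_states = torch.randn(2, 16, hidden_dim)
oracle_scores = torch.randn(2, 16, num_kv_heads)

# One regression step: fit the surrogate's scores to the oracle's scores.
loss = nn.functional.mse_loss(scorer(hidden_states), oracle_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()

scores = scorer(hidden_states)  # shape: (2, 16, 4)
```

The appeal of this design is cost: scoring needs only a small extra projection over hidden states the model already computes, rather than a run of the oracle at inference time.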

Working Example

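# Example usage of KVzap (conceptual; the actual implementation lives in KVpress)
import torch

def apply_kvzap(hidden_states, keys, values, kvzap_model, threshold):
  """
  Applies KVzap-style pruning to the cache of a single KV head.

  Args:
    hidden_states: Tensor of hidden states, shape (seq_len, hidden_dim).
    keys: Key tensor for this head, shape (seq_len, head_dim).
    values: Value tensor for this head, shape (seq_len, head_dim).
    kvzap_model: Trained KVzap model returning one score per token.
    threshold: Pruning threshold; entries scoring at or below it are dropped.

  Returns:
    Compressed key and value tensors.
  """
  # Scoring is inference-only, so no gradients are needed.
  with torch.no_grad():
    scores = kvzap_model(hidden_states).reshape(-1)  # (seq_len,)
  mask = scores > threshold  # True for the entries to keep
  compressed_keys = keys[mask]
  compressed_values = values[mask]
  return compressed_keys, compressed_values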
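
To make the thresholding step concrete, the following self-contained sketch prunes a toy cache for a single head and reports the resulting compression ratio. The helper name `prune_by_threshold`, the tensor shapes, and the hand-picked scores are illustrative assumptions; they are not part of the KVpress API.

```python
import torch

def prune_by_threshold(scores, keys, values, threshold):
    """Keep only the cache entries whose score exceeds the threshold.

    scores: (seq_len,); keys and values: (seq_len, head_dim).
    """
    mask = scores > threshold
    return keys[mask], values[mask]

seq_len, head_dim = 8, 4
keys = torch.arange(seq_len * head_dim, dtype=torch.float32).reshape(seq_len, head_dim)
values = -keys
# Hand-picked scores: four of the eight entries exceed 0.5.
scores = torch.tensor([0.9, 0.1, 0.8, 0.2, 0.05, 0.7, 0.3, 0.95])

kept_keys, kept_values = prune_by_threshold(scores, keys, values, threshold=0.5)
compression_ratio = seq_len / kept_keys.shape[0]
print(kept_keys.shape, compression_ratio)  # torch.Size([4, 4]) 2.0
```

Because the cut is made on scores rather than on a fixed budget, the number of retained entries depends on the input: raising the threshold prunes more aggressively, lowering it keeps more of the cache.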

Practical Applications

  • Long-Context LLM Serving: KVzap enables more efficient deployment of long-context models like Qwen3 and Llama-3 by reducing memory footprint.
  • Resource-Constrained Environments: KVzap offers a pathway for running large LLMs on hardware with limited VRAM.
