Skip to main content

On This Page

Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

5 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

This article evaluates six prominent inference runtimes for large language model (LLM) serving in 2025, emphasizing their design principles, performance metrics, and suitability for specific workloads. The key factors influencing their effectiveness are KV cache management, throughput, latency, and hardware compatibility.


1. vLLM: PagedAttention for High Throughput

  • Design: Uses PagedAttention, a block-based KV cache system that partitions sequences into fixed-size blocks, reducing fragmentation to <4% compared to 60–80% in naive allocators.
  • Performance:
    • 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.
    • Integrates FP8 KV quantization to reduce memory usage and improve decode throughput.
  • KV & Memory Behavior: Optimized for GPU utilization with continuous batching and native support for prefix sharing.
  • Use Case: Default choice for general-purpose LLM serving requiring high throughput and hardware flexibility.

2. TensorRT LLM: NVIDIA-Centric Latency Optimization

  • Design: Built on NVIDIA TensorRT, generating fused kernels per model and shape. Features paged KV cache, quantized KV (INT8/FP8), and offloading to CPU.
  • Performance:
    • 14× reduction in time-to-first-token (TTFT) for CPU-based KV reuse on H100/NGH200 GPUs.
    • Low single-request latency when compiled for exact models/configurations.
  • KV & Memory Behavior: Strong control over memory via paged and quantized KV, with application-level cache-aware routing.
  • Use Case: Latency-critical workloads in NVIDIA-only environments requiring per-model tuning.

3. Hugging Face TGI v3: Long Prompt Optimization

  • Design: Server-focused stack with chunked prefill for long inputs and prefix KV caching for chat histories. Supports PyTorch/TensorRT backends.
  • Performance:
    • 3× higher tokens per second and 13× faster than vLLM for long prompts with prefix caching.
    • Similar P50 latency to vLLM for short/mid-length prompts.
  • KV & Memory Behavior: Uses paged attention and reduces memory via chunking and quantization (e.g., bitsandbytes, GPTQ).
  • Use Case: Production stacks on Hugging Face, especially for chat workloads with long histories.

4. LMDeploy: TurboMind for Throughput Maximization

  • Design: Combines TurboMind (NVIDIA CUDA kernels) and a PyTorch fallback. Features blocked KV cache, dynamic splitting, and weight/KV quantization (AWQ, INT8/INT4).
  • Performance:
    • 1.8× higher request throughput than vLLM, attributed to persistent batching and blocked KV.
    • 4-bit inference 2.4× faster than FP16 for supported models.
  • KV & Memory Behavior: Blocked KV trade contiguous buffers for a grid of chunks, similar to PagedAttention but with different layout.
  • Use Case: NVIDIA-centric deployments prioritizing maximum throughput with TurboMind tooling.

5. SGLang: RadixAttention for Structured Workloads

  • Design: Implements RadixAttention, a prefix-sharing mechanism using a radix tree for KV storage. Also supports a DSL for structured LLM programs (e.g., agents, RAG).
  • Performance:
    • 6.4× higher throughput and 3.7× lower latency than vLLM on structured workloads.
    • 50–99% KV cache hit rates for multi-turn chats or repeated contexts.
  • KV & Memory Behavior: Focuses on reuse via radix trees, integrating with hierarchical caching systems (GPU/CPU).
  • Use Case: Agentic systems, RAG pipelines, and toolchains with heavy prefix reuse.

6. DeepSpeed Inference / ZeRO Inference: Offloading for Large Models

  • Design: Uses ZeRO Inference to offload model weights and KV cache to CPU/NVMe for running models exceeding GPU memory.
  • Performance:
    • 43 tokens/second on CPU offload (V100 32GB), 30 tokens/second on NVMe.
    • 1.3–2.4× faster than partial offload due to larger batch sizes.
  • KV & Memory Behavior: Offloads weights/KV to reduce GPU memory usage, but suffers from high TTFT/P99 latency.
  • Use Case: Very large models on constrained GPUs (e.g., 32GB V100) for offline/batch inference.

Comparison Summary Table

RuntimeMain Design IdeaRelative StrengthKV StrategyTypical Use Case
vLLMPagedAttention, continuous batchingHigh throughput at given TTFTPaged KV blocks, FP8 supportGeneral-purpose GPU serving
TensorRT LLMCompiled kernels + KV reuseLow latency, high throughput on NVIDIAPaged/quantized KV, offloadNVIDIA-only, latency-sensitive workloads
TGI v3Long prompt pipelineStrong long-prompt performancePaged KV, chunked prefillHugging Face APIs, long chat histories
LMDeployTurboMind + blocked KV1.8× vLLM throughput for 4-bit modelsBlocked KV, weight/KV quantizationNVIDIA deployments, raw throughput
SGLangRadixAttention + structured programs6.4× throughput for structured workloadsRadix tree KV reuseAgents, RAG, high prefix reuse
DeepSpeedOffloading for huge modelsEnables large models on small GPUsCPU/NVMe offloadOffline inference, low-QPS services

Practical Recommendations

  • Default Choice: Use vLLM for balanced throughput, TTFT, and hardware flexibility.
  • NVIDIA Focus: Opt for TensorRT LLM if fine-grained latency control and NVIDIA-specific tuning are needed.
  • Hugging Face Integration: Choose TGI v3 for long chat workflows with prefix caching.
  • Max Throughput: Use LMDeploy with TurboMind for 4-bit models on NVIDIA GPUs.
  • Structured Workloads: Deploy SGLang for agents/RAG with high prefix reuse.
  • Large Models on Small GPUs: Use DeepSpeed/ZeRO for offloading, accepting higher latency.

Conclusion

All six runtimes prioritize KV cache optimization as the critical bottleneck for LLM serving. The most effective solutions treat KV as a first-class data structure, employing techniques like paging, quantization, reuse, and offloading to maximize throughput and minimize latency. The choice depends on workload type, hardware constraints, and performance priorities.

Reference: Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

Continue reading

Next article

Understanding Amazon EBS: Persistent Storage for EC2 Instances

Related Content