Comparing the Top 6 Inference Runtimes for LLM Serving in 2025
These articles are AI-generated summaries. Please check the original sources for full details.
This article evaluates six prominent inference runtimes for large language model (LLM) serving in 2025, emphasizing their design principles, performance metrics, and suitability for specific workloads. The key factors influencing their effectiveness are KV cache management, throughput, latency, and hardware compatibility.
1. vLLM: PagedAttention for High Throughput
- Design: Uses PagedAttention, a block-based KV cache system that stores each sequence's KV cache in fixed-size blocks, cutting wasted KV memory to under 4%, compared with 60–80% in naive contiguous allocators.
- Performance:
  - 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.
  - Integrates FP8 KV quantization to reduce memory usage and improve decode throughput.
- KV & Memory Behavior: Optimized for GPU utilization with continuous batching and native support for prefix sharing.
- Use Case: Default choice for general-purpose LLM serving requiring high throughput and hardware flexibility.
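The block-based idea can be sketched as a toy allocator. This is a minimal illustration, not vLLM's implementation: the class name `PagedKVAllocator`, the block size of 16 tokens, and the method names are all assumptions made for the example. It shows why waste stays small: only the last block of each sequence can be partly empty.

```python
BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


class PagedKVAllocator:
    """Toy paged allocator: each sequence owns a table of fixed-size blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, n: int) -> None:
        """Grow a sequence by n tokens, taking new blocks only when needed."""
        table = self.tables.get(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n
        needed = -(-new_len // BLOCK_SIZE)   # ceiling division
        extra = needed - len(table)
        if extra > len(self.free):
            raise MemoryError("out of KV blocks")
        for _ in range(extra):
            table.append(self.free.pop())
        self.tables[seq_id] = table
        self.lengths[seq_id] = new_len

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_fraction(self) -> float:
        """Share of allocated token slots that hold no token."""
        allocated = sum(len(t) for t in self.tables.values()) * BLOCK_SIZE
        used = sum(self.lengths.values())
        return 0.0 if allocated == 0 else 1 - used / allocated


alloc = PagedKVAllocator(num_blocks=128)
alloc.append_tokens("a", 1000)  # 63 blocks: 1008 slots for 1000 tokens
alloc.append_tokens("b", 500)   # 32 blocks: 512 slots for 500 tokens
print(f"paged waste: {alloc.wasted_fraction():.1%}")
```

A naive allocator that reserves a contiguous buffer sized for the maximum sequence length would leave most of that buffer unused for short sequences, which is the kind of waste the 60–80% figure refers to.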
2. TensorRT LLM: NVIDIA-Centric Latency Optimization
- Design: Built on NVIDIA TensorRT, generating fused kernels per model and shape. Features paged KV cache, quantized KV (INT8/FP8), and offloading to CPU.
- Performance:
  - 14× reduction in time-to-first-token (TTFT) from reusing CPU-offloaded KV cache on H100/GH200 GPUs.
  - Low single-request latency when compiled for exact models/configurations.
- KV & Memory Behavior: Strong control over memory via paged and quantized KV, with application-level cache-aware routing.
- Use Case: Latency-critical workloads in NVIDIA-only environments requiring per-model tuning.
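The benefit of quantized KV can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not part of any TensorRT LLM API: the function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, and the per-block scaling factors that quantized formats also store are ignored.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int) -> int:
    """Approximate KV cache size for a decoder-only transformer."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape and workload, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128,
             seq_len=8192, batch_size=16)

fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)
fp8 = kv_cache_bytes(**shape, bytes_per_elem=1)  # also applies to INT8

print(f"FP16 KV cache: {fp16 / 2**30:.0f} GiB")  # prints 16 GiB
print(f"FP8  KV cache: {fp8 / 2**30:.0f} GiB")   # prints 8 GiB
```

Halving the bytes per element halves the cache, which either doubles the batch or context that fits on the GPU or shrinks what has to be offloaded to CPU.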
3. Hugging Face TGI v3: Long Prompt Optimization
- Design: Server-focused stack with chunked prefill for long inputs and prefix KV caching for chat histories. Supports PyTorch/TensorRT backends.
- Performance:
  - Handles about 3× more tokens and runs up to 13× faster than vLLM on long prompts with prefix caching.
  - Similar P50 latency to vLLM for short/mid-length prompts.
- KV & Memory Behavior: Uses paged attention and reduces memory via chunking and quantization (e.g., bitsandbytes, GPTQ).
- Use Case: Production stacks on Hugging Face, especially for chat workloads with long histories.
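How chunked prefill and prefix caching combine can be sketched as a small planning function. This is an illustrative model, not TGI's scheduler: the function name `plan_prefill`, its parameters, and the chunk size of 1024 tokens are assumptions. Tokens covered by a cached prefix are skipped, and the remainder is processed in bounded chunks so that one long prompt does not monopolize a forward pass.

```python
def plan_prefill(prompt_len: int, cached_prefix_len: int,
                 chunk_size: int) -> list[tuple[int, int]]:
    """Return the (start, end) token ranges that still need prefill."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Tokens whose KV is already cached do not need to be recomputed.
    start = min(cached_prefix_len, prompt_len)
    return [(s, min(s + chunk_size, prompt_len))
            for s in range(start, prompt_len, chunk_size)]


# A 10,000-token chat history, first with a cold cache, then with the
# first 8,192 tokens already cached from the previous turn.
cold = plan_prefill(10_000, cached_prefix_len=0, chunk_size=1024)
warm = plan_prefill(10_000, cached_prefix_len=8192, chunk_size=1024)

print(len(cold))  # 10 chunks to prefill
print(warm)       # [(8192, 9216), (9216, 10000)]
```

With a warm cache only the new tail of the conversation is prefilled, which is where the large long-prompt speedups come from.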
4. LMDeploy: TurboMind for Throughput Maximization
- Design: Combines TurboMind (NVIDIA CUDA kernels) and a PyTorch fallback. Features blocked KV cache, dynamic splitting, and weight/KV quantization (AWQ, INT8/INT4).
- Performance:
  - 1.8× higher request throughput than vLLM, attributed to persistent batching and blocked KV.
  - 4-bit inference 2.4× faster than FP16 for supported models.
- KV & Memory Behavior: Blocked KV trades contiguous buffers for a grid of chunks, similar to PagedAttention but with a different layout.
- Use Case: NVIDIA-centric deployments prioritizing maximum throughput with TurboMind tooling.
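To make the 4-bit idea concrete, here is a sketch of plain group-wise round-to-nearest quantization. It is not AWQ, which additionally rescales weights based on activation statistics, and it is not how TurboMind's kernels are written; the function names and the sample weights are assumptions for illustration. Each group of weights is stored as 4-bit codes plus one scale and one offset.

```python
def quantize_group(values: list[float], bits: int = 4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1                # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Clamp to guard against floating-point rounding at the edges.
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes: list[int], scale: float, zero: float):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]   # hypothetical weight group
codes, scale, zero = quantize_group(weights)
restored = dequantize_group(codes, scale, zero)

# Reconstruction error is bounded by about half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_err:.4f}, half step {scale / 2:.4f}")
```

Storing 4 bits per weight instead of 16 cuts weight traffic from memory roughly fourfold, which is the main reason 4-bit decoding can outrun FP16 on memory-bound workloads.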
5. SGLang: RadixAttention for Structured Workloads
- Design: Implements RadixAttention, a prefix-sharing mechanism using a radix tree for KV storage. Also supports a DSL for structured LLM programs (e.g., agents, RAG).
- Performance:
  - 6.4× higher throughput and 3.7× lower latency than vLLM on structured workloads.
  - 50–99% KV cache hit rates for multi-turn chats or repeated contexts.
- KV & Memory Behavior: Focuses on reuse via radix trees, integrating with hierarchical caching systems (GPU/CPU).
- Use Case: Agentic systems, RAG pipelines, and toolchains with heavy prefix reuse.
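Prefix reuse can be illustrated with a small token trie. This is a simplification made for the example: it uses one node per token rather than a compressed radix tree, it has no eviction, and the class name `PrefixCache` and its method are hypothetical rather than SGLang's API. It tracks how many prompt tokens were already cached, which is the hit rate the figures above describe.

```python
class PrefixCache:
    """Toy prefix cache: a trie keyed by token id, one node per token."""

    def __init__(self):
        self.root = {}
        self.matched_tokens = 0
        self.total_tokens = 0

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were cached, then cache the rest."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.get(tok)
            if child is None:
                # Past the shared prefix: every later node is new as well.
                child = node[tok] = {}
            else:
                matched += 1
            node = child
        self.matched_tokens += matched
        self.total_tokens += len(tokens)
        return matched

    def hit_rate(self) -> float:
        if self.total_tokens == 0:
            return 0.0
        return self.matched_tokens / self.total_tokens


cache = PrefixCache()
system = [1, 2, 3, 4]                          # shared system prompt tokens
print(cache.match_and_insert(system + [5, 6]))     # 0: cold cache
print(cache.match_and_insert(system + [7, 8]))     # 4: system prompt reused
print(cache.match_and_insert(system + [5, 6, 9]))  # 6: earlier turn reused
```

In agent and RAG workloads most requests share long system prompts, tool descriptions, or earlier turns, so the matched share of each prompt tends to be high.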
6. DeepSpeed Inference / ZeRO Inference: Offloading for Large Models
- Design: Uses ZeRO Inference to offload model weights and KV cache to CPU/NVMe for running models exceeding GPU memory.
- Performance:
  - About 43 tokens/second with full CPU offload and about 30 tokens/second with full NVMe offload on a single 32GB V100.
  - Full offload is 1.3–2.4× faster than partial offload because freeing GPU memory allows larger batch sizes.
- KV & Memory Behavior: Offloads weights/KV to reduce GPU memory usage, but suffers from high TTFT/P99 latency.
- Use Case: Very large models on constrained GPUs (e.g., 32GB V100) for offline/batch inference.
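The offloading decision comes down to whether the weights fit in each memory tier. The sketch below is simplified capacity arithmetic, not DeepSpeed's API: the function name `choose_weight_tier`, the three-tier model, and the example sizes are assumptions, and real ZeRO Inference streams layers between tiers rather than making a single placement choice.

```python
GB = 10**9


def choose_weight_tier(num_params: int, gpu_free_bytes: int,
                       cpu_free_bytes: int, bytes_per_param: int = 2) -> str:
    """Pick the fastest memory tier that can hold all model weights."""
    need = num_params * bytes_per_param  # 2 bytes per parameter for FP16
    if need <= gpu_free_bytes:
        return "gpu"
    if need <= cpu_free_bytes:
        return "cpu"
    return "nvme"


# A hypothetical 30B-parameter FP16 model needs about 60 GB for weights
# alone, which exceeds a 32 GB GPU, so the weights must live elsewhere.
print(choose_weight_tier(30 * 10**9, 32 * GB, 128 * GB))  # cpu
print(choose_weight_tier(30 * 10**9, 32 * GB, 48 * GB))   # nvme
print(choose_weight_tier(7 * 10**9, 32 * GB, 128 * GB))   # gpu
```

Every token generated then pays for moving weights across PCIe or from disk, which is why this approach suits offline and batch jobs rather than interactive serving.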
Comparison Summary Table
| Runtime | Main Design Idea | Relative Strength | KV Strategy | Typical Use Case |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | High throughput at given TTFT | Paged KV blocks, FP8 support | General-purpose GPU serving |
| TensorRT LLM | Compiled kernels + KV reuse | Low latency, high throughput on NVIDIA | Paged/quantized KV, offload | NVIDIA-only, latency-sensitive workloads |
| TGI v3 | Long prompt pipeline | Strong long-prompt performance | Paged KV, chunked prefill | Hugging Face APIs, long chat histories |
| LMDeploy | TurboMind + blocked KV | 1.8× vLLM request throughput; 4-bit 2.4× faster than FP16 | Blocked KV, weight/KV quantization | NVIDIA deployments, raw throughput |
| SGLang | RadixAttention + structured programs | 6.4× throughput for structured workloads | Radix tree KV reuse | Agents, RAG, high prefix reuse |
| DeepSpeed | Offloading for huge models | Enables large models on small GPUs | CPU/NVMe offload | Offline inference, low-QPS services |
Practical Recommendations
- Default Choice: Use vLLM for balanced throughput, TTFT, and hardware flexibility.
- NVIDIA Focus: Opt for TensorRT LLM if fine-grained latency control and NVIDIA-specific tuning are needed.
- Hugging Face Integration: Choose TGI v3 for long chat workflows with prefix caching.
- Max Throughput: Use LMDeploy with TurboMind for 4-bit models on NVIDIA GPUs.
- Structured Workloads: Deploy SGLang for agents/RAG with high prefix reuse.
- Large Models on Small GPUs: Use DeepSpeed/ZeRO for offloading, accepting higher latency.
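These recommendations can be written down as a small decision helper. The function name, its flags, and especially the order in which the conditions are checked are assumptions made for this sketch; the article lists the recommendations without ranking them, so treat the precedence as one reasonable reading rather than a rule.

```python
def recommend_runtime(*, model_fits_gpu: bool = True,
                      heavy_prefix_reuse: bool = False,
                      nvidia_only: bool = False,
                      latency_critical: bool = False,
                      four_bit_max_throughput: bool = False,
                      long_chat_on_hugging_face: bool = False) -> str:
    """Map coarse workload traits to one of the six runtimes."""
    if not model_fits_gpu:
        return "DeepSpeed/ZeRO Inference"  # offload, accept higher latency
    if heavy_prefix_reuse:
        return "SGLang"                    # agents, RAG, shared prefixes
    if nvidia_only and latency_critical:
        return "TensorRT LLM"              # per-model compiled engines
    if nvidia_only and four_bit_max_throughput:
        return "LMDeploy"                  # TurboMind with 4-bit models
    if long_chat_on_hugging_face:
        return "TGI v3"                    # long prompts, prefix caching
    return "vLLM"                          # balanced default


print(recommend_runtime())                                   # vLLM
print(recommend_runtime(heavy_prefix_reuse=True))            # SGLang
print(recommend_runtime(nvidia_only=True, latency_critical=True))
```

In practice several traits apply at once, so a helper like this is a starting point for a benchmark shortlist, not a substitute for measuring on the target workload.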
Conclusion
All six runtimes prioritize KV cache optimization as the critical bottleneck for LLM serving. The most effective solutions treat KV as a first-class data structure, employing techniques like paging, quantization, reuse, and offloading to maximize throughput and minimize latency. The choice depends on workload type, hardware constraints, and performance priorities.
Reference: Comparing the Top 6 Inference Runtimes for LLM Serving in 2025