PrfaaS: Scaling LLM Serving via Cross-Datacenter Prefill-as-a-Service
These articles are AI-generated summaries. Please check the original sources for full details.
Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale
Researchers from Moonshot AI and Tsinghua University have introduced Prefill-as-a-Service (PrfaaS) to decouple LLM prefill and decode across physically separate datacenters. The system achieves 54% higher throughput than homogeneous baselines by exploiting hybrid attention models that reduce KVCache size by up to 36x.
Why This Matters
Conventional LLM serving is constrained by the massive KVCache of Grouped Query Attention (GQA) models, which requires RDMA-class interconnects and limits prefill-decode disaggregation to a single datacenter. As models transition to hybrid architectures that use MLA or Kimi Delta Attention, the KVCache footprint drops significantly, bringing the required transfer bandwidth down to 3.19 Gbps for a 1T-parameter model at 32K tokens. This architectural shift enables the use of commodity Ethernet for cross-datacenter cache transfers, allowing providers to centralize compute-heavy prefill on high-end H200 clusters while offloading memory-bound decode to geographically distributed H20 nodes.
Key Insights
- Hybrid attention stacks in models like Ring-2.5-1T (2026) combine MLA and Kimi Delta Attention to achieve a 36x reduction in KVCache memory compared to standard GQA architectures.
- Length-based threshold routing with an optimal threshold of 19.4K tokens directs long-context requests to standalone PrfaaS clusters while processing short requests locally.
- The PrfaaS storage subsystem utilizes a distributed hybrid prefix cache pool that separates fixed-size recurrent states from linearly growing full-attention KVCache blocks.
- Prefill pipelining and multi-connection TCP transport enable KVCache transmission over 100 Gbps Ethernet links with only 13% egress utilization for a 32-GPU H200 cluster.
- Dual-timescale scheduling manages bursty traffic by monitoring PrfaaS queue depth at short intervals and rebalancing node counts in the local cluster over longer periods.
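The routing and scheduling ideas in the list above can be sketched roughly as follows. This is a minimal illustration, not the PrfaaS implementation: the 19.4K-token threshold comes from the summary, while `ClusterState`, `max_queue_depth`, the fallback-to-local behavior, and the `rebalance` heuristic with its `gap` parameter are hypothetical names and choices made for the example.

```python
from dataclasses import dataclass

# Threshold quoted in the summary; everything else below is an assumption.
THRESHOLD_TOKENS = 19_400


@dataclass
class ClusterState:
    prfaas_queue_depth: int    # requests currently waiting at the PrfaaS cluster
    max_queue_depth: int = 8   # hypothetical congestion limit


def route(prompt_tokens: int, state: ClusterState,
          threshold: int = THRESHOLD_TOKENS) -> str:
    """Short-timescale decision: pick where a request is prefilled."""
    if prompt_tokens < threshold:
        return "local"   # short prompts stay in the local cluster
    if state.prfaas_queue_depth >= state.max_queue_depth:
        return "local"   # assumed fallback when the remote queue is congested
    return "prfaas"      # long prompts go to the standalone prefill cluster


def rebalance(prefill_nodes: int, decode_nodes: int,
              prefill_util: float, decode_util: float,
              gap: float = 0.2) -> tuple[int, int]:
    """Long-timescale decision: shift one local node toward the busier role.

    Hypothetical heuristic: move a node only when utilization differs by
    more than `gap`, and always keep at least one node in each role.
    """
    if prefill_util - decode_util > gap and decode_nodes > 1:
        return prefill_nodes + 1, decode_nodes - 1
    if decode_util - prefill_util > gap and prefill_nodes > 1:
        return prefill_nodes - 1, decode_nodes + 1
    return prefill_nodes, decode_nodes
```

In this sketch `route` would run per request against a frequently refreshed queue depth, while `rebalance` would run on a much slower cadence, which mirrors the two timescales described above.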
Practical Applications
- Large-scale inference providers using NVIDIA H200 and H20 GPUs: Deploying PrfaaS can reduce mean Time to First Token (TTFT) by 50% for long-context requests. Pitfall: Using naive routing without congestion monitoring leads to unstable queuing and stalled compute during Ethernet bursts.
- Hybrid-architecture model serving (e.g., Qwen3.5-397B): Leveraging MLA and SWA layers allows cross-cluster cache transfers over standard VPC networks. Pitfall: Implementing cross-datacenter PD for traditional GQA models results in a 60 Gbps bandwidth requirement that exceeds commodity Ethernet capacity.
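The bandwidth pitfall above comes down to simple arithmetic, sketched here under stated assumptions. The 3.19 Gbps, 60 Gbps, and 100 Gbps figures are the ones quoted in this summary; the function names and the 50% utilization budget (`max_utilization`) are hypothetical and only illustrate leaving headroom for bursts and other traffic.

```python
def link_utilization(required_gbps: float, link_gbps: float) -> float:
    """Fraction of the link consumed by KVCache transfer traffic."""
    return required_gbps / link_gbps


def fits_link(required_gbps: float, link_gbps: float,
              max_utilization: float = 0.5) -> bool:
    """True if the transfer stays within a utilization budget.

    max_utilization is an assumed headroom factor, not a value from the source.
    """
    return link_utilization(required_gbps, link_gbps) <= max_utilization


# Figures quoted in the summary, checked against a 100 Gbps Ethernet link.
hybrid_ok = fits_link(3.19, 100.0)   # hybrid-attention model at 32K tokens
gqa_ok = fits_link(60.0, 100.0)      # traditional GQA model
```

With these inputs the hybrid-attention case fits comfortably inside the assumed budget and the GQA case does not; a different budget would move that boundary.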