PrfaaS: Scaling LLM Serving via Cross-Datacenter Prefill-as-a-Service

Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale

Researchers from Moonshot AI and Tsinghua University have introduced Prefill-as-a-Service (PrfaaS) to decouple LLM prefill and decode across physically separate datacenters. The system achieves 54% higher throughput than homogeneous baselines by exploiting hybrid attention models that reduce KVCache size by up to 36x.

Why This Matters

Conventional LLM serving is constrained by the massive KVCache of Grouped Query Attention (GQA) models, which requires RDMA-class interconnects and limits prefill-decode disaggregation to a single datacenter. As models transition to hybrid architectures that use MLA or Kimi Delta Attention, the KVCache shrinks enough that the required transfer bandwidth drops to 3.19 Gbps for a 1T-parameter model at 32K tokens. This architectural shift enables the use of commodity Ethernet for cross-datacenter cache transfers, allowing providers to centralize compute-heavy prefill on high-end H200 clusters while offloading memory-intensive decode to geographically distributed H20 nodes.
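To make the bandwidth argument concrete, here is a minimal Python sketch of the arithmetic. It assumes the KVCache is streamed while prefill runs, so the required rate is simply cache size divided by prefill time. The function names are hypothetical, and treating each quoted rate (3.19 Gbps for the hybrid model, 60 Gbps for GQA) as the demand of one prefill instance on a 100 Gbps link is an assumption the summary does not confirm.

```python
import math

# Figures quoted in this article; treating them as per-instance demand is an assumption.
HYBRID_GBPS = 3.19   # hybrid-attention 1T-parameter model at 32K tokens
GQA_GBPS = 60.0      # traditional GQA model
LINK_GBPS = 100.0    # Ethernet link between datacenters


def transfer_gbps(kv_bytes: float, prefill_seconds: float) -> float:
    """Average rate (Gbps) needed to ship a KVCache within the prefill time."""
    return kv_bytes * 8 / prefill_seconds / 1e9


def instances_per_link(link_gbps: float, per_instance_gbps: float) -> int:
    """How many instances at a given rate fit on one link, ignoring protocol overhead."""
    return math.floor(link_gbps / per_instance_gbps)


if __name__ == "__main__":
    print(instances_per_link(LINK_GBPS, HYBRID_GBPS))  # hybrid instances per link
    print(instances_per_link(LINK_GBPS, GQA_GBPS))     # GQA instances per link
```

Under these assumptions a single link carries dozens of hybrid-model instances but only one GQA instance, which is the gap the article attributes to the smaller cache.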

Key Insights

Hybrid attention stacks in models like Ring-2.5-1T (2026) combine MLA and Kimi Delta Attention to achieve a 36x reduction in KVCache memory compared to standard GQA architectures.
Length-based threshold routing with an optimal threshold of 19.4K tokens directs long-context requests to standalone PrfaaS clusters while processing short requests locally.
The PrfaaS storage subsystem utilizes a distributed hybrid prefix cache pool that separates fixed-size recurrent states from linearly growing full-attention KVCache blocks.
Prefill pipelining and multi-connection TCP transport enable KVCache transmission over 100 Gbps Ethernet links with only 13% egress utilization for a 32-GPU H200 cluster.
Dual-timescale scheduling manages bursty traffic by monitoring PrfaaS queue depth at short intervals and rebalancing node counts in the local cluster over longer periods.
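The routing and scheduling ideas above can be sketched in a few lines of Python. This is an illustrative sketch, not the PrfaaS implementation: the 19.4K threshold comes from the article, while the class name, the queue-depth limit, the node counts, and the utilization-margin rebalancing rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DualTimescaleScheduler:
    threshold_tokens: int = 19_400  # routing threshold reported in the article
    max_queue_depth: int = 8        # hypothetical PrfaaS backlog limit
    prefill_nodes: int = 4          # hypothetical local-cluster split
    decode_nodes: int = 4

    def route(self, prompt_tokens: int, prfaas_queue_depth: int) -> str:
        """Short timescale: decide per request where prefill runs."""
        if prompt_tokens <= self.threshold_tokens:
            return "local"          # short requests stay in the local cluster
        if prfaas_queue_depth >= self.max_queue_depth:
            return "local"          # PrfaaS is backed up, fall back to local prefill
        return "prfaas"

    def rebalance(self, prefill_util: float, decode_util: float,
                  margin: float = 0.2) -> None:
        """Long timescale: move one local node toward the busier role."""
        if prefill_util - decode_util > margin and self.decode_nodes > 1:
            self.decode_nodes -= 1
            self.prefill_nodes += 1
        elif decode_util - prefill_util > margin and self.prefill_nodes > 1:
            self.prefill_nodes -= 1
            self.decode_nodes += 1


if __name__ == "__main__":
    sched = DualTimescaleScheduler()
    print(sched.route(32_000, prfaas_queue_depth=0))
    sched.rebalance(prefill_util=0.9, decode_util=0.5)
    print(sched.prefill_nodes, sched.decode_nodes)
```

In a real deployment `route` would run on every request with a freshly sampled queue depth, while `rebalance` would run on a much slower timer. Keeping the two separate is what stops bursty traffic from causing constant node reshuffling.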

Practical Applications

Large-scale inference providers using NVIDIA H200 and H20 GPUs: Deploying PrfaaS can reduce mean Time to First Token (TTFT) by 50% for long-context requests. Pitfall: Naive routing without congestion monitoring leads to unstable queuing and stalled compute during Ethernet traffic bursts.
Hybrid-architecture model serving (e.g., Qwen3.5-397B): Leveraging MLA and SWA layers allows cross-cluster cache transfers over standard VPC networks. Pitfall: Implementing cross-datacenter PD for traditional GQA models results in 60 Gbps throughput requirements that exceed commodity Ethernet capacity.

References:

https://www.marktechpost.com/2026/04/19/moonshot-ai-and-tsinghua-researchers-propose-prfaas-a-cross-datacenter-kvcache-architecture-that-rethinks-how-llms-are-served-at-scale/
