According to Dynasty Beating's monitoring, Moonshot AI and Tsinghua University released a new paper titled "Prefill-as-a-Service" on arXiv on April 16, proposing to run the prefill phase of large-model inference across data centers.
Large-model inference consists of two steps: prefill reads the entire input in one pass and builds a KV cache; decode then emits the output token by token, attending over that cache at every step. The two steps stress completely different hardware: prefill is compute-bound, while decode is bound by GPU memory bandwidth. The industry's mainstream answer is to split the two steps onto different machines (prefill-decode, or PD, separation), but until now this has required both sides to sit in the same data center on an RDMA interconnect, because a dense-attention model's KV cache streams out at tens of gigabits per second, and any transfer stall leaves GPUs idle.
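A minimal sketch of the two phases may make this concrete. The `model` interface below (a callable returning logits plus an updated KV cache) is a hypothetical stand-in, not any particular framework's API:

```python
import torch

def prefill(model, input_ids):
    """Compute-bound phase: one forward pass over the whole prompt.

    Returns the first generated token and the KV cache that decode reuses."""
    logits, kv_cache = model(input_ids, kv_cache=None)
    next_token = logits[:, -1].argmax(dim=-1)  # greedy pick, for simplicity
    return next_token, kv_cache

def decode(model, first_token, kv_cache, max_new_tokens=256, eos_id=2):
    """Bandwidth-bound phase: one token per step, and every step
    re-reads the entire KV cache to attend over all previous tokens."""
    tokens = [first_token]
    for _ in range(max_new_tokens):
        logits, kv_cache = model(tokens[-1].unsqueeze(-1), kv_cache=kv_cache)
        next_token = logits[:, -1].argmax(dim=-1)
        tokens.append(next_token)
        if (next_token == eos_id).all():
            break
    return torch.stack(tokens, dim=-1)
```

PD separation simply runs `prefill` and `decode` on different machines, which is why the `kv_cache` handoff between them becomes the critical transfer.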
The turning point comes from the new generation of hybrid-attention models. The paper experimentally evaluated models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T, which combine a small number of full-attention layers with a large number of linear-attention layers, cutting KV cache traffic by roughly an order of magnitude; Ring-2.5-1T reaches an overall compression ratio of 36×. At that point the KV cache no longer needs a dedicated RDMA network and can travel over ordinary Ethernet.
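A back-of-envelope calculation shows where the savings come from: linear-attention layers keep a fixed-size state instead of per-token K/V entries, so only the full-attention layers contribute cache that grows with context length. All layer counts and dimensions below are illustrative assumptions, not the evaluated models' real configurations:

```python
def kv_bytes_per_token(full_attn_layers, kv_heads, head_dim, dtype_bytes=2):
    # Each full-attention layer stores one K and one V vector per token (bf16 = 2 bytes).
    return full_attn_layers * 2 * kv_heads * head_dim * dtype_bytes

# Hypothetical 60-layer model with 8 KV heads of dimension 128.
dense  = kv_bytes_per_token(full_attn_layers=60, kv_heads=8, head_dim=128)
hybrid = kv_bytes_per_token(full_attn_layers=15, kv_heads=8, head_dim=128)  # 3:1 linear:full mix

print(dense)           # 245760 bytes, about 240 KiB per token
print(hybrid)          # 61440 bytes, about 60 KiB per token
print(dense / hybrid)  # 4.0x from the layer ratio alone
```

The layer ratio alone yields 4× in this toy setup; the 36× reported for Ring-2.5-1T suggests further savings (fewer KV heads, smaller head dimensions, or sparser placement of full-attention layers) stacked on top of it.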
The specifics of PrfaaS: stand up an independent "prefill cluster" and route only long-context, cache-miss requests to it, keeping short requests in the local PD cluster; once prefill finishes, the KV cache is shipped back to the local cluster over Ethernet for decode. The design introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool. In tests on an internal 1T-parameter hybrid model (built on the Kimi Linear architecture), overall serving throughput came out 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous setup, with each machine consuming only a moderate share of inter-data-center bandwidth.
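The length-threshold routing rule is easy to state in code. The sketch below is an illustration under assumed names and cutoffs, not the paper's implementation; the bandwidth check only gestures at the bandwidth-aware scheduler rather than reproducing it:

```python
from dataclasses import dataclass

LENGTH_THRESHOLD = 8192  # assumed cutoff in tokens; a real deployment would tune this
MIN_LINK_GBPS = 10.0     # assumed spare inter-DC bandwidth needed to ship the KV cache

@dataclass
class Request:
    prompt_tokens: int
    cached_prefix_tokens: int  # prompt prefix already present in the prefix cache pool

def route(req: Request, spare_link_gbps: float) -> str:
    """Send only long, cache-missing prefills to the remote prefill cluster;
    everything else stays in the local PD cluster."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    if uncached > LENGTH_THRESHOLD and spare_link_gbps >= MIN_LINK_GBPS:
        return "remote_prefill_cluster"  # heavy prefill; KV cache ships back over Ethernet
    return "local_pd_cluster"            # short or mostly cached: transfer isn't worth it
```

The point of the threshold is that the Ethernet round trip only pays off when the remote cluster absorbs enough prefill compute to offset the transfer cost.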
