
Luo Fuli Breaks Down MiMo's Costs: Prefill Attention Compute Cut to the Level of a 10-Layer Global GQA Model

According to monitoring by Insight Beating, after the API prices of Xiaomi's in-house MiMo-V2.5 series of large models were permanently reduced, Luo Fuli, head of Xiaomi's large model team, explained the algorithmic mechanisms behind the cost reduction on X.

Luo Fuli said that even after aligning its API prices with DeepSeek's, Xiaomi's inference engine can still break even under high load. The savings come mainly from the hybrid attention architecture and hierarchical KV cache optimization.

To meet the design goal of cutting cache-hit costs by 99%, Xiaomi's inference framework applies hierarchical KV cache optimization to Sliding Window Attention (SWA). In production tests, the hierarchical optimization increased cached-token capacity fivefold, which cut cache costs by 80%. Combined with Cache Read Overlap between global attention modules, this further reduced the actual overhead of a cache hit.
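The link between the fivefold capacity increase and the 80% cost reduction is simple arithmetic, sketched below in Python. The sketch assumes that cache cost scales with storage per cached token, so holding N times as many tokens in the same budget divides the per-token cost by N. The function name `cache_cost_reduction` is illustrative and is not part of any Xiaomi framework.

```python
def cache_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token cache cost when the same storage
    budget holds `capacity_multiplier` times as many tokens.

    Assumption: cost is proportional to storage used per cached token.
    """
    if capacity_multiplier <= 0:
        raise ValueError("capacity_multiplier must be positive")
    return 1.0 - 1.0 / capacity_multiplier


# Fivefold capacity, as reported for the hierarchical SWA cache.
reduction = cache_cost_reduction(5)
print(f"{reduction:.0%}")  # prints 80%
```

Under this assumption the per-token cost falls to one fifth of its previous level, which matches the 80% figure quoted above.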

As for the 60% to 80% reduction in base input and output costs, Luo Fuli attributed it to the model's 1:7 inter-layer sparsity ratio, that is, the ratio of Global Attention (GA) layers to Sliding Window Attention (SWA) layers. During long-context prefill, the 60 SWA layers compute attention only within a local sliding window, so the overall attention computation of the 70-layer MiMo-V2.5-Pro model is equivalent to that of a traditional 10-layer global GQA model. This very low compute load lowered the underlying inference cost, leaving Xiaomi with a two- to threefold profit margin before the price adjustment. The price cut therefore reflects a structural cost reduction rather than loss-making competition.
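A rough way to see why the SWA layers add so little prefill cost is to count causal query-key pairs per layer. The Python sketch below uses the layer counts quoted above (10 global layers and 60 sliding-window layers). The prompt length and window size are hypothetical values chosen for illustration and do not come from Luo Fuli's post. The count also ignores attention heads, feed-forward layers, and memory traffic.

```python
def global_pairs(n: int) -> int:
    """Causal query-key pairs in one global-attention layer over n tokens."""
    return n * (n + 1) // 2


def window_pairs(n: int, w: int) -> int:
    """Causal query-key pairs in one sliding-window layer.

    Each token attends to itself and at most w - 1 preceding tokens.
    """
    w = min(w, n)
    # The first w tokens see 1, 2, ..., w keys; the remaining n - w see w each.
    return w * (w + 1) // 2 + (n - w) * w


def equivalent_global_layers(n: int, w: int, ga_layers: int, swa_layers: int) -> float:
    """Attention work of the hybrid stack, in units of one global layer."""
    total = ga_layers * global_pairs(n) + swa_layers * window_pairs(n, w)
    return total / global_pairs(n)


# Hypothetical prompt length and window size; layer counts as quoted in the article.
PROMPT_TOKENS = 131_072
WINDOW = 128
print(equivalent_global_layers(PROMPT_TOKENS, WINDOW, ga_layers=10, swa_layers=60))
```

With these assumed values, the 60 SWA layers together contribute only a small fraction of one global layer's work, so the result sits just above 10. As the window grows toward the prompt length, the figure approaches the full 70 layers.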

Luo Fuli stated that low-cost inference services help stimulate end-user demand for AI. Large-model companies should avoid blind price wars and instead keep actual operating costs below the break-even point through bottom-up co-design of algorithms and inference systems.
