According to monitoring by Insight Beating, after Xiaomi permanently cut API prices for its in-house MiMo-V2.5 series of large models, Luo Fuli, who leads Xiaomi's large model team, explained the algorithmic mechanisms behind the cost reduction on the X platform.
Luo Fuli revealed that even after aligning its API prices with DeepSeek's, Xiaomi's inference engine can still break even under high load. The cost reduction comes mainly from the hybrid attention architecture and hierarchical KV cache optimization.
To meet the design goal of reducing cache hit costs by 99%, Xiaomi's inference framework implemented hierarchical KV cache optimization for Sliding Window Attention (SWA). Production tests showed that the hierarchical optimization increased cached-token capacity fivefold, cutting cache costs by 80%. Combined with Cache Read Overlap between global attention modules, the system further reduced the actual overhead of cache hits.
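The two production figures are linked by simple arithmetic: if the same cache memory holds five times as many tokens, the cost per cached token falls to one fifth, which is an 80% reduction. A minimal sketch of that relationship, assuming cache cost scales linearly with memory per cached token (the function name is illustrative and not part of Xiaomi's framework):

```python
def cache_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token cache cost when the same memory
    holds `capacity_multiplier` times as many tokens.

    Assumption: cost is proportional to memory used per cached token.
    """
    if capacity_multiplier <= 0:
        raise ValueError("capacity_multiplier must be positive")
    return 1.0 - 1.0 / capacity_multiplier


# The fivefold capacity gain reported in the article.
reduction = cache_cost_reduction(5)
print(f"{reduction:.0%}")  # prints "80%"
```

The remaining gap between 80% and the 99% design goal is attributed in the article to Cache Read Overlap, which this sketch does not model.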
As for the 60% to 80% reduction in basic input and output costs, Luo Fuli attributed it to the model's 1:7 inter-layer sparsity ratio, meaning the ratio of Global Attention (GA) layers to Sliding Window Attention (SWA) layers is 1:7. During the long-text prefill stage, the 60 SWA layers compute attention only within local sliding windows, so the overall attention computation of the 70-layer MiMo-V2.5-Pro model is roughly equivalent to that of a traditional 10-layer global GQA model. This much lower computational load cut baseline inference costs, leaving Xiaomi with a 2x to 3x profit margin before the price adjustment. The price cut therefore reflects a structural reduction in cost rather than loss-making competition.
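The "equivalent to a 10-layer model" claim can be checked with a rough count of query-key pairs per layer: a global layer scales with the square of the sequence length, while a sliding-window layer scales with sequence length times window size. The sketch below uses the layer counts from the article; the sequence length and window size are hypothetical values chosen for illustration, since the article states neither, and constant factors such as causal masking are ignored.

```python
def attention_cost(seq_len: int, global_layers: int,
                   swa_layers: int, window: int) -> int:
    """Relative attention cost, counted as query-key pairs summed over layers."""
    global_cost = global_layers * seq_len * seq_len
    # A sliding-window layer never attends to more than `window` keys per query.
    local_cost = swa_layers * seq_len * min(window, seq_len)
    return global_cost + local_cost


SEQ_LEN = 128_000  # hypothetical long-context prefill length
WINDOW = 128       # hypothetical window size, not given in the article

hybrid = attention_cost(SEQ_LEN, global_layers=10, swa_layers=60, window=WINDOW)
dense_10 = attention_cost(SEQ_LEN, global_layers=10, swa_layers=0, window=WINDOW)
dense_70 = attention_cost(SEQ_LEN, global_layers=70, swa_layers=0, window=WINDOW)

print(f"hybrid vs 10-layer global: {hybrid / dense_10:.3f}x")
print(f"hybrid vs 70-layer global: {hybrid / dense_70:.3f}x")
```

Under these assumptions the 60 SWA layers add well under 1% on top of the 10 global layers, so the hybrid stack costs about one seventh of a fully global 70-layer stack in attention compute. The gap narrows as the window grows relative to the sequence length.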
Luo Fuli stated that low-cost inference services help stimulate end-user demand for intelligent services. Large model companies should avoid blind price wars and instead co-design algorithms and inference systems from the bottom up, keeping actual operating costs below the break-even point.
