
Luo Fuli Breaks Down MiMo's Costs: Prefill Attention Computation Reduced to the Level of a 10-Layer Global GQA Model

According to Insight Beating monitoring, following the permanent price cut for the API of Xiaomi's in-house MiMo-V2.5 model series, Luo Fuli, head of Xiaomi's large model team, explained the algorithmic mechanisms behind the cost reduction on X.

Luo Fuli said that even after aligning its API prices with DeepSeek's, Xiaomi's inference engine can still break even under high load. The savings come mainly from the hybrid attention architecture and hierarchical KV cache optimization.

To meet the design goal of cutting Cache Hit costs by 99%, Xiaomi's inference framework implemented hierarchical KV cache optimization for Sliding Window Attention (SWA). In production tests, this optimization increased the cache's token capacity fivefold, reducing cache costs by 80%. Combined with Cache Read Overlap between global attention modules, the system further reduced the actual overhead of cache hits.
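The link between the fivefold capacity gain and the 80% figure can be checked with simple arithmetic. The sketch below assumes, as a simplification not stated in the source, that cache cost per token is inversely proportional to how many tokens the same memory budget can hold; the function name is hypothetical.

```python
def cache_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token cache cost when the same memory
    budget holds `capacity_multiplier` times as many tokens.

    Assumption (not from the source): cost per cached token scales
    as 1 / capacity.
    """
    if capacity_multiplier <= 0:
        raise ValueError("capacity_multiplier must be positive")
    return 1.0 - 1.0 / capacity_multiplier


# Fivefold capacity, as reported from the production tests.
print(f"{cache_cost_reduction(5):.0%}")
```

Under that assumption, a fivefold capacity increase leaves one fifth of the original per-token cost, which matches the reported 80% reduction.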

Regarding the 60% to 80% reduction in basic input and output costs, Luo Fuli attributed it to the model's 1:7 inter-layer sparsity ratio, that is, the ratio of Global Attention (GA) layers to Sliding Window Attention (SWA) layers. During the long-text prefill stage, the 60 SWA layers compute only local sliding windows, making the overall attention computation of the 70-layer MiMo-V2.5-Pro model equivalent to that of a traditional 10-layer global GQA model. This very low computational load cut raw inference costs, leaving Xiaomi with a profit margin of 2 to 3 times before the price adjustment. The price cut therefore reflects structural cost reduction rather than loss-making competition.
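A rough way to see why 60 SWA layers add almost nothing to long-context prefill is to count causal query-key pairs per layer. The sketch below is an illustration under stated assumptions, not Xiaomi's cost model: it takes the article's layer counts (70 total, 60 SWA), uses a hypothetical 128-token window and a hypothetical 131,072-token prompt, and ignores projections, MLP layers, and memory traffic.

```python
def causal_pairs_global(n: int) -> int:
    """Query-key pairs scored by one causal global-attention layer."""
    return n * (n + 1) // 2


def causal_pairs_swa(n: int, window: int) -> int:
    """Query-key pairs scored by one causal sliding-window layer.

    Token i (1-indexed) attends to min(i, window) keys.
    """
    w = min(n, window)
    return w * (w + 1) // 2 + (n - w) * w


def equivalent_global_layers(ga_layers: int, swa_layers: int,
                             n: int, window: int) -> float:
    """Attention-score work expressed in units of one global layer."""
    ratio = causal_pairs_swa(n, window) / causal_pairs_global(n)
    return ga_layers + swa_layers * ratio


TOTAL_LAYERS = 70                       # from the article
SWA_LAYERS = 60                         # from the article
GA_LAYERS = TOTAL_LAYERS - SWA_LAYERS   # remaining layers treated as global
SEQ_LEN = 131_072                       # hypothetical prompt length
WINDOW = 128                            # hypothetical window size

equiv = equivalent_global_layers(GA_LAYERS, SWA_LAYERS, SEQ_LEN, WINDOW)
print(f"{equiv:.2f} global-layer equivalents")
```

With these assumed values the SWA layers together contribute only a small fraction of one global layer's attention work, so the total stays close to 10 global-layer equivalents. The gap widens as the prompt grows, because global attention cost grows quadratically with length while sliding-window cost grows linearly.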

Luo Fuli stated that low-cost inference services help stimulate end-user demand for AI. Large model companies should avoid blind price wars and instead co-design algorithms and inference systems from the bottom up, keeping actual operating costs below the break-even point.
