DeepSeek and Xiaomi together usher in the era of zero-cost million-token context, accelerating the spread of Agent scenarios

According to DynaWatch Beating monitoring, after Alibaba's Qwen team enabled implicit caching for Qwen3.7-Max, which discounts input costs by up to 80%, Xiaomi announced a permanent price cut for the API of its self-developed MiMo-V2.5 series, fully matching the pricing of the DeepSeek V4 series. For the flagship models, cache-hit input is priced at $0.0036 per million tokens, cache-miss input at $0.435, and output at $0.87. The price matching is aimed at capturing global developer traffic and accelerating the adoption of AI Agent scenarios.

Within a month of its April 24 release, DeepSeek V4 Flash topped the OpenRouter monthly leaderboard with 7.99 trillion tokens consumed, and V4 Pro ranked in the top ten. In Agent programming scenarios that read code repositories at high frequency, such as Cursor and Claude Code, developers using the Pro model see a 99% prefix cache hit rate and spend only 4 RMB on 80 million tokens, while one day of usage on the Flash model consumed 27.8 billion tokens for just $160.
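To see why the cache hit rate dominates the bill, the arithmetic can be sketched as follows. This is a minimal illustration, not a provider billing formula: it assumes billing is linear in token counts, uses the flagship per-million-token prices quoted above, and the function names `blended_input_price` and `estimate_cost_usd` are hypothetical helpers invented for this example.

```python
# Flagship prices quoted in the article, in USD per million tokens.
CACHE_HIT_PRICE = 0.0036
CACHE_MISS_PRICE = 0.435
OUTPUT_PRICE = 0.87


def blended_input_price(cache_hit_rate: float) -> float:
    """Average input price (USD per million tokens) at a given cache hit rate.

    Assumption: every input token is billed at either the hit or the miss price.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * CACHE_HIT_PRICE + (1.0 - cache_hit_rate) * CACHE_MISS_PRICE


def estimate_cost_usd(input_tokens: int, output_tokens: int, cache_hit_rate: float) -> float:
    """Estimated total cost in USD for one workload (hypothetical linear billing)."""
    input_cost = input_tokens / 1_000_000 * blended_input_price(cache_hit_rate)
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE
    return input_cost + output_cost


if __name__ == "__main__":
    # 80 million input tokens, no output, with and without a 99% prefix cache hit rate.
    print(f"99% hits: ${estimate_cost_usd(80_000_000, 0, 0.99):.2f}")
    print(f"0% hits:  ${estimate_cost_usd(80_000_000, 0, 0.0):.2f}")
```

Under these assumptions, an input-only workload of 80 million tokens costs roughly $0.63 at a 99% hit rate versus $34.80 with no cache hits. Real bills also depend on output volume and on which model tier is used.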

By comparison, the automatic implicit cache of Alibaba's Qwen3.7-Max offers a discount of only 80%, while its explicit cache carries a 125% initial creation premium and a 5-minute lifespan. The high creation premium and short residency time imply that building and retaining the cache is expensive: per-token compute load and KV cache occupancy limit the room for price cuts.

Xiaomi and DeepSeek can afford to cut prices because of an underlying algorithmic dividend. In 1-million-token inference, DeepSeek V4 relies on Compressed Sparse Attention (CSA) and Highly Compressed Attention (HCA) to reduce inference FLOPs by 27% and KV cache space by 10% compared with the previous generation, achieving a more than hundredfold reduction relative to traditional GQA models. Xiaomi's MiMo-V2.5-Pro activates only 42B (4.1%) of its 1.02T total parameters, interleaves Sliding Window Attention (SWA) layers with Global Attention (GA) layers to cut long-context KV cache overhead sevenfold, and uses Multi-Token Prediction (MTP) to raise output throughput threefold. Both approaches squeeze out resource savings at the algorithmic level, heralding a low-cost era of widespread adoption.
