According to 1M AI News, Cloudflare recently announced that its Workers AI platform now supports large-model inference. The first model to go live is Kimi K2.5 from Moonshot AI, which supports a 256K context window, multi-turn tool calls, visual input, and structured output. The Agents SDK template now uses Kimi K2.5 as its default model.
Cloudflare has also been using Kimi K2.5 internally for day-to-day development: engineers have adopted it as the primary model for programming agents in the OpenCode environment and have integrated it into the automated code-review pipeline. One security-audit agent processes over 7 billion tokens per day and has identified more than 15 confirmed security issues in a single codebase. Cloudflare estimates that running the same workload on a mid-range commercial model would cost around $2.4 million per year; after switching to Kimi K2.5, that cost dropped by 77%.
The platform has also introduced three enhancements:
1. Prefix cache discount: Input tokens already processed in earlier turns of a multi-turn dialog are not re-billed at full price; cached tokens are charged at a discounted rate
2. Session affinity header: A new x-session-affinity request header routes all requests from the same session to the same model instance, improving the cache hit rate
3. Asynchronous batch inference API: Requests exceeding the synchronous rate limit can be queued for asynchronous execution; in internal testing, queued requests typically complete within 5 minutes, making this suitable for code scanning and other non-real-time research agents
