Google has open-sourced MTP speculative-decoding draft models for the full Gemma 4 series, achieving up to 3x speedup with no loss in quality.

According to Pulse Beating monitoring, Google has released and open-sourced multi-token prediction (MTP) draft models for the Gemma 4 series. These are lightweight auxiliary models built on a speculative-decoding architecture: they deliver up to 3x inference acceleration while the main model retains final say over every token, so output quality and logical reasoning ability are not compromised.

A standard large language model generates only one token at a time, so decoding is often bound by GPU memory bandwidth and leaves compute underutilized. The MTP approach lets a lightweight draft model use that idle compute to predict several future tokens ahead of time, which the heavy target model (for example, the 31B dense model) then verifies in a single parallel pass. If the target model agrees with the draft, the entire sequence is accepted at once. To further improve efficiency, the draft model directly shares the target model's activations and KV cache (which store historical context and avoid redundant computation); for the on-device E2B and E4B models, the team also introduced clustering in the embedding layer. A toy version of this draft-then-verify loop is sketched below.
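The sketch below is a minimal illustration of the draft-then-verify control flow, not the actual Gemma 4 models: the draft and target are stand-in functions over a tiny vocabulary, and verification is greedy. The draft proposes a few tokens, the target keeps the longest prefix it agrees with and overrides at the first mismatch, so the accepted output is exactly what the target alone would have produced, just in fewer heavy passes.

```python
# Toy sketch of greedy speculative decoding (draft-then-verify).
# draft_next / target_next are hypothetical stand-ins for the MTP draft
# model and the heavy target model; they share a seed so they usually agree.
import random

VOCAB = list(range(16))  # toy vocabulary of 16 token ids
K = 4                    # tokens drafted per verification step

def draft_next(context):
    """Cheap draft model: stand-in for the MTP head."""
    random.seed(sum(context) + 31 * len(context))
    return random.choice(VOCAB)

def target_next(context):
    """Expensive target model: stand-in for the 31B model.
    Agrees with the draft most of the time, disagrees ~20%."""
    random.seed(sum(context) + 31 * len(context))
    tok = random.choice(VOCAB)
    if random.random() < 0.2:
        tok = (tok + 1) % len(VOCAB)
    return tok

def speculative_step(context):
    # 1. Draft model proposes K tokens autoregressively (cheap, sequential).
    proposal, ctx = [], list(context)
    for _ in range(K):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model checks the drafted positions (one parallel pass in a
    #    real engine; a simple loop here). It accepts the longest agreeing
    #    prefix and substitutes its own token at the first mismatch.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # target overrides; stop accepting
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        # All K drafted tokens accepted: target appends one bonus token.
        accepted.append(target_next(ctx))
    return accepted

context = [1, 2, 3]
for _ in range(5):
    new_tokens = speculative_step(context)
    context.extend(new_tokens)
    print(f"accepted {len(new_tokens)} tokens this step: {new_tokens}")
```

Each step emits between one and K+1 tokens while calling the heavy model only once per verification pass, which is where the speedup comes from.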

The MTP models have been fully open-sourced under the same Apache 2.0 license as Gemma 4, with native support in popular inference frameworks such as vLLM, SGLang, and Ollama. The speedup significantly lowers the barrier to entry, letting developers run the 26B MoE and 31B dense models smoothly on consumer-grade GPUs and enabling real-time AI interaction on mobile devices at lower power consumption.
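As a hedged example of what framework-level support looks like, the snippet below pairs a target model with a separate draft model in vLLM. The field names follow vLLM's speculative_config interface, which has changed across releases, and the repository IDs ("google/gemma-4-31b", "google/gemma-4-mtp-draft") are placeholders rather than confirmed model names.

```python
# Sketch: enabling speculative decoding in vLLM with a separate draft model.
# Repo IDs below are hypothetical placeholders; check the actual model cards.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31b",              # hypothetical target model id
    speculative_config={
        "model": "google/gemma-4-mtp-draft",  # hypothetical MTP draft model id
        "num_speculative_tokens": 4,          # tokens drafted per verify step
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

Because the target model still verifies every drafted token, generation with a setup like this should match running the target model alone, only faster.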
