According to Insight Beating monitoring, Google has deployed a Multi-Token Prediction (MTP) architecture on Pixel 9 and Pixel 10 series devices to accelerate the on-device Gemini Nano v3 model. By attaching a lightweight Transformer prediction head to the tail of the frozen main model, the new architecture increases on-device inference speed by more than 50% while fully retaining the original safety alignment and output quality.
Traditional speculative decoding requires running a separate draft model to propose candidate tokens. This consumes additional runtime memory on the phone and also limits prediction accuracy, because the standalone model cannot access the main model's internal hidden states. By attaching the MTP head to the tail of the frozen main model, the new architecture reuses the feature activations the main model has already computed, significantly improving the accuracy of the candidate tokens.
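
The draft-and-verify loop described above (commonly called speculative decoding) can be illustrated with a toy sketch. Everything in it is hypothetical: `main_next` stands in for the frozen main model, `draft_propose` for the attached prediction head, and the deliberate miss rule exists only to exercise the rejection path. It is not Google's implementation. It shows only why, under greedy verification, the output matches what the main model would have produced on its own.

```python
from typing import List, Tuple

Token = int


def main_next(context: List[Token]) -> Token:
    """Toy stand-in for the frozen main model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_propose(context: List[Token], k: int) -> List[Token]:
    """Toy stand-in for the prediction head: usually right, sometimes wrong."""
    proposal: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        guess = main_next(ctx)
        if len(ctx) % 4 == 0:
            guess = (guess + 1) % 50  # hypothetical rule that injects a miss
        proposal.append(guess)
        ctx.append(guess)
    return proposal


def verify(context: List[Token], proposal: List[Token]) -> List[Token]:
    """Keep the longest correct prefix, then add one token from the main model.

    A real system scores all proposed positions in one batched forward pass;
    this loop checks them one by one for clarity.
    """
    accepted: List[Token] = []
    ctx = list(context)
    for tok in proposal:
        target = main_next(ctx)
        if tok != target:
            accepted.append(target)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(main_next(ctx))  # all drafts accepted: one extra token
    return accepted


def generate_speculative(prompt: List[Token], n_new: int, k: int = 3) -> Tuple[List[Token], int]:
    """Generate n_new tokens and count the verification passes used."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify(out, draft_propose(out, k)))
        passes += 1
    return out[: len(prompt) + n_new], passes


def generate_greedy(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one main-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(main_next(out))
    return out
```

Because every accepted token is checked against the main model's own choice, `generate_speculative` returns the same sequence as `generate_greedy` while using fewer verification passes whenever at least some drafts are accepted.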
To avoid duplicating runtime memory for draft computation during autoregressive generation, Google designed a zero-copy mechanism. In the traditional approach, the draft model must maintain its own key-value (KV) cache while generating candidate tokens; with the zero-copy mechanism, the attached prediction head reads the main model's existing cache directly through cross-attention. This eliminates the startup delay of draft prediction and saves approximately 130 MB of runtime memory on the phone.
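
As a rough illustration of where such savings come from, the sketch below computes the size of a separate key-value cache and contrasts a private copy with a zero-copy view of an existing buffer. The model shape is hypothetical and is not meant to reproduce the 130 MB figure; the `array` buffer is only a stand-in for a real on-device cache.

```python
from array import array


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Cache size: keys and values for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical draft-model shape, chosen only to make the arithmetic concrete.
separate_draft_cache = kv_cache_bytes(layers=8, kv_heads=4, head_dim=64, seq_len=4096)

# Stand-in for the main model's cache buffer.
main_cache = array("f", [0.0] * 8)

# Zero-copy: the view shares storage with main_cache and allocates no new buffer.
shared_view = memoryview(main_cache)

# Traditional approach: a standalone draft model holds its own storage.
private_copy = array("f", main_cache)

# The main model writes a new cache entry.
main_cache[0] = 1.5

view_sees_update = shared_view[0] == 1.5   # the view reflects the write
copy_is_stale = private_copy[0] == 0.0     # the copy does not
```

The view sees the main model's write immediately, whereas the private copy would have to be filled and kept up to date separately, which is the duplicated memory and startup work that a shared cache avoids.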
In real Pixel use cases such as notification summaries and proofreading, the MTP architecture lets the model correctly predict nearly 2 extra tokens on average per inference pass, which reduces how often the main processor is woken for verification and so saves system power. In highly structured text generation tasks such as smart replies, the token acceptance rate improved by 55%.
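
To relate an acceptance rate to tokens produced per pass, a common simplification treats each drafted token as accepted independently with probability `alpha`. The sketch below uses that assumption, which the article does not state. The values of `alpha` and `k` are illustrative, not measurements from Pixel devices.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per main-model pass with k drafted tokens.

    Assumes each draft is accepted independently with probability alpha,
    and that the main model always contributes one token of its own.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def extra_tokens_per_pass(alpha: float, k: int) -> float:
    """Tokens gained beyond the single token of ordinary decoding."""
    return expected_tokens_per_pass(alpha, k) - 1.0
```

Under this model, a higher acceptance rate raises the number of extra tokens per pass, so fewer main-model passes are needed for the same output length.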
