
Google TurboQuant: 3-Bit KV Cache Quantization with No Loss of Accuracy, Inference Up to 8x Faster

According to 1M AI News, Google Research has released TurboQuant, a quantization compression algorithm that compresses the key-value (KV) cache of large language models down to 3 bits, cutting memory usage by at least 6x with no training or fine-tuning and no loss of model accuracy. In 4-bit mode, attention computation on an NVIDIA H100 GPU runs up to 8x faster than the 32-bit unquantized baseline.
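The article gives ratios rather than absolute sizes. A back-of-envelope sketch (the model dimensions below are illustrative assumptions, not figures from the article) shows why 3-bit KV-cache entries translate into multi-gigabyte savings at long context lengths:

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bits):
    """Size in bytes of the keys + values cached for one sequence."""
    elems = 2 * layers * heads * head_dim * seq_len  # factor 2: keys and values
    return elems * bits / 8

# Illustrative numbers for a 7B-class model (assumed, not from the article).
fp16 = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=32_768, bits=16)
q3 = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=32_768, bits=3)
print(f"fp16 cache: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB, "
      f"ratio: {fp16 / q3:.1f}x")
# → fp16 cache: 16.0 GiB, 3-bit: 3.0 GiB, ratio: 5.3x
```

Against a 16-bit cache the raw ratio is 16/3 ≈ 5.3x; the article's "at least 6x" presumably also counts savings relative to higher-precision baselines (a 32-bit cache would shrink more than 10x).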

The research team evaluated Gemma and Mistral models on long-context benchmarks including LongBench, Needle In A Haystack, and ZeroSCROLLS, where TurboQuant achieved the best results across the board. The algorithm combines two sub-algorithms: PolarQuant, which uses a polar-coordinate transformation to eliminate the memory overhead of conventional quantization methods, and QJL, which corrects the residual error using only 1 bit.
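The article does not give implementation details for either sub-algorithm. As a rough, toy illustration of the two underlying ideas only (pairing scheme, bit widths, and the exact role of each step are guesses, not Google's design): polar-coordinate quantization snaps angles to a coarse grid while preserving vector norms exactly, and a 1-bit Johnson-Lindenstrauss sketch keeps only the signs of a random projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Idea 1 (in the spirit of PolarQuant): view consecutive coordinate pairs as
# 2D points and quantize only the angle; the radius is carried through exactly,
# so per-pair norms are preserved. All specifics here are illustrative.
def polar_quantize(v, angle_bits=3):
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    step = 2 * np.pi / 2**angle_bits
    theta_hat = np.round(theta / step) * step  # snap angle to a 2^bits grid
    return np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], 1).ravel()

# Idea 2 (in the spirit of QJL): store only the sign of each coordinate of a
# random (Johnson-Lindenstrauss) projection -- one bit per output dimension.
def one_bit_jl(x, S):
    return np.sign(S @ x)

key = rng.standard_normal(128)
approx = polar_quantize(key)
print("relative quantization error:",
      np.linalg.norm(key - approx) / np.linalg.norm(key))
```

With 3 angle bits the worst-case angular error per pair is π/8, bounding the relative error well below 1 even before any 1-bit residual correction; a real system would combine both stages and use far more careful encodings.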

The research was led by Google Research's Amir Zandieh and Vice President & Google Fellow Vahab Mirrokni, in collaboration with KAIST in South Korea and New York University, and will be presented at ICLR 2026. Google says a key application of the technique is relieving the KV-cache bottleneck in models such as Gemini.
