According to 1M AI News, Google Research has released TurboQuant, a quantization compression algorithm that compresses the key-value (KV) cache of large language models to 3 bits, cutting memory usage at least sixfold with no training or fine-tuning required and no loss of model accuracy. In 4-bit mode, attention computation on an NVIDIA H100 GPU runs up to 8 times faster than the 32-bit unquantized baseline.
The research team validated TurboQuant on the Gemma and Mistral models using long-context benchmarks such as LongBench, Needle In A Haystack, and ZeroSCROLLS, where it achieved the best results across all tests. The algorithm combines two sub-algorithms: PolarQuant, which eliminates the memory overhead of traditional quantization methods via a polar-coordinate transformation, and QJL, which corrects residual errors using only 1 bit.
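To make the memory arithmetic concrete, here is a minimal, illustrative sketch of low-bit KV-cache quantization in NumPy. This is not TurboQuant itself (the polar-coordinate transform and 1-bit error correction are not reproduced here); it only shows the basic idea of rounding 32-bit cache entries to a small per-row grid of 3-bit codes, which is where the roughly 32/3 ≈ 10x raw payload reduction comes from before per-row metadata overhead.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Uniformly quantize each row of x to `bits` bits.

    Illustrative toy only: real KV-cache schemes such as TurboQuant
    apply more sophisticated transforms before quantizing.
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on flat rows
    # 3-bit codes (values 0..7); stored in uint8 here for simplicity,
    # a real kernel would pack them tightly.
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Map integer codes back to approximate float32 values."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 64)).astype(np.float32)  # toy "KV cache" rows

q, scale, lo = quantize_uniform(kv, bits=3)
recon = dequantize(q, scale, lo)
err = float(np.abs(kv - recon).max())

print("max abs reconstruction error:", err)
print("all codes fit in 3 bits:", int(q.max()) <= 7)
```

Uniform rounding like this bounds the per-element error by half a quantization step (`scale / 2`); the point of methods like PolarQuant and QJL is to shrink that residual error and the per-row metadata further at the same bit width.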
The research was led by Google Research's Amir Zandieh and Vice President & Google Fellow Vahab Mirrokni, in collaboration with KAIST in South Korea and New York University, and will be presented at ICLR 2026. Google stated that a key application of the technology is addressing the KV-cache bottleneck in models like Gemini.
