Baseten Introduces KV Cache Compression Method Still: Up to 200x Compression in a Single Forward Pass

According to Dynamic Beating monitoring, the Baseten research team has released Still, a KV cache compression method. Still freezes the base model's parameters and trains only a lightweight Perceiver compressor, so the KV cache can be compressed in a single forward pass at ratios of up to 200x, with no online optimization or gradient updates.
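
To give a sense of scale, the arithmetic behind a 200x ratio can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) and the helper names are illustrative assumptions, not figures from Baseten, and rounding the slot count up is also an assumption.

```python
import math

def compressed_slots(context_tokens: int, ratio: float) -> int:
    """Number of cache entries left after compression (rounded up; an assumption)."""
    return math.ceil(context_tokens / ratio)

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of a KV cache: keys and values (factor 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128)

full_tokens = 65_536                       # a 64k-token context
slots = compressed_slots(full_tokens, 200)  # entries kept at 200x
full_bytes = kv_cache_bytes(full_tokens, **shape)
small_bytes = kv_cache_bytes(slots, **shape)

print(f"{slots} slots, {full_bytes / 2**30:.1f} GiB -> {small_bytes / 2**20:.1f} MiB")
```

Under these assumed dimensions, the full cache occupies gigabytes while the compressed one occupies tens of megabytes; actual savings depend on the real model configuration.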

Existing KV compression methods fall roughly into two categories. The first, which includes SnapKV and H2O, selects important cache entries from the original tokens. The second generates a new compressed cache directly but often requires online optimization for each context. Still takes an amortized synthesis approach: it adds a Perceiver compressor to each Transformer layer, with a parameter count of approximately 1% of the base model. Latent queries cross-attend over the full KV cache and directly produce a new compressed KV cache.
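
The cross-attention step described above can be illustrated with a minimal NumPy sketch for a single layer and a single head. This is not Baseten's implementation: the function `compress_kv`, the projection matrices, and the choice to reuse one attention map for both keys and values are assumptions made for illustration, and the weights here are random rather than trained.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_kv(keys, values, latents, w_q, w_k, w_out_k, w_out_v):
    """Compress an (n, d) KV cache into (m, d) using m latent queries.

    Hypothetical design: the latents attend over the cached keys, and the
    resulting (m, n) attention map mixes the original keys and values into
    m synthesized entries. No gradient step is taken at compression time.
    """
    d = keys.shape[-1]
    q = latents @ w_q                               # (m, d) latent queries
    k = keys @ w_k                                  # (n, d) projected cache keys
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (m, n), rows sum to 1
    new_keys = (attn @ keys) @ w_out_k              # (m, d) synthesized keys
    new_values = (attn @ values) @ w_out_v          # (m, d) synthesized values
    return new_keys, new_values

rng = np.random.default_rng(0)
n, d, ratio = 1024, 64, 8          # cache length, head dim, compression ratio
m = n // ratio                     # number of latent queries = compressed length

keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
latents = rng.standard_normal((m, d))   # would be trained parameters in practice
w_q, w_k, w_out_k, w_out_v = (rng.standard_normal((d, d)) / np.sqrt(d)
                              for _ in range(4))

new_k, new_v = compress_kv(keys, values, latents, w_q, w_k, w_out_k, w_out_v)
print(new_k.shape, new_v.shape)
```

Because the output length is set by the number of latent queries rather than by which tokens are kept, the compressed cache consists of newly synthesized entries, in contrast to selection methods that retain a subset of the original ones.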

In tests with the Qwen and Gemma models, Still maintained high accuracy across context lengths of 8k to 64k tokens and compression ratios of 8x to 200x. On the long-context benchmark RULER, Still outperformed methods such as SnapKV, H2O, and KV-Distill in most settings, leading KV-Distill by approximately 8 to 22 percentage points in matched-training tasks. In the HELMET multi-document summarization test, the compressed cache recovered 74% to 95% of the full-context performance gain.
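
One plausible reading of "recovered performance gain" is the share of the improvement from a no-context baseline to full context that survives compression. The sketch below assumes that definition, which the report does not spell out, and uses made-up scores purely for illustration.

```python
def recovered_gain(no_context: float, full_context: float, compressed: float) -> float:
    """Fraction of the full-context gain retained by the compressed cache.

    Assumed definition: (compressed - baseline) / (full - baseline).
    """
    gain = full_context - no_context
    if gain == 0:
        raise ValueError("full-context score equals the baseline; gain is undefined")
    return (compressed - no_context) / gain

# Hypothetical scores, not results from the report.
fraction = recovered_gain(no_context=20.0, full_context=40.0, compressed=36.0)
print(f"{fraction:.0%} of the full-context gain recovered")
```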

Compared with approaches that must retain original tokens or rely on online optimization, Still turns KV compression into a learned operation that completes in a single forward pass, offering a new cache compression path for ultra-long-context inference.
