According to monitoring by PerfX Beating, MiniMax Developer Relations Lead Ryan Lee announced that MiniMax Sparse Attention (MSA), a high-performance attention library for NVIDIA Blackwell (SM100) GPUs, has been officially open-sourced under the MIT license. Lee also said that the MiniMax-M3 weights are expected to be released this Friday.
MSA powers million-token-context inference in MiniMax-M3. For each GQA group, it selects the most relevant KV blocks and computes attention only over those selected blocks. According to the paper, at a context length of 1 million tokens and compared with dense GQA of the same configuration, MSA reduces attention computation by 28.4× and delivers a 14.2× prefill speedup and a 7.6× decoding speedup on H800 GPUs.
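To make the selection step concrete, here is a minimal, pure-Python sketch of block-sparse top-k attention for a single decode step within one GQA group. It assumes a scoring rule the article does not describe: each KV block is summarized by the mean of its keys, and a block's relevance is the dot product with that centroid summed over every query head in the group, so all heads that share the KV head attend to the same blocks. The function names and the `block_size` and `top_k` parameters are hypothetical illustrations, not MSA's API; a real kernel would fuse these steps on the GPU.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(q, keys, values):
    """Scaled dot-product attention for one query vector (numerically stable softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def select_blocks(group_queries, keys, block_size, top_k):
    """Pick top_k KV blocks shared by all query heads of one GQA group.

    Hypothetical scoring: block centroid (mean key) dotted with each query,
    summed over the group's query heads. Not necessarily MSA's criterion.
    """
    num_blocks = math.ceil(len(keys) / block_size)
    block_scores = []
    for b in range(num_blocks):
        centroid = mean_pool(keys[b * block_size:(b + 1) * block_size])
        block_scores.append(sum(dot(q, centroid) for q in group_queries))
    ranked = sorted(range(num_blocks), key=lambda b: block_scores[b], reverse=True)
    return sorted(ranked[:top_k])  # keep selected blocks in sequence order


def sparse_group_attention(group_queries, keys, values, block_size, top_k):
    """Attend only to the tokens inside the selected blocks."""
    selected = select_blocks(group_queries, keys, block_size, top_k)
    idx = [i for b in selected
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    sub_k = [keys[i] for i in idx]
    sub_v = [values[i] for i in idx]
    outputs = [softmax_attention(q, sub_k, sub_v) for q in group_queries]
    return selected, outputs


# Example: one GQA group with two query heads sharing 8 cached tokens (4 blocks of 2).
queries = [[1.0, 0.0], [0.9, 0.1]]
keys = [[0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0],
        [2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
values = [[float(i), 1.0] for i in range(8)]
selected, outputs = sparse_group_attention(queries, keys, values, block_size=2, top_k=2)
print(selected)  # indices of the KV blocks this group attends to
```

Selecting every block reproduces dense attention exactly, which is a handy sanity check. With `top_k` blocks of `block_size` tokens out of `len(keys)` cached tokens, the per-query attention work shrinks by roughly `len(keys) / (top_k * block_size)`, ignoring the cost of scoring the blocks; the article does not state the block size or top-k that MSA uses to reach its reported reduction.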
The open-source release ships both C++ JIT and CuTe-DSL implementations in a single Python package and provides dense FlashAttention and sparse top-k attention kernels, with support for precision formats including BF16, FP8, NVFP4, and FP4. It currently targets mainly NVIDIA Blackwell (SM100) GPUs.
