According to monitoring by Dynamic Vision Beating, Chinese large-model developer MiniMax has officially open-sourced the weights of MiniMax M3, a natively multimodal mixture-of-experts (MoE) model, on Hugging Face. MiniMax M3 has 428 billion total parameters, of which 230 billion are activated per token, and natively supports a context window of 1 million tokens. To reduce GPU memory overhead at deployment, the development team also released an MXFP8-quantized version and adapted the model for mainstream inference frameworks including SGLang, vLLM, and Transformers.
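To give a sense of why a quantized release matters at this scale, the back-of-the-envelope sketch below estimates weight storage from the reported 428 billion parameters. It is only an illustration under stated assumptions: the article does not say what precision the unquantized weights use, so 2 bytes per parameter (BF16) is assumed here, and MXFP8 is treated as roughly 1 byte per parameter with per-block scale metadata ignored. KV cache and activation memory are not counted.

```python
# Rough weight-memory estimate. Ignores KV cache, activations, and
# quantization scale metadata, so real deployments need more than this.
TOTAL_PARAMS = 428e9  # total parameter count reported in the article

# Assumptions, not figures from the article: BF16 baseline at 2 bytes per
# parameter, MXFP8 approximated as 1 byte per parameter.
BYTES_PER_PARAM = {"bf16 (assumed baseline)": 2.0, "mxfp8 (approx.)": 1.0}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes, using 1 GB = 1e9 bytes."""
    return num_params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, width):.0f} GB")
```

Under these assumptions the 8-bit weights occupy about half the space of a 16-bit copy, roughly 428 GB against 856 GB.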
In terms of multimodal design, MiniMax M3 is trained jointly on text, images, and video during pre-training to achieve native semantic fusion, rather than aligning modalities in post-training. The model offers two reasoning modes: a Thinking mode for complex logic and tool orchestration, and a Non-thinking mode for low-latency dialogue and code generation.
The 1-million-token context is underpinned by MiniMax Sparse Attention (MSA), a lightweight attention kernel library open-sourced at the same time. According to official data, MSA uses a Grouped Query Attention (GQA) chunked retrieval mechanism. In real-world tests with a 1-million-token context, the MSA operator optimized for the NVIDIA Blackwell (SM100) architecture delivers more than 9x faster prefill and 15x faster decoding than conventional full attention, while significantly reducing inference overhead.
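The article does not describe how MSA works internally, so the following is not its implementation. It is a generic, single-head sketch of what "chunked retrieval" commonly means in sparse attention: keys are grouped into fixed-size chunks, each chunk is scored against the query by its mean key, and softmax attention runs only over the best-scoring chunks. The function names, the mean-key scoring rule, and the `chunk_size` and `top_k` parameters are all illustrative assumptions, and GQA head sharing is omitted.

```python
import math


def softmax_attention(q, keys, values):
    """Dense softmax attention for one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def chunked_sparse_attention(q, keys, values, chunk_size, top_k):
    """Attend only to the top_k chunks whose mean key best matches q.

    Illustrative only: the mean-key scoring rule is an assumption, not MSA's.
    """
    if len(keys) % chunk_size != 0:
        raise ValueError("sequence length must be a multiple of chunk_size")
    num_chunks = len(keys) // chunk_size
    d = len(q)

    # Score each chunk by the dot product of q with the chunk's mean key.
    chunk_scores = []
    for c in range(num_chunks):
        block = keys[c * chunk_size:(c + 1) * chunk_size]
        mean_key = [sum(k[j] for k in block) / chunk_size for j in range(d)]
        chunk_scores.append(sum(a * b for a, b in zip(q, mean_key)))

    # Keep the top_k chunks, restored to their original sequence order.
    ranked = sorted(range(num_chunks), key=lambda c: chunk_scores[c], reverse=True)
    keep = sorted(ranked[:top_k])
    idx = [c * chunk_size + i for c in keep for i in range(chunk_size)]
    return softmax_attention(q, [keys[i] for i in idx], [values[i] for i in idx])


# Toy usage: four tokens in two chunks; only the better-matching chunk is read.
toy_keys = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
toy_values = [[1.0], [1.0], [5.0], [5.0]]
print(chunked_sparse_attention([1.0, 0.0], toy_keys, toy_values, chunk_size=2, top_k=1))
```

In a scheme like this, each query touches `top_k * chunk_size` keys rather than the whole sequence, which is where savings at very long context lengths would come from. When `top_k` equals the number of chunks, the sketch reduces to ordinary full attention.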
