
Meituan Open Sources LongCat-Next: 3B Unified Visual Understanding, Generation, and Speech

According to 1M AI News monitoring, the Meituan LongCat team has open-sourced LongCat-Next, a native multimodal model based on an MoE architecture with 3B activated parameters, unifying text understanding, visual understanding, image generation, speech understanding, and speech synthesis in a single autoregressive framework. The model and its accompanying tokenizer are released under the MIT license, and the weights have been uploaded to HuggingFace.

The core design of LongCat-Next is the DiNA (Discrete autoregressive Native) paradigm: each modality gets a paired tokenizer and decoder, so visual and audio signals are converted into discrete tokens that share the same embedding space as text, and all tasks are carried out through unified next-token prediction. The key visual component, dNaViT (Discrete Native Vision Transformer), compresses image features into "visual words," supports dynamic tokenization and decoding, and maintains strong image-generation quality at a 28x compression ratio, excelling in particular at text rendering.
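To make the shared-vocabulary idea concrete, here is a minimal sketch of how discrete tokens from different modalities can live in one embedding table and be predicted by a single next-token head. All vocabulary sizes, id ranges, and dimensions below are illustrative assumptions, not LongCat-Next's actual configuration, and the toy "prediction" here is just a dot product against the embedding table rather than a trained transformer.

```python
import numpy as np

# Hypothetical id layout: each modality's tokenizer emits ids in a disjoint
# slice of ONE shared vocabulary, so a single autoregressive model can treat
# text, "visual words," and audio codec tokens uniformly.
TEXT_RANGE   = range(0, 1000)       # ids 0..999    : text tokens
VISUAL_RANGE = range(1000, 1512)    # ids 1000..1511: visual words (dNaViT-style)
AUDIO_RANGE  = range(1512, 2024)    # ids 1512..2023: audio tokens

VOCAB_SIZE = 2024
EMBED_DIM = 32

rng = np.random.default_rng(0)
# One embedding table covers every modality, so visual and audio tokens
# share the same embedding space as text.
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def embed(token_ids):
    """Look up embeddings for a mixed-modality token sequence."""
    return embedding[np.asarray(token_ids)]

def predict_next(token_ids):
    """Toy unified next-token step: score every id in the joint vocabulary.

    A real model would run the sequence through transformer layers; here we
    just project the last token's embedding onto the table to show that the
    prediction space spans all modalities at once.
    """
    hidden = embed(token_ids)
    logits = hidden[-1] @ embedding.T   # shape: (VOCAB_SIZE,)
    return int(np.argmax(logits))

# A mixed sequence: text prompt, then image tokens, then an audio token.
sequence = [5, 42, 7] + [1003, 1400] + [1600]
next_id = predict_next(sequence)
assert 0 <= next_id < VOCAB_SIZE  # the model may emit a token of any modality
```

The point of the sketch is the single `embedding` table and the single logits vector over the joint vocabulary: generation can switch modality mid-sequence simply by emitting an id from a different range, which is what lets one next-token objective cover understanding, image generation, and speech.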

In a comparison against models of similar activated-parameter scale (A3B), LongCat-Next's main benchmark results are as follows:

1. Visual Understanding: MMMU-Pro 60.3 (Qwen3-Omni 57.0, GPT5-minimal 62.7), MathVista 83.1 (Qwen3-Omni 75.9, GPT5-minimal 50.9), MathVision 64.7 (outperforming all comparison models), DocVQA 94.2
2. Image Generation: GenEval 84.44, LongText-EN 93.15 (FLUX.1-dev 60.70, Emu-3.5 97.60)
3. Programming: SWE-Bench 43.0 (Kimi-Linear-48B 32.8, Qwen3-Next-80B 37.6)
4. Agent Tool Invocation: Tau2-Retail 73.68 (Qwen3-Next 57.3), Tau2-Telecom 62.06 (Qwen3-Next 13.2)

In a comparison across unified understanding-and-generation models, LongCat-Next's MMMU score of 70.6 edges out second-place NEO-unify (68.9) and significantly surpasses earlier unified-model approaches such as BAGEL (55.3) and Ovis-U1 (51.1). Its SWE-Bench score of 43.0 and its results on the Tau2 tool-invocation benchmarks also show that the unified multimodal architecture does not sacrifice pure-text or agent capabilities.
