According to 1M AI News, Tongyi Lab has released the omni-modal model Qwen3.5-Omni, which supports text, image, audio, and audio-visual inputs and can generate fine-grained, timestamped audio-video captions. The official announcement says Qwen3.5-Omni-Plus achieved 215 state-of-the-art (SOTA) results across tasks such as audio and video understanding, reasoning, dialogue, and translation, surpassing Gemini-3.1-Pro.
The most notable improvement this time is not in the rankings but in what Tongyi calls a "naturally emerging Audio-Visual Vibe Coding capability": the model, without task-specific training, can already generate executable code directly from audio-visual instructions. The announcement also notes that the model supports a 256K context window, recognizes 113 languages, can process up to 10 hours of audio or 1 hour of video, and natively supports web search (WebSearch) and complex function calling (Function Call).
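For orientation, a minimal sketch of what the advertised function-calling support could look like over an OpenAI-compatible interface is shown below; the base URL, the exact model ID, and the web_search tool name are assumptions for illustration and are not confirmed in the announcement.

```python
# Hypothetical sketch of calling a Qwen3.5-Omni model with a tool definition.
# The base_url, model ID, and tool name are assumptions, not official details.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                                        # placeholder credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

# One tool the model may decide to invoke, in the standard
# OpenAI-compatible function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool name
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # model name per the article; exact ID is an assumption
    messages=[{"role": "user", "content": "What did Tongyi Lab release today?"}],
    tools=tools,
)

# If the model decides a search is needed, it returns a tool call instead of plain text.
print(resp.choices[0].message.tool_calls)
```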
Qwen3.5-Omni retains the Thinker-Talker split architecture, with both components upgraded to Hybrid-Attention MoE. Through Alibaba Cloud Wushuang, Tongyi offers three sizes: Plus, Flash, and Light, and has also launched a real-time version, Qwen3.5-Omni-Plus-Realtime.
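As a rough sense of how the timestamped audio-video captioning described above might be requested through the cloud-hosted Plus model, here is a minimal sketch; the endpoint, model ID, and the "video_url" message part are assumptions modeled on OpenAI-compatible multimodal requests, not documented specifics from the release.

```python
# Hypothetical sketch of requesting timestamped audio-video captions from the
# hosted Plus model. Endpoint, model ID, and the "video_url" content type are
# assumptions for illustration only.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

resp = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # article names Plus/Flash/Light tiers; ID is assumed
    messages=[{
        "role": "user",
        "content": [
            # Placeholder video URL; the part type mirrors common
            # OpenAI-compatible multimodal message formats.
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
            {"type": "text",
             "text": "Produce fine-grained captions of the audio and video, with timestamps."},
        ],
    }],
)

print(resp.choices[0].message.content)
```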
