According to monitoring by Perceiving Beating, Alibaba's large model team has released Qwen-Robot Suite, a suite of foundation models for embodied intelligence. It comprises three models, Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld, which cover navigation, manipulation, and world simulation respectively. The suite aims to align vision-language models with physical actions so that they generalize across tasks and robot embodiments.
The navigation model, Qwen-RobotNav, unifies instruction following, target navigation, target tracking, and autonomous driving in a single model. Its design parameterizes the visual attention strategy, so the visual token budget and frame sampling can be adjusted dynamically at inference time. Trained on 15.6 million samples, Qwen-RobotNav achieved state-of-the-art (SOTA) results in five navigation domains and has been deployed zero-shot on the Unitree Go2 quadruped robot.
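The trade-off between a visual token budget and frame sampling can be illustrated with a small sketch. This is not Qwen-RobotNav's actual mechanism, which the report does not describe; the function name `sample_frames`, the fixed per-frame token cost, and the uniform-spacing rule are all assumptions made for illustration.

```python
from typing import List


def sample_frames(num_frames: int, tokens_per_frame: int, token_budget: int) -> List[int]:
    """Pick evenly spaced frame indices whose total token cost fits the budget.

    Hypothetical helper: assumes every frame costs the same number of
    visual tokens, which a real model need not guarantee.
    """
    if num_frames <= 0 or tokens_per_frame <= 0:
        return []
    # The budget caps how many frames can be encoded at all.
    max_frames = min(num_frames, token_budget // tokens_per_frame)
    if max_frames <= 0:
        return []
    if max_frames == 1:
        # With room for a single frame, keep the most recent observation.
        return [num_frames - 1]
    # Spread the kept frames uniformly from the first to the last frame.
    return [(i * (num_frames - 1)) // (max_frames - 1) for i in range(max_frames)]


if __name__ == "__main__":
    # 10 frames at 64 tokens each under a 256-token budget keeps 4 frames.
    print(sample_frames(10, 64, 256))
```

Lowering the budget at inference time thins out the sampled frames, while raising it keeps more of the history, which is the kind of adjustment the report attributes to the model.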
The manipulation model, Qwen-RobotManip, combines a Qwen3.5-4B VL backbone with a flow-matching DiT action head and uses an 80-dimensional state-action representation to output incremental end-effector poses. The team trained it on more than 38,100 hours of data, including open-source robot demonstrations, human videos, and data synthesized through human-to-robot transfer. It reaches a 91.4% success rate on the LIBERO-Plus evaluation.
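For readers unfamiliar with flow matching, the sampling side of such an action head can be sketched as Euler integration of a velocity field from Gaussian noise (t = 0) to an action (t = 1). The sketch below is a toy under stated assumptions: `toy_velocity` stands in for the learned DiT and simply points along a straight-line path toward a known target, and only the 80-dimensional size is taken from the report.

```python
import random
from typing import List

ACTION_DIM = 80  # from the report: 80-dimensional state-action representation


def toy_velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Hypothetical stand-in for the learned network.

    Returns the velocity of the straight-line path that reaches `target`
    at t = 1. A real model would predict this from observations instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target: List[float], num_steps: int = 10, seed: int = 0) -> List[float]:
    """Integrate dx/dt = v(x, t) from noise at t = 0 to an action at t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(len(target))]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


if __name__ == "__main__":
    target = [0.01 * i for i in range(ACTION_DIM)]
    action = sample_action(target)
    print(len(action), max(abs(a - g) for a, g in zip(action, target)))
```

In this toy case the integration lands on the target because the velocity field is exact; with a learned field, the number of Euler steps trades inference cost against accuracy.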
The physical world prediction model, Qwen-RobotWorld, uses natural language as a unified interface for robot actions. Architecturally, it deeply couples Qwen2.5-VL semantic representations with video latent variables through a 60-layer dual-stream MMDiT. Trained on 8.6 million video-text pairs, Qwen-RobotWorld ranks first on benchmarks that evaluate compliance with physical laws, such as EWMBench and WorldModelBench.
All three models expose a language-first interface. Alibaba has also introduced Qwen-RobotClaw, a robot intelligence framework that lets a higher-level planner (such as Qwen3.5) call the suite's models as physical tools to carry out multi-step operations.
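The planner-calls-tools pattern described here can be sketched in a few lines. Nothing below reflects Qwen-RobotClaw's real API, which the report does not show: the tool names, the string-in/string-out signature, and the fixed plan are hypothetical placeholders for a planner that would normally be a language model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: a natural-language instruction in, a status string out.
Tool = Callable[[str], str]


def navigate(instruction: str) -> str:
    """Placeholder for a navigation model call."""
    return f"[nav] done: {instruction}"


def manipulate(instruction: str) -> str:
    """Placeholder for a manipulation model call."""
    return f"[manip] done: {instruction}"


def simulate(instruction: str) -> str:
    """Placeholder for a world-model rollout."""
    return f"[world] predicted: {instruction}"


TOOLS: Dict[str, Tool] = {
    "navigate": navigate,
    "manipulate": manipulate,
    "simulate": simulate,
}


def run_plan(plan: List[Tuple[str, str]], tools: Dict[str, Tool]) -> List[str]:
    """Execute (tool_name, instruction) steps in order and collect the results."""
    log = []
    for tool_name, instruction in plan:
        if tool_name not in tools:
            raise KeyError(f"unknown tool: {tool_name}")
        log.append(tools[tool_name](instruction))
    return log


if __name__ == "__main__":
    plan = [("navigate", "go to the kitchen"), ("manipulate", "pick up the cup")]
    for line in run_plan(plan, TOOLS):
        print(line)
```

Because every tool accepts plain language, the planner needs no robot-specific action format, which is the point of a language-first interface.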
