According to Perceive Beating monitoring, simultaneous interpretation systems are evolving from voice-only translation into full-modal digital interpreters that can understand visuals and clone voices. On May 19, Alibaba Tongyi Lab officially announced Qwen3.5-LiveTranslate, a new-generation large model for real-time audio-video simultaneous interpretation. The release expands real-time interpretation to more than 3,500 language pairs and adds real-time voice cloning, customizable hotwords, and visual understanding.
The new model is based on the Qwen3.5-Omni architecture and supports understanding and generation in 60 languages, with speech output in 29 languages.
Unlike traditional simultaneous interpretation software, which relies on audio alone, the new model draws on real-time visual context to resolve semantic ambiguity. For example, when a mask appears in the video frame, the system can use its visual features to choose the correct English rendering, distinguishing a medical mask from a masquerade mask and compensating for information missing from the audio.
To counter transcription errors caused by noise and accents, the new model also introduces a dynamic hotword injection mechanism. Users can specify particular names, brands, or industry terms directly in the translation stream, locking in the correct rendering so that proper nouns do not drift during interpretation.
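The announcement does not describe how hotword injection works inside the model. As a rough illustration of the general idea of term locking only, the sketch below uses a common pre- and post-processing technique: hotwords are swapped for placeholders before translation and restored with their fixed renderings afterward. The function `translate_with_hotwords`, the placeholder format, and the stand-in `toy_translate` are all hypothetical and are not part of any Qwen or Alibaba Cloud API.

```python
import re
from typing import Callable, Dict


def translate_with_hotwords(
    text: str,
    hotwords: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Translate `text` while forcing fixed renderings for hotwords.

    `hotwords` maps a source term to the rendering it must receive.
    `translate` is any text-to-text translator (hypothetical here).
    """
    restored: Dict[str, str] = {}
    protected = text
    # Handle longer terms first so a short term cannot split a longer one.
    for i, term in enumerate(sorted(hotwords, key=len, reverse=True)):
        token = f"__HW{i}__"
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        if pattern.search(protected):
            protected = pattern.sub(lambda _m, t=token: t, protected)
            restored[token] = hotwords[term]
    translated = translate(protected)
    # Put the locked renderings back in place of the placeholders.
    for token, rendering in restored.items():
        translated = translated.replace(token, rendering)
    return translated


def toy_translate(s: str) -> str:
    # Stand-in for a real translator; it only tags the text.
    return "[zh] " + s


glossary = {"Tongyi Lab": "通义实验室", "LiveTranslate": "LiveTranslate"}
result = translate_with_hotwords(
    "tongyi lab released LiveTranslate", glossary, toy_translate
)
print(result)
```

A real system would need placeholders that the translator is guaranteed to pass through unchanged; the simple token format here is an assumption made for brevity.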
During cross-language interpretation, the model also supports real-time voice cloning, reproducing the speaker's original timbre and intonation in the interpreted output.
The new model is currently available on the Qwen Omni experience platform, and APIs are set to launch later on the Alibaba Cloud Everest Platform.
