
Xiaohongshu (Red Book) open-sources dots.tts, an end-to-end neural text-to-speech (TTS) model with zero-shot voice cloning

According to Perceiving Excellence monitoring, Xiaohongshu's hi lab has open-sourced dots.tts, a 2-billion-parameter end-to-end autoregressive text-to-speech (TTS) model, and has released the full inference and fine-tuning code under the Apache 2.0 license. The released weights come in three versions: the base pretrained model, a version fine-tuned with self-correcting alignment (SCA), and a distilled version for low-latency inference.

Unlike TTS architectures that encode and decode audio through discrete codec tokens (such as VALL-E, CosyVoice, and ChatTTS), dots.tts uses a fully continuous, end-to-end autoregressive flow-based architecture with no discrete tokens anywhere in the pipeline. It combines continuous features extracted by an AudioVAE at a 48 kHz sampling rate with a semantic encoder, a base language model (initialized from Qwen2.5-1.5B-Base and processing BPE text directly, without Pinyin input), and an autoregressive flow-based acoustic head that predicts continuous latent variables; a generator then reconstructs these latents into audio. By predicting continuous features directly, dots.tts avoids the audio quality loss introduced by discrete quantization and preserves pronunciation detail, timbre similarity, and emotional expressiveness.
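The idea of generating continuous latents autoregressively, with no quantization step, can be illustrated with a toy loop. This is only a sketch and not the dots.tts code: the dimensions, the random weights, the linear velocity field, and the function names (`backbone_step`, `sample_latent`, `generate`) are all assumptions made for illustration. It also assumes one common way of building a flow-based head, namely integrating a velocity field from noise to a latent with Euler steps; the article does not say whether dots.tts works exactly this way.

```python
import math
import random

LATENT_DIM = 4   # hypothetical size of one continuous latent frame
HIDDEN_DIM = 6   # hypothetical size of the backbone hidden state

rng = random.Random(0)


def random_matrix(rows, cols):
    """Random stand-in weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector times matrix, in plain Python."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


W_BACKBONE = random_matrix(HIDDEN_DIM + LATENT_DIM, HIDDEN_DIM)
W_VELOCITY = random_matrix(HIDDEN_DIM + LATENT_DIM + 1, LATENT_DIM)


def backbone_step(text_state, prev_latent):
    """Stand-in for the language model: mix text context with the previous frame."""
    return [math.tanh(h) for h in matvec(text_state + prev_latent, W_BACKBONE)]


def sample_latent(hidden, n_steps=4):
    """Toy flow head: Euler-integrate a velocity field from noise (t=0) to a latent (t=1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    dt = 1.0 / n_steps
    for step in range(n_steps):
        v = matvec(hidden + x + [step * dt], W_VELOCITY)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(text_state, n_frames):
    """Emit latent frames one at a time; every value stays a float, never a token id."""
    prev = [0.0] * LATENT_DIM
    frames = []
    for _ in range(n_frames):
        hidden = backbone_step(text_state, prev)
        prev = sample_latent(hidden)
        frames.append(prev)
    return frames


text_state = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
latents = generate(text_state, n_frames=5)
print(len(latents), len(latents[0]))  # 5 4
```

In a real system the list of latent frames would be passed to a decoder (the generator mentioned above) to produce a waveform; that stage is omitted here.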

dots.tts was pretrained on about 1.5 million hours of speech data. On Seed-TTS-Eval, it achieved word error rates (WER) of 0.94% / 1.30% / 6.60% on the Chinese, English, and Chinese hard-case test sets, respectively, with speaker similarity (SIM) scores of 81.0 / 77.1 / 79.5, all at the state-of-the-art level among open-source models. On the 24-language MiniMax Multilingual benchmark, its average speaker similarity reached 83.9. Xiaohongshu also provides a Gradio demo space on Hugging Face where users can try zero-shot voice cloning online.
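For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally computed. WER is the edit distance between a transcript of the synthesized audio and the reference text, divided by the reference length. Speaker similarity is usually the cosine similarity between speaker embeddings of the prompt and the output, and the scores quoted above are presumably that value scaled by 100. The article does not describe the evaluation scripts, so the function names here are hypothetical, and the tokenization is an assumption too: the example splits on whitespace, whereas Chinese evaluations commonly count characters instead.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one row kept at a time)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One dropped word out of six reference words gives a WER of 1/6.
score = wer("the cat sat on the mat", "the cat sat on mat")
print(f"{score:.2%}")  # 16.67%
```

In an actual evaluation the hypothesis text would come from an ASR system run on the synthesized audio, and the embeddings from a speaker-verification model; both are outside the scope of this sketch.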
