According to Perceiving Excellence monitoring, RedX hi lab has open-sourced dots.tts, a 2-billion-parameter end-to-end autoregressive text-to-speech (TTS) model, and has released the full inference and fine-tuning code under the Apache 2.0 license. The released weights include a base pretrained version, a version fine-tuned with self-correcting alignment (SCA), and a distilled version for low-latency inference.
Unlike traditional TTS architectures that rely on discrete codec tokens for audio encoding and decoding (such as VALL-E, CosyVoice, and ChatTTS), dots.tts uses a fully continuous, end-to-end autoregressive flow-based architecture with no discrete tokens anywhere in the pipeline. It combines continuous features extracted by AudioVAE at a 48 kHz sampling rate with a semantic encoder, a base language model (initialized from Qwen2.5-1.5B-Base and operating directly on BPE text, without Pinyin input), and an autoregressive flow-based acoustic head that predicts continuous latent variables, which a generator then reconstructs into audio. By predicting continuous features directly, dots.tts avoids the audio quality loss caused by discrete quantization, preserving pronunciation details, timbre similarity, and emotional expressiveness.
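The quantization loss mentioned above can be illustrated with a toy example. The sketch below is not dots.tts code: the random latents, the 64-entry codebook, and all function names are hypothetical stand-ins. It only isolates the one step that a discrete-token pipeline adds, namely snapping each continuous latent vector to its nearest codebook entry, and measures the error that step introduces.

```python
import random

def mse(x, y):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def nearest_code(vec, codebook):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)))

def mean_quantization_error(latents, codebook):
    """Average MSE introduced by replacing each latent with its nearest code."""
    errors = [mse(v, nearest_code(v, codebook)) for v in latents]
    return sum(errors) / len(errors)

# Hypothetical data: 200 random 8-dimensional "latents" and a 64-entry codebook.
rng = random.Random(0)
dim = 8
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

# Discrete path: every latent is snapped to a code, so information is lost.
discrete_err = mean_quantization_error(latents, codebook)

# Continuous path: latents are passed on unchanged, so this step adds no error.
continuous_err = sum(mse(v, v) for v in latents) / len(latents)

print(f"discrete quantization MSE: {discrete_err:.4f}")
print(f"continuous pass-through MSE: {continuous_err:.4f}")
```

A real continuous pipeline still has other sources of loss (for example, the autoencoder itself); the toy only shows that the quantization step, by itself, is lossy whenever a latent does not coincide with a codebook entry.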
dots.tts was pretrained on about 1.5 million hours of speech data. On Seed-TTS-Eval, it achieved word error rates (WER) of 0.94% / 1.30% / 6.60% on the Chinese, English, and Chinese hard test sets, respectively, with speaker similarity (SIM) scores of 81.0 / 77.1 / 79.5, all at the state-of-the-art level among open-source models. On the MiniMax Multilingual benchmark covering 24 languages, its average speaker similarity reached 83.9. RedX has also set up a Gradio demo space on Hugging Face where users can try zero-shot voice cloning online.
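For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined. It is a minimal illustration, not the Seed-TTS-Eval tooling: WER is taken as the token-level edit distance between a reference transcript and a recognized transcript divided by the reference length, and SIM as the cosine similarity between two speaker embedding vectors. The function names, the whitespace tokenization, and the tiny example vectors are assumptions; actual benchmarks fix their own speech recognizer, speaker embedding model, tokenization, and score scaling.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref_tokens, hyp_tokens):
    """Word error rate: edit distance divided by reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
print(f"WER: {wer(reference, hypothesis):.3f}")

# Hypothetical 3-dimensional speaker embeddings for a prompt and a cloned voice.
print(f"SIM: {cosine_similarity([0.2, 0.9, 0.4], [0.25, 0.85, 0.45]):.3f}")
```

For Chinese text, the same edit-distance computation is typically applied to characters rather than whitespace-separated words.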
