According to monitoring by Dynamic Beating, AI voice model startup Cartesia has released Sonic-3.5 and Ink-2, two models that together form a unified real-time speech AI stack. Sonic-3.5 handles text-to-speech (TTS), while Ink-2 handles speech-to-text (STT).
Sonic-3.5 targets real-time, low-latency speech generation, cutting the time to first audio output to 90 milliseconds. It natively supports 42 languages and can pronounce English homographs and alphanumeric strings without preprocessing.
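For readers who want to check a first-audio latency figure like this against their own setup, the following is a minimal sketch of how time to first audio can be measured on any streaming TTS response. It does not use Cartesia's API: `time_to_first_chunk` and `fake_stream` are hypothetical names introduced here, and the stream is assumed to be any Python iterable that yields audio bytes as they arrive.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_to_first_chunk(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[bytes], float]:
    """Return the first non-empty audio chunk and the elapsed time in ms.

    `stream` is assumed to be lazy, so the underlying request starts when
    iteration begins. `clock` is injectable so the function can be tested
    without real waiting.
    """
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return chunk, (clock() - start) * 1000.0
    # The stream ended without producing any audio.
    return None, (clock() - start) * 1000.0


if __name__ == "__main__":
    def fake_stream():
        # Hypothetical stand-in for a real streaming TTS response.
        time.sleep(0.05)
        yield b"\x00\x01"
        yield b"\x02\x03"

    chunk, elapsed_ms = time_to_first_chunk(fake_stream())
    print(f"first chunk after {elapsed_ms:.1f} ms")
```

A figure measured this way from the client includes network round-trip time, so it may not be directly comparable to a vendor-reported number.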
Ink-2 lowers the word error rate (WER) to 3.6% and adds native turn detection and noise handling. It judges whether a user has finished speaking from sentence context and semantic understanding, rather than relying solely on silence duration as traditional systems do. Ink-2 currently supports English only, with multilingual support planned for future releases.
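As background on the metric, word error rate is conventionally defined as WER = (S + D + I) / N, where S, D and I are the substituted, deleted and inserted words in the transcript and N is the number of words in the reference. A 3.6% WER therefore corresponds to roughly 36 word errors per 1,000 reference words. The sketch below computes it with a standard word-level edit distance; `word_error_rate` is an illustrative helper, not Cartesia's evaluation code, and it assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("the quick brown fox jumps",
                          "the quick brown box jumps"))  # 0.2
```

Published WER figures depend heavily on the test set and on normalization rules such as casing, punctuation and number formatting, so scores from this sketch are only comparable when those are held fixed.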
Developers can call both models through a single API. Sonic-3.5 and Ink-2 are designed to work together in both directions, minimizing the transmission latency and system overhead that come from "multi-vendor stitching," that is, chaining speech services from different providers.
