
xAI Launches Open Grok Speech-to-Text and Text-to-Speech Audio API, Reducing STT Word Error Rate to 6.9%

According to monitoring by Perceive Beating, xAI has launched two standalone audio APIs: Grok Speech to Text and Grok Text to Speech. Both are built on the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. They are now exposed as separate endpoints that developers can integrate directly into voice assistants, real-time transcription, accessibility tools, podcasts, and more.

STT offers two modes. The REST API handles batch transcription of large audio files with millisecond-level response times, while the WebSocket API is designed for real-time speech streaming. Additional capabilities include word-level timestamps, speaker diarization, per-channel recognition of multi-channel audio, and Inverse Text Normalization, which automatically formats spoken numbers, dates, and currencies into structured text. The service supports 25+ languages and allows seamless language switching within a conversation.
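To illustrate how word-level timestamps and speaker diarization are typically consumed together, here is a minimal sketch that merges consecutive words from the same speaker into turns. The `Word` and `Turn` types and their field names are hypothetical and chosen for illustration only; the article does not describe the actual response schema of the Grok STT API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word (hypothetical shape, not the real API schema)."""
    text: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: int   # diarization label


@dataclass
class Turn:
    """A run of consecutive words spoken by the same speaker."""
    speaker: int
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge adjacent words with the same speaker label into turns."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Made-up sample data for demonstration.
words = [
    Word("Hello", 0.00, 0.40, 0),
    Word("there", 0.45, 0.80, 0),
    Word("Hi", 1.10, 1.30, 1),
    Word("again", 1.90, 2.30, 0),
]
turns = group_into_turns(words)
for t in turns:
    print(f"[{t.start:.2f}-{t.end:.2f}] speaker {t.speaker}: {t.text}")
```

The same grouping logic would apply to a streaming transcript, provided each incoming word carries a speaker label.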

xAI also released a set of Word Error Rate (WER, lower is better) comparisons. Overall, Grok scores 6.9%, against 9.0% for ElevenLabs, 11.0% for Deepgram, and 12.9% for AssemblyAI. The gap widens in "telephone conversation entity recognition," where Grok scores 5.0%, compared with 12.0%, 13.5%, and 21.3% for the three competitors respectively. Grok also holds a slight lead in common business scenarios such as meetings, video podcasts, and phone calls. These numbers were measured and published by xAI itself and have not yet been verified by a third party.
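For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below implements that textbook definition with a word-level edit distance. It is not xAI's evaluation code, and real benchmarks usually normalize text (casing, punctuation, number formatting) before scoring, which this example skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the")
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.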

In terms of pricing, STT costs $0.10 per hour for batch processing and $0.20 per hour for streaming; TTS is priced at $4.20 per 1 million characters.
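As a rough illustration of what these rates mean for a workload, the following sketch estimates cost from audio hours and character counts. It uses only the three list prices quoted above. The function names are made up for this example, and the linear pro-rating is an assumption: the article does not describe billing granularity, minimum charges, or rounding.

```python
# List prices quoted in the article (USD).
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAM_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost, assuming simple linear pro-rating by hour."""
    rate = STT_STREAM_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated TTS cost, assuming simple linear pro-rating by character."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


# Example workload: 500 hours of audio and 2.5 million characters of speech.
print(f"Batch STT, 500 h:     ${stt_cost(500):.2f}")
print(f"Streaming STT, 500 h: ${stt_cost(500, streaming=True):.2f}")
print(f"TTS, 2.5M characters: ${tts_cost(2_500_000):.2f}")
```

Under these assumptions, 500 hours of audio comes to $50 in batch mode or $100 when streamed, and 2.5 million characters of synthesized speech comes to $10.50.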

TTS supports inline Speech Tags to control emotion and rhythm, such as `[laugh]`, `[sigh]`, and `[whisper]`.
