According to monitoring by Perceive Beating, xAI has launched two standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service, and are now exposed as separate endpoints that developers can integrate directly into voice assistants, real-time transcription, accessibility tools, podcasts, and more.
STT offers two modes: a REST API for batch transcription of large audio files with millisecond-level response times, and a WebSocket API for real-time speech streaming. Additional capabilities include word-level timestamps, speaker diarization, multi-channel separation and recognition, and Inverse Text Normalization, which automatically formats spoken numbers, dates, and currencies into structured text. The service supports 25+ languages and handles seamless language switching mid-conversation.
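To make word-level timestamps and speaker diarization concrete, here is a minimal Python sketch that folds a stream of timestamped words into speaker turns. The `Word` record and its fields (`text`, `start`, `end`, `speaker`) are hypothetical and chosen only for illustration; the article does not describe the actual response schema of the Grok STT API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one diarized, timestamped word (not xAI's schema).
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    speaker: str  # speaker label assigned by diarization


def to_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + w.text
            turns[-1]["end"] = w.end
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(
                {"speaker": w.speaker, "start": w.start, "end": w.end, "text": w.text}
            )
    return turns
```

Each resulting turn carries the start time of its first word and the end time of its last, which is usually what a meeting transcript or call log needs.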
xAI also published a set of Word Error Rate (WER, lower is better) comparisons. Overall, Grok scores 6.9%, ElevenLabs 9.0%, Deepgram 11.0%, and AssemblyAI 12.9%. The gap widens in "telephone conversation entity recognition," where Grok scores 5.0% against 12.0%, 13.5%, and 21.3% for the three competitors respectively. Grok also holds a slight lead in common business scenarios such as meetings, video podcasts, and phone calls. These figures were measured and published by xAI itself and have not yet been verified by a third party.
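For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The Python sketch below implements that standard definition; it is not xAI's evaluation pipeline, and the lowercasing and whitespace tokenization are simplifying assumptions (published benchmarks typically apply more elaborate text normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split stands in for real normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

With one wrong word out of five, as in `word_error_rate("call me at five pm", "call me at nine pm")`, the function returns 0.2, i.e. a 20% WER.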
On pricing, STT costs $0.10 per hour for batch processing and $0.20 per hour for streaming; TTS costs $4.20 per 1 million characters.
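As a rough illustration of what these rates mean in practice, the Python sketch below estimates costs from the quoted prices. It assumes that "per hour" refers to hours of audio and that billing is strictly linear, with no minimum charge, rounding increment, or volume discount; none of these details are stated in the report. The function names are hypothetical.

```python
from decimal import Decimal

# Rates as quoted in the report (USD).
STT_BATCH_PER_HOUR = Decimal("0.10")
STT_STREAMING_PER_HOUR = Decimal("0.20")
TTS_PER_MILLION_CHARS = Decimal("4.20")


def stt_cost(audio_hours, streaming=False):
    """Estimated STT cost, assuming linear billing per hour of audio."""
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    # Convert via str so float inputs like 0.5 do not carry binary noise.
    return Decimal(str(audio_hours)) * rate


def tts_cost(characters):
    """Estimated TTS cost, assuming linear billing per character."""
    return Decimal(characters) / Decimal(1_000_000) * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    print(f"500 h batch:     ${stt_cost(500):.2f}")
    print(f"500 h streaming: ${stt_cost(500, streaming=True):.2f}")
    print(f"2M characters:   ${tts_cost(2_000_000):.2f}")
```

Under those assumptions, 500 hours of audio would cost about $50 in batch mode or $100 in streaming mode, and synthesizing 2 million characters would cost about $8.40.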
TTS supports inline Speech Tags to control emotion and rhythm, such as `[laugh]`, `[sigh]`, `[whisper]`, `
