
Google releases Gemini 3.1 Flash TTS, letting developers direct how the AI speaks, much as a director would

According to monitoring by DataInsight Beating, Google has released its new-generation text-to-speech model, Gemini 3.1 Flash TTS. The key selling point is not that it sounds "more human-like," but that developers can precisely control the AI voice's style, speech rate, and emotional expression. The model is available through the Gemini API, Google AI Studio (developer preview), Vertex AI (enterprise preview), and Google Vids (for Workspace users).

This control rests on "audio tags": developers embed natural-language commands in the input text to adjust the AI voice's intonation, rhythm, and accent, and can even switch expressive styles mid-sentence. Google provides a "director's chair" style configuration interface in Google AI Studio, with three levels of control:

1. Scene Guidance: Set the environment and dialogue instructions to keep the character's personality consistent in multi-turn conversations.
2. Character-Level Tuning: Assign an independent audio configuration to each character, controlling speech rate, intonation, and accent individually.
3. One-Click Export: The tuned parameters can be directly exported as Gemini API code for reuse across different projects and platforms.
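The three levels above can be pictured as a small data model. The following is only a sketch under stated assumptions: the class names `Scene` and `CharacterVoice`, their fields, and the prompt layout are hypothetical illustrations and do not reflect the actual Google AI Studio interface or the code it exports.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CharacterVoice:
    # Hypothetical per-character settings; field names are illustrative only.
    name: str
    pace: str = "normal"
    intonation: str = "neutral"
    accent: str = "unspecified"


@dataclass
class Scene:
    # Hypothetical scene-level guidance shared by every character.
    setting: str
    direction: str
    characters: list[CharacterVoice] = field(default_factory=list)

    def to_prompt(self, lines: list[tuple[str, str]]) -> str:
        """Render scene guidance, voice notes, and dialogue as one text prompt."""
        known = {c.name for c in self.characters}
        parts = [f"Scene: {self.setting}", f"Direction: {self.direction}"]
        for c in self.characters:
            parts.append(
                f"Voice note for {c.name}: pace={c.pace}, "
                f"intonation={c.intonation}, accent={c.accent}"
            )
        for speaker, text in lines:
            if speaker not in known:
                raise ValueError(f"unknown speaker: {speaker}")
            parts.append(f"{speaker}: {text}")
        return "\n".join(parts)

    def export(self) -> str:
        """Serialize the tuned settings as JSON so they can be reused elsewhere."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


scene = Scene(
    setting="a quiet late-night radio studio",
    direction="warm and unhurried",
    characters=[CharacterVoice("Host", pace="slow", intonation="soft")],
)
print(scene.to_prompt([("Host", "Good evening, and welcome back.")]))
print(scene.export())
```

Keeping the scene and character settings in one serializable object is what makes reuse across projects straightforward: the exported JSON can be stored and loaded wherever the same voice style is needed.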

On the TTS leaderboard of the third-party evaluation firm Artificial Analysis, Gemini 3.1 Flash TTS ranks first with an Elo score of 1211, derived from thousands of blind human preference tests. Artificial Analysis also places it in its "Most Attractive Quadrant," meaning high voice quality at low cost. The model supports more than 70 languages and native multi-character dialogue, and all generated audio is embedded with a SynthID watermark so it can be identified as AI-generated.
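To give a sense of what an Elo gap means in a preference test, the sketch below uses the standard Elo expected-score formula. It assumes the leaderboard follows the conventional 400-point scale, which the article does not state; the function name `expected_preference` and the rival score of 1150 are hypothetical, chosen only for illustration.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score: the chance that A is preferred over B.

    Assumes the conventional 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1211 is the score reported in the article; 1150 is a made-up comparison point.
share = expected_preference(1211, 1150)
print(f"Expected share of listeners preferring the higher-rated model: {share:.1%}")
```

Under this formula, two models with equal ratings are each preferred half the time, and a 400-point lead corresponds to being preferred in roughly 10 of every 11 comparisons.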

For developers, this means TTS has shifted from a tool for "reading out text" to a programmable voice performance engine. Previously, creating emotional AI voices meant either relying on post-processing or painstakingly annotating text with SSML markup. Now a single natural-language sentence is enough. Combined with the one-click export feature, the same voice style can be reused across product lines, which is particularly useful for enterprises that need a unified brand voice.
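The contrast between the two approaches can be sketched as follows. The `<speak>` and `<prosody>` elements are standard SSML; the bracketed direction in `as_directed_text` is an assumed syntax used purely for illustration, since the article does not specify the model's actual audio tag format, and both function names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def as_ssml(text: str, rate: str, pitch: str) -> str:
    """Wrap text in a standard SSML <prosody> element with explicit parameters."""
    return (
        f"<speak><prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody></speak>"
    )


def as_directed_text(text: str, direction: str) -> str:
    """Prefix text with a natural-language direction.

    The bracket syntax is an assumption, not the model's documented tag format.
    """
    return f"[{direction}] {text}"


line = "We actually did it."
print(as_ssml(line, rate="slow", pitch="-2st"))
print(as_directed_text(line, "quietly at first, then with growing excitement"))
```

The SSML version requires the author to translate an emotion into numeric prosody settings, while the natural-language version states the intent directly and leaves the delivery to the model.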
