2026-04-15 · Google

Gemini 3.1 Flash TTS: the next generation of expressive AI speech



Source: DeepMind
Date: 2026-04-15
URL: https://deepmind.google/blog/gemini-3-1-flash-tts-the-next-generation-of-expressive-ai-speech/

Summary

Google released Gemini 3.1 Flash TTS, a text-to-speech model with controllable expressivity via inline audio tags: natural-language commands embedded in the text that give granular control over vocal style, pace, and accent, even mid-sentence. It scored an Elo of 1,211 on the Artificial Analysis TTS leaderboard (blind human preference, thousands of evaluations) and ranks in the “most attractive quadrant” for quality-cost balance. It supports 70+ languages with native multi-speaker dialogue, and all output carries a SynthID watermark. The model is available in preview via the Gemini API, AI Studio, Vertex AI, and Google Vids.

Implications

Audio tags as a new TTS control primitive. Inline natural-language audio tags (“speak this part slower,” “use an excited tone here”) are a qualitatively different interface from existing TTS APIs, which require separate SSML markup or prosody parameters. If the tags prove expressive enough, this could become the interface developers prefer for TTS, and ElevenLabs and OpenAI TTS will need to respond with comparable controllability.
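To make the idea of an inline control primitive concrete, here is a minimal sketch of how tagged text separates into style instructions and spoken words. The square-bracket syntax, the function names, and the "default" style label are all assumptions made for illustration; the source does not specify the actual tag format, and this is not the Gemini API.

```python
import re
from typing import List, Tuple

# Hypothetical syntax: an instruction in square brackets applies to the text
# that follows it, until the next tag. This is an illustrative assumption,
# not the documented Gemini audio tag format.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_text(text: str) -> List[Tuple[str, str]]:
    """Split text into (style, spoken_text) segments.

    Text that appears before the first tag gets the style "default".
    """
    segments: List[Tuple[str, str]] = []
    style = "default"
    position = 0
    for match in TAG_PATTERN.finditer(text):
        spoken = text[position:match.start()].strip()
        if spoken:
            segments.append((style, spoken))
        style = match.group(1).strip()
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((style, tail))
    return segments


def strip_tags(text: str) -> str:
    """Return only the words that would be spoken, with tags removed."""
    return " ".join(spoken for _, spoken in split_tagged_text(text))


prompt = "Welcome back. [excited] We have big news today! [slower] Please listen carefully."
for style, spoken in split_tagged_text(prompt):
    print(f"{style:>8}: {spoken}")
```

The point of the sketch is that the style change travels inside the same string as the words, so the author writes one prompt instead of maintaining SSML markup or a separate set of prosody parameters per segment.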

Elo 1,211 on Artificial Analysis is a credible third-party benchmark. Artificial Analysis is an independent evaluation platform, not a Google-run leaderboard. Placement in its “most attractive quadrant” for quality-cost balance means Flash TTS is competitive on the two dimensions enterprises actually optimize for: quality and cost. That is a stronger claim than internal evals can support.
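For a sense of what an Elo gap means in blind preference terms, the sketch below applies the standard logistic Elo formula with a 400-point scale. Whether Artificial Analysis computes its ratings exactly this way is an assumption, and the rival rating of 1,111 is a made-up number chosen only to show a 100-point gap; only the 1,211 figure comes from the source.

```python
def expected_preference(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


flash_tts_rating = 1211.0      # reported in the source
hypothetical_rival = 1111.0    # invented for illustration, not a real model's score

share = expected_preference(flash_tts_rating, hypothetical_rival)
print(f"Expected preference share at a 100-point gap: {share:.2f}")
```

Under these assumptions, a 100-point lead corresponds to being preferred in roughly 64% of head-to-head comparisons, which is a clear but not overwhelming margin.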

Native multi-speaker dialogue in 70+ languages is the podcast/video production unlock. Generating multiple speakers without stitching audio together in post-processing is what enables automated dialogue production (podcast creation, video dubbing, accessibility narration) at scale. The language coverage makes this relevant for global content operations.

Watch:

Whether audio tag controllability holds for complex prosody requests in non-English languages
ElevenLabs and OpenAI response: both have strong TTS offerings and are likely to answer a competitive Elo challenge
Google Vids integration adoption: that’s the consumer surface where TTS becomes a mainstream creative tool
