2025-12-12 · Google

Improved Gemini audio models for powerful voice experiences




Source: DeepMind
Date: 2025-12-12
URL: https://deepmind.google/blog/improved-gemini-audio-models-for-powerful-voice-experiences/

Summary

Google updated Gemini 2.5 Flash Native Audio with three improvements: instruction-following adherence rose from 84% to 90%, function calling for external data integration got sharper (71.5% on ComplexFuncBench Audio, leading the benchmark), and multi-turn conversations retain context more smoothly. Live speech-to-speech translation in 70+ languages launched in Google Translate for Android in the US, Mexico, and India. The updated model is available on Vertex AI (GA) and the Gemini API (preview).

Implications

The function-calling + voice combination is the agentic audio play. The 71.5% score on ComplexFuncBench Audio measures multi-step function calling, and that is the capability that matters for production voice agents: the ability to query external data sources in the middle of an audio conversation. It is what turns a voice assistant into a voice agent.
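To make the idea concrete, here is a minimal sketch of the application-side half of function calling: the model emits a structured call, and the application runs it against an external data source and returns the result. This is not the Gemini API. The registry, the `get_order_status` tool, and the `{"name": ..., "args": ...}` call shape are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool registry. Names and handlers are illustrative only
# and do not correspond to any real Gemini or Vertex AI interface.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so a model-requested call can find it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> dict:
    # Stand-in for an external data source (database, REST API, ...).
    return {"order_id": order_id, "status": "shipped"}


def dispatch(call: dict) -> dict:
    """Run one function call and package the result or the error."""
    name = call.get("name")
    handler = TOOLS.get(name)
    if handler is None:
        return {"name": name, "error": "unknown tool"}
    try:
        return {"name": name, "result": handler(**call.get("args", {}))}
    except TypeError as exc:
        # Missing or unexpected arguments are reported, not raised.
        return {"name": name, "error": str(exc)}


def run_turn(calls: List[dict]) -> List[dict]:
    """Handle every call requested in one conversational turn, in order."""
    return [dispatch(c) for c in calls]
```

In a multi-step exchange, the results from `run_turn` would be handed back to the model, which may request further calls before it speaks. Returning errors as data rather than raising keeps the conversation going when the model asks for an unknown tool or passes bad arguments.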

90% instruction following is the enterprise unlock. Going from 84% to 90% instruction adherence sounds incremental, but it cuts the failure rate from 16% to 10%, a 37.5% relative reduction. For enterprise voice deployments where instructions encode compliance and workflow constraints, that is the difference between “useful demo” and “deployable product.”

Live translation in Translate is the consumer wedge. Rolling out 70+ language live speech translation to Google Translate (hundreds of millions of users) is the distribution play. This is not a developer API announcement; it puts audio AI capabilities in front of consumers at scale.

Watch:

  • ComplexFuncBench Audio adoption by independent evaluators: does the benchmark hold up as a standard?
  • Whether Vertex AI audio API adoption accelerates in contact center and customer service verticals
  • How ElevenLabs, OpenAI TTS, and Microsoft Azure respond with competing audio APIs
