2026-03-26 · Google

Gemini 3.1 Flash Live: Making audio AI more natural and reliable

Source: DeepMind · Date: 2026-03-26 · URL: https://deepmind.google/blog/gemini-3-1-flash-live-making-audio-ai-more-natural-and-reliable/

Summary

Google launched Gemini 3.1 Flash Live, its highest-quality audio model, with 90.8% on ComplexFuncBench Audio (multi-step function calling) and 36.1% on Audio MultiChallenge with thinking enabled. The model doubles conversation memory length versus its predecessor, dynamically adjusts tone to user emotional signals (frustration, confusion), and supports 200+ countries. All audio is SynthID-watermarked.

Implications

The 90.8% on ComplexFuncBench Audio is the number that matters for voice agent builders. Multi-step function calling in voice (“book me a restaurant for 8 PM, invite Sarah, and set a reminder an hour before”) is the capability that separates voice assistants from voice search. A score of 90.8% suggests this is reliable enough to build production voice agent workflows on, not just demos.
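
To make "multi-step" concrete, here is a minimal sketch of what an agent has to do with the restaurant request once the model has turned speech into tool calls. Everything in it is an assumption for illustration: the three tool functions, the plan format, and the `$step.field` reference syntax are hypothetical and are not part of the Gemini API. The point is that later calls depend on the results of earlier ones, so one wrong call breaks the chain.

```python
from typing import Any, Callable

# Hypothetical tools: names, signatures, and return values are illustrative only.
def book_restaurant(time: str) -> dict:
    return {"booking_id": "b1", "time": time}

def invite_guest(booking_id: str, guest: str) -> dict:
    return {"booking_id": booking_id, "invited": guest}

def set_reminder(time: str, minutes_before: int) -> dict:
    return {"reminder_for": time, "minutes_before": minutes_before}

TOOLS: dict[str, Callable[..., dict]] = {
    "book_restaurant": book_restaurant,
    "invite_guest": invite_guest,
    "set_reminder": set_reminder,
}

def resolve(value: Any, results: dict[str, dict]) -> Any:
    """Replace a '$step.field' reference with the output of an earlier step."""
    if isinstance(value, str) and value.startswith("$"):
        step_id, field = value[1:].split(".")
        return results[step_id][field]
    return value

def run_plan(plan: list[dict]) -> dict[str, dict]:
    """Execute tool calls in order, feeding earlier results into later arguments."""
    results: dict[str, dict] = {}
    for step in plan:
        args = {k: resolve(v, results) for k, v in step["args"].items()}
        results[step["id"]] = TOOLS[step["name"]](**args)
    return results

# The plan a model might emit for the spoken request (assumed format).
PLAN = [
    {"id": "book", "name": "book_restaurant", "args": {"time": "20:00"}},
    {"id": "invite", "name": "invite_guest",
     "args": {"booking_id": "$book.booking_id", "guest": "Sarah"}},
    {"id": "remind", "name": "set_reminder",
     "args": {"time": "$book.time", "minutes_before": 60}},
]

results = run_plan(PLAN)
```

Here the invite and the reminder both reuse values produced by the booking step, which is the dependency structure that makes multi-step calling harder than issuing three independent commands.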

Dynamic emotional adjustment is a naturalness signal, not a sentiment feature. Recognizing frustration and adjusting tone is table stakes for customer service voice AI. The technical claim is that the model adjusts without an explicit prompt: it infers the user's state from acoustic cues, not from the user typing “I’m frustrated.” That’s the difference between a voice wrapper and a genuinely conversational model.

Doubled conversation memory is the claim that matters for long sessions. Voice sessions break when a model forgets what was said three exchanges ago. Doubling memory length extends the viable session duration, which matters for sales calls, medical consultations, and customer service escalations, where the full conversation is the context the model has to work from.
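
The source does not say how the model's conversation memory is implemented. As a toy model only, assume a fixed budget where the oldest turns are dropped first, with word counts standing in for tokens. Under that assumption, the sketch below shows why doubling the budget changes what a session can still refer back to.

```python
from collections import deque

def retained_turns(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined size fits in the budget.

    Size is measured in whitespace-separated words, a stand-in for tokens.
    """
    kept: deque[str] = deque()
    used = 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = len(turn.split())
        if used + cost > budget:
            break                     # this turn and everything older is dropped
        kept.appendleft(turn)
        used += cost
    return list(kept)

TURNS = [
    "my order number is 4417",
    "it arrived damaged",
    "I want a refund",
    "to the original card please",
]

small = retained_turns(TURNS, budget=9)    # keeps only the last two turns
large = retained_turns(TURNS, budget=18)   # keeps all four, including the order number
```

With the smaller budget the order number has already fallen out of the window by the time the refund is requested; with double the budget it is still available.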

Watch:

Audio MultiChallenge 36.1% with thinking: what’s the baseline without thinking, and what’s the latency cost of enabling it?
SynthID audio watermarking: when do third-party audio platforms (Spotify, podcast hosts) begin using the detection API to flag AI-generated audio?
Enterprise voice AI adoption: which contact center platforms (Five9, NICE, Genesys) integrate Flash Live as the real-time voice backbone?
