Live dialogue
Fluid and natural live dialogue and translation capabilities, for powerful voice-first applications
Best for low-latency dialogue with a fluid, natural vocal rhythm. Solves complex tasks while recognizing nuances in voices like pitch and pace.
Agents can use tools and retrieve real-time information from sources like Google Search, or even custom developer-built apps, making conversations more useful.
Knows the difference between direct engagement and background chatter. Understands the rhythm of speech, knowing precisely when to talk – and when to stay silent.
Maintains specific personas and follows directions during long interactions. Even in complex, winding conversations, the model stays on track – and in character.
Handles complex alphanumeric data – like flight codes or pricing – reliably and accurately.
Delivers fluid speech-to-speech translation across 70+ languages and 2,000 language pairs, dissolving communication barriers in real-time.
Preserves the original speaker’s intonation, pacing and pitch, to convey not just what’s said, but how it’s said.
Understands multiple languages in a single session, to help you follow multilingual conversations without changing any settings.
Identifies the language being spoken and starts translating – so you don’t need to figure it out yourself.
Filters out ambient noise so you can hold conversations comfortably, even in loud outdoor environments.
Integrating Google Gemini’s 3.1 audio models has improved The Home Depot’s contact center experience. This conversational AI enables natural, intuitive dialogue, moving away from rigid scripts to foster deeper engagement. The platform accurately captures complex details—like alphanumeric product codes—even in noisy environments. And real-time translations allow our customers to switch languages seamlessly, making our "orange-apron" expertise more accessible and scalable than ever.
Using Gemini 3.1 Flash Live in our Miri AI makeup and outfit assistant has noticeably raised the bar for real-time conversational interactions — instruction adherence in long 20+ turn sessions improved by ~25% over 2.5 Flash, and tool calls for memory recall and structured entry executed reliably in our tests. Voice responses feel significantly more natural, with clearer sentence-level intonation and stable persona performance, making a real difference in how users engage with the app.
The biggest win with the new Gemini 3.1 audio model is consistency over time. Longer sessions stay on-persona with far less speaker drift, and the experience feels both more natural and more dependable for production-style voice agents.
Google's Audio-to-Audio capability moved the sound of our Virtual Agents to more natural sounding agents. It enabled us to have more control of its pitch and intonation. And just moving to Audio-to-Audio removes the huge latency we experience when the Virtual Agents have plenty to convey back to the customers. Planning ahead, we've seen the preview of the Gemini 3.1 voice model, and this would just further improve the experience, towards a more natural conversation that Verizon is striving for.
YouTube is transforming its contact center operations by deploying Google’s Gemini CX Agent Studio. Leveraging the Gemini 3.1 Flash Live native audio model, the platform now delivers ultra-low latency and natural voice interactions, delivering positive customer engagement and significantly increasing support capacity. This AI-driven evolution enhances the entire lifecycle—from acquisition to retention—by ensuring rapid, high-quality resolutions. Notably, during high-demand events like NFL Sunday Ticket, viewers now receive near-instantaneous support through kickoff, proving that YouTube can deliver reliable, top-tier service at any scale.
Maximum possible score is 100%. Axis scaled to 40% to highlight the differences between models. Methodology: All results sourced from Scale. The visual showcases only audio output models.
Data as of March 26, 2026
This multi-turn benchmark assesses the conversational proficiency of audio-language models and spoken dialogue systems, including speech-to-speech variants. It evaluates their capacity to follow instructions, maintain self-consistency, integrate previous context, and manage natural speech corrections throughout long-form dialogues.
The original text prompts were converted to audio, and function-calling accuracy was then evaluated according to the original benchmark methodology.
Data as of March 26, 2026
This static-context, multi-turn benchmark (ComplexFuncBench) measures the model's ability to perform a sequence of interdependent function calls related to travel booking. Because it was originally a text-to-text evaluation, we synthesized audio for each prompt and then used the published scoring apparatus to evaluate the performance of the Gemini realtime API.
Methodology: All results sourced from Artificial Analysis.
Data as of March 26, 2026
This single-turn benchmark consists of 1,000 audio recordings, each pairing an audio clip (ranging from speech to natural sounds) with a text question. It measures five audio comprehension skills: audio captioning, speech understanding, audio scene understanding, accent/language identification, and sound recognition.
The fastest path from prompt to production
Get started with cutting-edge AI models
Low-latency, real-time voice and video interactions with Gemini
Deploy specialized agents for product discovery, shopping, and customer service