Live dialogue
Fluid and natural live dialogue and translation capabilities, for powerful voice-first applications
Best for low-latency, fluid and natural vocal rhythm. Solves complex tasks while recognizing nuances in voices like pitch and pace.
Best for real-time speech-to-speech translation, overcoming language barriers across 70+ languages.
Agents can use tools and retrieve real-time information from sources like Google Search, or even custom developer-built apps, making conversations more useful.
Knows the difference between direct engagement and background chatter. Understands the rhythm of speech, knowing precisely when to talk – and when to stay silent.
Maintains specific personas and follows directions during long interactions. Even in complex, winding conversations, the model stays on track – and in character.
Handles complex alphanumeric data – like flight codes or pricing – reliably and accurately.
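The tool use described above can be pictured as a small dispatch loop on the developer's side: the model asks for a tool by name, and the application runs the matching function and returns the result. The sketch below is illustrative only. The registry, the `get_flight_status` tool, and the `{"name": ..., "args": {...}}` call shape are all hypothetical and are not taken from the Gemini API.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry: maps a tool name to a plain Python callable.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register fn under its own name so a model-issued call can find it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_flight_status(flight_code: str) -> dict:
    # Stand-in for a real lookup; returns canned data for illustration.
    return {"flight_code": flight_code, "status": "on time"}


def dispatch(call: dict) -> str:
    """Run one tool call of the assumed shape {"name": ..., "args": {...}}.

    Returns a JSON string that the application would send back to the model.
    """
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps({"result": fn(**call["args"])})


if __name__ == "__main__":
    print(dispatch({"name": "get_flight_status", "args": {"flight_code": "UA123"}}))
```

Returning an error payload for unknown tool names, rather than raising, lets the conversation continue when the model requests something the application does not provide.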
Delivers fluid speech-to-speech translation across 70+ languages and 2,000 language pairs.
Preserves the speaker’s original intonation, pacing and pitch to capture not just what they said, but how they said it.
Translates multiple languages in a single session – no need to change the settings.
Identifies the language being spoken and begins translation, without being told what it is.
Filters out ambient noise so audio stays crisp and clear, even in loud outdoor environments.
Minimizes processing lag to translate speech instantly and eliminate awkward pauses – keeping conversations flowing naturally.
We’ve always been focused on building user-first products that address real-world problems, and we’re excited to partner with Google DeepMind to explore novel ways to overcome language barriers between our passengers and driver partners. Testing Gemini 3.5 Live Translate, we’ve been impressed by its ability to auto-detect multiple languages, and translate speech accurately with low latency.
Gemini 3.5 Live Translate makes multilingual voice effortless. I built a demo on LiveKit Agents where everyone speaks their own language and understands each other live.
Gemini 3.5 Live Translate paired with Fishjam’s MoQ protocol sets a new frontier for real-time multimedia streaming, allowing speech-to-speech translation into 90 languages.
We tested the Gemini 3.5 Live Translate model at Agora and, in our opinion, it provided SOTA results, with low latency and high accuracy that set a new bar for real-time translation.
Integrating Google Gemini’s 3.1 audio models has improved The Home Depot’s contact center experience. This conversational AI enables natural, intuitive dialogue, moving away from rigid scripts to foster deeper engagement. The platform accurately captures complex details—like alphanumeric product codes—even in noisy environments. And real-time translations allow our customers to switch languages seamlessly, making our "orange-apron" expertise more accessible and scalable than ever.
Using Gemini 3.1 Flash Live in our Miri AI makeup and outfit assistant has noticeably raised the bar for real-time conversational interactions — instruction adherence in long 20+ turn sessions improved by ~25% over 2.5 Flash, and tool calls for memory recall and structured entry executed reliably in our tests. Voice responses feel significantly more natural, with clearer sentence-level intonation and stable persona performance, making a real difference in how users engage with the app.
Google's Audio-to-Audio capability moved the sound of our Virtual Agents to more natural sounding agents. It enabled us to have more control of its pitch and intonation. And just moving to Audio-to-Audio removes the huge latency we experience when the Virtual Agents have plenty to convey back to the customers. Planning ahead, we've seen the preview of the Gemini 3.1 voice model, and this would just further improve the experience, towards a more natural conversation that Verizon is striving for.
YouTube is transforming its contact center operations by deploying Google’s Gemini CX Agent Studio. Leveraging the Gemini 3.1 Flash Live native audio model, the platform now delivers ultra-low latency and natural voice interactions, delivering positive customer engagement and significantly increasing support capacity. This AI-driven evolution enhances the entire lifecycle—from acquisition to retention—by ensuring rapid, high-quality resolutions. Notably, during high-demand events like NFL Sunday Ticket, viewers now receive near-instantaneous support through kickoff, proving that YouTube can deliver reliable, top-tier service at any scale.
Maximum possible score is 100%. Axis scaled to 40% to highlight the differences between models. Methodology: All results sourced from Scale. The visual showcases only audio output models.
Data as of March 26, 2026
This multi-turn benchmark assesses the conversational proficiency of audio-language models and spoken dialogue systems, including speech-to-speech variants. It evaluates their capacity to follow instructions, maintain self-consistency, integrate previous context, and manage natural speech corrections throughout long-form dialogues.
The original text prompts were converted to audio, and function-calling accuracy was then evaluated according to the original benchmark methodology.
Data as of March 26, 2026
This static-context, multi-turn benchmark measures the model's ability to perform a sequence of interdependent function calls related to travel booking. Since this was originally a text-to-text evaluation, we synthesized audio for each prompt, then used the published scoring apparatus to evaluate the performance of the Gemini realtime API. More details on ComplexFuncBench can be found here.
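To make the idea of scoring interdependent function calls concrete, here is a minimal sketch of one possible in-order comparison. It is not the ComplexFuncBench scorer and not the method used for the results above. The call shape, the exact-match rule, and the stop-at-first-mismatch policy are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List

# Assumed call shape for this sketch: {"name": str, "args": dict}
Call = Dict[str, Any]


def call_matches(predicted: Call, expected: Call) -> bool:
    """Exact match on tool name and arguments (a deliberately strict rule)."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("args") == expected.get("args")
    )


def sequence_accuracy(predicted: List[Call], expected: List[Call]) -> float:
    """Fraction of expected calls matched in order.

    Counting stops at the first mismatch, because later calls in an
    interdependent sequence rely on the results of earlier ones.
    """
    if not expected:
        return 1.0
    matched = 0
    for p, e in zip(predicted, expected):
        if not call_matches(p, e):
            break
        matched += 1
    return matched / len(expected)


if __name__ == "__main__":
    expected = [
        {"name": "search_flights", "args": {"to": "SFO"}},
        {"name": "book_flight", "args": {"flight_id": "F1"}},
    ]
    predicted = [
        {"name": "search_flights", "args": {"to": "SFO"}},
        {"name": "book_flight", "args": {"flight_id": "F2"}},
    ]
    print(sequence_accuracy(predicted, expected))
```

A real scorer would likely tolerate equivalent argument values and partial credit; the strict rule here only keeps the example short.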
Methodology: All results sourced from Artificial Analysis.
Data as of March 26, 2026
This single-turn benchmark consists of 1,000 items, each pairing an audio clip (ranging from speech to natural sounds) with a text question. It measures five audio comprehension skills: audio captioning, speech understanding, audio scene understanding, accent/language identification, and sound recognition.
| Name | 3.1 Flash Live | 3.5 Live Translate |
|---|---|---|
| Status | Preview | Preview |
| Input | | |
| Output | | |
| Input tokens | 128k | 128k |
| Output tokens | 64k | 64k |
| Knowledge cutoff | January 2025 | January 2025 |
| Availability | | |
| Documentation | View developer docs | View developer docs |
| Model card | View model card | View model card |