Gemini Audio

Live dialogue

Fluid and natural live dialogue and translation capabilities, for powerful voice-first applications

Models

Create agents capable of handling complex tasks and using tools, while engaging in natural-sounding conversations.

3.1 Flash Live

Best for low-latency, fluid and natural vocal rhythm. Solves complex tasks while recognizing nuances in voices like pitch and pace.

Try it in Google AI Studio.

3.5 Live Translate

Best for real-time speech-to-speech translation, overcoming language barriers across 70+ languages.

Try it in Google AI Studio.

Capabilities

Build natural-sounding and reliable voice agents with our live dialogue models.

Real-time action

Agents can use tools and retrieve real-time information from sources like Google Search, or even custom developer-built apps, making conversations more useful.
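
As a rough illustration of how an application might route a model's tool request to developer-built code, here is a minimal sketch. It is not the Gemini API: the registry, the `get_flight_status` function, and the `{"name": ..., "args": ...}` request shape are all hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict

# Hypothetical registry of developer-built tools, keyed by name.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so it can be looked up by its name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_flight_status(flight_code: str) -> dict:
    # Stand-in for a real lookup; the returned data is made up.
    return {"flight_code": flight_code.upper(), "status": "on time"}


def handle_tool_call(call: dict) -> dict:
    """Dispatch one request of the assumed shape {"name": str, "args": dict}.

    Returns a dict holding either a "result" or an "error", which the
    application would pass back to the model.
    """
    name = call.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return {"name": name, "error": f"unknown tool: {name}"}
    try:
        return {"name": name, "result": fn(**call.get("args", {}))}
    except TypeError as exc:  # wrong or missing arguments
        return {"name": name, "error": str(exc)}


response = handle_tool_call(
    {"name": "get_flight_status", "args": {"flight_code": "ua123"}}
)
print(response)
```

Returning errors as data rather than raising keeps the conversation going: the agent can explain the failure or retry instead of dropping the session.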

Proactive listening

Knows the difference between direct engagement and background chatter. Understands the rhythm of speech, knowing precisely when to talk – and when to stay silent.

Consistent responses

Maintains specific personas and follows directions during long interactions. Even in complex, winding conversations, the model stays on track – and in character.

Alphanumeric accuracy

Handles complex alphanumeric data – like flight codes or pricing – reliably and accurately.

Live speech translation

Uses Gemini’s speech-to-speech translation capabilities to break down language barriers.

Broad language coverage

Delivers fluid speech-to-speech translation across 70+ languages and 2,000 language pairs.
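
The two figures can be related with simple counting. The sketch below assumes a "language pair" means one source and one target language, and uses 70 as the lower bound implied by "70+". The page does not say how pairs are counted, so both conventions are shown.

```python
from math import comb


def directed_pairs(n: int) -> int:
    """Ordered (source, target) pairs: A->B and B->A count separately."""
    return n * (n - 1)


def undirected_pairs(n: int) -> int:
    """Unordered pairs: A<->B counts once."""
    return comb(n, 2)


n_languages = 70  # assumed lower bound taken from "70+"
print(directed_pairs(n_languages))    # 4830
print(undirected_pairs(n_languages))  # 2415
```

Both totals exceed 2,000, so the quoted figure may reflect a specific supported subset or a different counting rule. The page does not specify which.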

Consistent intonation

Preserves the speaker’s original intonation, pacing and pitch to capture not just what they said, but how they said it.

Multilingual input

Translates multiple languages in a single session – no need to change the settings.

Automatic language detection

Identifies the language being spoken and begins translation, without being told what it is.

Noise robustness

Filters out ambient noise so audio stays crisp and clear, even in loud outdoor environments.

Real-time latency

Minimizes processing lag to translate speech instantly and eliminate awkward pauses – keeping conversations flowing naturally.

Showcase

We’ve always been focused on building user-first products that address real-world problems, and we’re excited to partner with Google DeepMind to explore novel ways to overcome language barriers between our passengers and driver partners. Testing Gemini 3.5 Live Translate, we’ve been impressed by its ability to auto-detect multiple languages, and translate speech accurately with low latency.

Philipp Kandal, Chief Product Officer

Grab

Gemini 3.5 Live Translate makes multilingual voice effortless. I built a demo on LiveKit Agents where everyone speaks their own language and understands each other live.

Jesse Hall, Staff Developer Advocate

LiveKit

Gemini 3.5 Live Translate paired with Fishjam’s MoQ protocol sets a new frontier for real-time multimedia streaming, allowing speech-to-speech translation into 90 languages.

Maciej Rys, VP of Engineering

Software Mansion

We tested the Gemini 3.5 Live Translate model at Agora and, in our opinion, it provided SOTA results, with low latency and high accuracy that set a new bar for real-time translation.

Mason Adams, Developer Evangelist

Agora

Integrating Google Gemini’s 3.1 audio models has improved The Home Depot’s contact center experience. This conversational AI enables natural, intuitive dialogue, moving away from rigid scripts to foster deeper engagement. The platform accurately captures complex details—like alphanumeric product codes—even in noisy environments. And real-time translations allow our customers to switch languages seamlessly, making our "orange-apron" expertise more accessible and scalable than ever.

The Home Depot

Using Gemini 3.1 Flash Live in our Miri AI makeup and outfit assistant has noticeably raised the bar for real-time conversational interactions — instruction adherence in long 20+ turn sessions improved by ~25% over 2.5 Flash, and tool calls for memory recall and structured entry executed reliably in our tests. Voice responses feel significantly more natural, with clearer sentence-level intonation and stable persona performance, making a real difference in how users engage with the app.

Orhan Dalgara, Co-Founder

Wavera

Google's Audio-to-Audio capability moved the sound of our Virtual Agents to more natural sounding agents. It enabled us to have more control of its pitch and intonation. And just moving to Audio-to-Audio removes the huge latency we experience when the Virtual Agents have plenty to convey back to the customers. Planning ahead, we've seen the preview of the Gemini 3.1 voice model, and this would just further improve the experience, towards a more natural conversation that Verizon is striving for.

Verizon

YouTube is transforming its contact center operations by deploying Google’s Gemini CX Agent Studio. Leveraging the Gemini 3.1 Flash Live native audio model, the platform now delivers ultra-low latency and natural voice interactions, delivering positive customer engagement and significantly increasing support capacity. This AI-driven evolution enhances the entire lifecycle—from acquisition to retention—by ensuring rapid, high-quality resolutions. Notably, during high-demand events like NFL Sunday Ticket, viewers now receive near-instantaneous support through kickoff, proving that YouTube can deliver reliable, top-tier service at any scale.

YouTube

Performance

Our live dialogue models can be used to build powerful, low-latency voice applications that are capable of complex reasoning and suited to high-performance, reliable deployment.

Audio MultiChallenge

This multi-turn benchmark assesses the conversational proficiency of audio-language models and spoken dialogue systems, including speech-to-speech variants. It evaluates their capacity to follow instructions, maintain self-consistency, integrate previous context, and manage natural speech corrections throughout long-form dialogues.

Maximum possible score is 100%. Axis scaled to 40% to highlight the differences between models. Methodology: All results sourced from Scale. The visual showcases only audio output models.

Data as of March 26, 2026

ComplexFuncBench Audio

This static-context, multi-turn benchmark measures the model's ability to perform a sequence of interdependent function calls related to travel booking. Since this was originally a text-to-text evaluation, we synthesized audio for each prompt, then used the published scoring apparatus to evaluate the performance of the Gemini realtime API. More details on ComplexFuncBench can be found here.

Original text prompts were converted to audio, and function calling accuracy was then evaluated according to the original benchmark methodology.

Data as of March 26, 2026

Big Bench Audio

This single-turn benchmark consists of 1,000 audio recordings that pair an audio clip (ranging from speech to natural sounds) with a text question. It measures five diverse audio comprehension skills: audio captioning, speech understanding, audio scene understanding, accent/language identification, and sound recognition.

Methodology: All results sourced from Artificial Analysis.

Data as of March 26, 2026

Model information

Name	3.1 Flash Live	3.5 Live Translate
Status	Preview	Preview
Input	Text, Image, Video, Audio	Audio
Output	Text, Audio	Text, Audio
Input tokens	128k	128k
Output tokens	64k	64k
Knowledge cutoff	January 2025	January 2025
Availability	Gemini App, Google AI Studio, Gemini API, Google Antigravity, NotebookLM	Google Translate, Google AI Studio, Gemini API
Documentation	View developer docs	View developer docs
Model card	View model card	View model card

Try Live dialogue

Google AI Studio

The fastest path from prompt to production

Gemini API

Get started with cutting-edge AI models 

Gemini Live API

Low-latency, real-time voice and video interactions with Gemini

Gemini Enterprise for Customer Experience

Deploy specialized agents for product discovery, shopping, and customer service

Google Translate

Understand your world and communicate across languages

