Built from the ground up to be natively multimodal.


Live voice agents

Enable effective, real-time communication with Gemini 2.5 Flash Native Audio.

Real-time action

Uses tools and calls functions directly from a conversation. That means it can pull in real-time information from sources like Google Search, or invoke custom developer-built tools, making conversations more practical.
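
To make this concrete, here is a minimal sketch of a live agent answering a developer-defined function call, assuming the google-genai Python SDK and its Live API. The model id, the get_weather declaration, and the stubbed weather result are illustrative assumptions, not fixed parts of the product.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# A hypothetical developer-built tool the agent is allowed to call.
get_weather = {
    "name": "get_weather",
    "description": "Returns the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[
        types.Tool(function_declarations=[get_weather]),
        types.Tool(google_search=types.GoogleSearch()),  # real-time information
    ],
)

async def main():
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview-09-2025",  # assumed model id
        config=config,
    ) as session:
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="What's the weather like in Lagos right now?")],
            )
        )
        async for message in session.receive():
            if message.tool_call:
                # The model paused to call our tool; send back a (stubbed) result
                # so it can fold the answer into its spoken response.
                await session.send_tool_response(
                    function_responses=[
                        types.FunctionResponse(
                            id=fc.id, name=fc.name, response={"weather": "sunny, 31°C"}
                        )
                        for fc in message.tool_call.function_calls
                    ]
                )

asyncio.run(main())
```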

Conversation context awareness

Knows the difference between direct engagement and background chatter. It understands the rhythm of your speech, knowing precisely when to respond – and when to stay silent.

Robust steerability

Maintain specific personas and guidelines throughout long interactions. Even in complex, winding conversations, the model stays on track – and in character.

Delivering sharper function calling, robust instruction following, and smoother conversations to power the next generation of live voice agents.

Three bar charts comparing performance across models, with vertical axes from 0% to 100%. Function calling accuracy (ComplexFuncBench audio): 2.5 Flash Native Audio 12-2025 at 71.5% (the highest score), 2.5 Flash Native Audio 09-2025 at 66.0%, and gpt-realtime-2025-08-28 at 66.5%. Adherence to developer instructions: 2.5 Flash Native Audio 12-2025 at 90% (the highest score) and 2.5 Flash Native Audio 09-2025 at 84%. Overall conversational quality: 2.5 Flash Native Audio 12-2025 at 83% (the highest score) and 2.5 Flash Native Audio 09-2025 at 62%.

Expressive speech generation

Craft expressive narratives with granular control over style, tone, and performance using Gemini 2.5 Flash and Pro Text-to-Speech.
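
A minimal sketch of style-controlled speech generation, assuming the google-genai Python SDK. The TTS model id and the "Kore" voice are example choices, and the style direction is simply part of the prompt; the API returns raw 24 kHz PCM, which the snippet wraps in a WAV container.

```python
import wave
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed TTS model id
    contents="Say warmly, slowing down on the last phrase: Welcome back. We missed you.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# The response carries raw PCM audio; write it out as 24 kHz, 16-bit mono WAV.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("greeting.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```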

Multi-speaker generation

Generate engaging two-person conversations from a single text input. Create podcasts, interviews, or interactive scenarios with distinct character voices.
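
A sketch of the two-speaker case under the same assumptions: the script labels each line with a speaker name, and a multi-speaker voice configuration maps those names to prebuilt voices ("Kore" and "Puck" are arbitrary picks here).

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

script = (
    "Host: Welcome to the show. Today we're digging into native audio models.\n"
    "Guest: Thanks for having me. The part I find most exciting is live translation."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed TTS model id
    contents=script,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Host",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Guest",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                ]
            )
        ),
    ),
)

audio_pcm = response.candidates[0].content.parts[0].inline_data.data  # raw 24 kHz PCM
```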


Live speech translation

Break down language barriers using Gemini’s speech-to-speech translation capabilities.

Language coverage

Gemini’s world knowledge and multilingual capabilities, combined with its native audio processing, allow it to translate speech across more than 70 languages and 2,000 language pairs.
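
The page does not prescribe an API shape for this, but one plausible setup is a Live API session steered into an interpreter role with a system instruction, reusing the connection pattern from the live-agent sketch above. The model id, target language, and prompt wording below are assumptions.

```python
from google.genai import types

# Assumed configuration: pin the session to a translator role so that whatever
# language the speaker uses, the audio that comes back is English.
translation_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=types.Content(parts=[types.Part(
        text=(
            "You are a live interpreter. Translate everything the user says into "
            "English, preserving their tone, pacing, and intent. Do not add commentary."
        )
    )]),
)

# client.aio.live.connect(model="gemini-2.5-flash-native-audio-preview-09-2025",
#                         config=translation_config) would then stream microphone
# audio in and translated speech out, as in the live-agent example above.
```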

Style transfer

Preserves the original speaker’s intonation, pacing, and pitch, adding depth that conveys not just what is said, but how it is said.

Multilingual input

Understands multiple languages simultaneously in a single session, helping you follow multilingual conversations without changing any settings.

Automatic language detection

Identifies the languages being spoken and starts translating – so you don’t need to figure it out yourself.

Noise robustness

Filters out ambient noise so you can hold a conversation comfortably – even in loud, outdoor environments.


Audio understanding

Unlock insights directly from audio files with Gemini’s audio capabilities.

A diagram illustrating the process of converting audio into structured data. On the left, a blue sound wave icon represents the audio input. A horizontal arrow points to the right, leading to a blue grid icon composed of four squares, representing the data output.

Turn audio into data

Transform unstructured audio – like voice notes, support calls, or lectures – into clean, actionable output such as JSON, summaries, or lists of action items.
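
As a sketch of that workflow, assuming the google-genai Python SDK: upload a recording, then constrain the response to a JSON schema. The file name, schema fields, and model id are illustrative.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Upload a local recording (illustrative file name).
audio = client.files.upload(file="support_call.mp3")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Summarize this support call and list the follow-up actions.", audio],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "sentiment": {"type": "string"},
                "actions": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["summary", "actions"],
        },
    ),
)

print(response.text)  # a JSON string matching the schema above
```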

A diagram illustrating speaker separation. At the top, a blue audio waveform represents the input source. A branching line splits downwards from the waveform to two distinct user icons, visualizing the process of identifying and separating unique voices from a single audio stream.

Precise speaker separation

Accurately distinguish and label multiple speakers within a single transcript, ensuring clarity and correct attribution in interviews, panels, or meetings.

A diagram illustrating the model's ability to detect non-verbal cues and speech styles. On the left, a blue audio waveform represents the input. Arrows branch out from this waveform to the right, pointing to three examples of detected nuances: a smiling face icon labeled "Laughter", a weary face icon labeled "Sighs", and an ear icon listening to sound waves labeled "Whisper". This visualizes how the model captures emotional context beyond just the spoken text.

Understand the moment

Capture more than just the words. Gather the sentiment, the style of speaking, and the nuances that make speech human – like laughter.
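
A sketch that touches both of the last two points by asking for a speaker-labelled transcript that also notes non-verbal cues; the file name, prompt wording, and model id are assumptions.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

audio = client.files.upload(file="panel_discussion.mp3")  # illustrative file name

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Transcribe this recording. Label each turn with its speaker "
        "(Speaker 1, Speaker 2, ...) and a timestamp, and note non-verbal cues "
        "such as laughter, sighs, or whispering in brackets.",
        audio,
    ],
)

print(response.text)
```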


Safety

We’ve proactively assessed potential risks during every stage of the development process for these native audio features, using what we’ve learned to inform our mitigation strategies. We validate these measures through rigorous internal and external safety evaluations, including comprehensive red teaming for responsible deployment.

All audio outputs from our models are marked with SynthID, our advanced watermarking technology, allowing you to detect whether an audio track has been created or edited using Google AI.


Try Gemini Audio