Built from the ground up to be natively multimodal.


Live voice agents

Enable effective, real-time communication with Gemini 2.5 Flash Native Audio.

Real-time action

Uses tools and calls other functions from chat. That means it can draw on real-time information from sources like Google Search, or even custom developer-built tools, making conversations more practical.
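For developers, a live agent with tool access can be sketched in a few lines. The snippet below uses the google-genai Python SDK; the preview model name and the exact Live API config fields are assumptions to verify against current documentation, not guaranteed interfaces.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Assumed preview model name for native-audio dialog; check current docs.
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

# Give the live session access to Google Search as a tool.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[types.Tool(google_search=types.GoogleSearch())],
)

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Send one text turn; a real voice agent would stream microphone audio.
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="What's happening in the news today?")],
            )
        )
        # Collect the model's spoken reply as raw audio bytes.
        async for message in session.receive():
            content = message.server_content
            if content and content.model_turn:
                for part in content.model_turn.parts:
                    if part.inline_data:
                        print(f"got {len(part.inline_data.data)} bytes of audio")

asyncio.run(main())
```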

Conversation context awareness

Knows the difference between direct engagement and background chatter. It understands the rhythm of your speech, knowing precisely when to respond—and when to stay silent.

Robust steerability

Maintain specific personas and guidelines throughout long interactions. Even in complex, winding conversations, the model stays on track – and in character.


Expressive speech generation

Craft expressive narratives with granular control over style, tone, and performance using Gemini 2.5 Flash and Pro Text-to-Speech.
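As a sketch of how that control is exposed: with the google-genai Python SDK, style is steered through the prompt itself while a voice is chosen in the speech config. The model and voice names below come from the TTS preview and should be treated as assumptions, as should the raw 24 kHz PCM output format.

```python
import wave
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed TTS preview model name
    contents="Say in a hushed, suspenseful whisper: the door creaked open.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"  # one of the prebuilt voices
                )
            )
        ),
    ),
)

# Assumes the response carries raw 24 kHz, 16-bit mono PCM.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("whisper.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```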

Multi-speaker generation

Generate engaging two-person conversations from a single text input. Create podcasts, interviews, or interactive scenarios with distinct character voices.
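A two-person script can be rendered in a single call by mapping each speaker label in the text to its own voice. This is again a sketch against the google-genai Python SDK; the speaker names are arbitrary, and the multi-speaker config types are assumptions from the TTS preview.

```python
from google import genai
from google.genai import types

client = genai.Client()

script = """TTS the following conversation:
Host: Welcome back to the show. Today we're talking about native audio.
Guest: Thanks for having me. It's a surprisingly deep topic."""

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed preview model name
    contents=script,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Host",  # must match the label in the script
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Kore")),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Guest",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Puck")),
                    ),
                ]
            )
        ),
    ),
)
audio = response.candidates[0].content.parts[0].inline_data.data
```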


Live speech translation

Break down language barriers using Gemini’s speech-to-speech translation capabilities.

Language coverage

Gemini’s world knowledge and multilingual capabilities, combined with its native audio support, allow it to translate speech across more than 70 languages and 2,000 language pairs.

Style transfer

Gemini preserves the original speaker’s intonation, pacing, and pitch, adding depth that conveys not just what is said, but how it is said. A warm laugh in English becomes a warm laugh in Spanish.
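One plausible way to wire this up is a live native-audio session whose system instruction casts the model as an interpreter; the sketch below reuses the assumed Live API names from the earlier live-agent example.

```python
from google.genai import types

# A live session acting as an interpreter: everything heard in one
# language is spoken back in another, keeping tone and pacing.
# Field and model names are assumptions, as in the earlier live sketch.
translation_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=(
        "You are a live interpreter. Repeat everything the user says in "
        "Spanish, preserving their tone, pacing, and emotion. Do not add "
        "commentary of your own."
    ),
)
```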


Audio understanding

Unlock insights directly from audio files with Gemini’s audio capabilities.

[Diagram: a sound wave icon with an arrow pointing to a grid icon, illustrating audio converted into structured data.]

Turn audio into data

Transform unstructured audio – like voice notes, support calls, or lectures – into clean, structured text such as JSON, summaries, or action-item lists.
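A minimal sketch with the google-genai Python SDK: upload a recording through the Files API and ask for JSON back. Setting response_mime_type to application/json is a documented way to request machine-readable output; the file name and JSON keys here are invented for illustration.

```python
from google import genai

client = genai.Client()

# Upload the recording once; it can then be referenced in prompts.
call = client.files.upload(file="support_call.mp3")  # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        'Summarize this call as JSON with keys "summary", '
        '"customer_sentiment", and "action_items".',
        call,
    ],
    config={"response_mime_type": "application/json"},
)
print(response.text)  # well-formed JSON ready for downstream systems
```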

[Diagram: a single audio waveform branching into two user icons, illustrating separation of distinct voices from one audio stream.]

Precise speaker separation

Accurately distinguish and label multiple speakers within a single transcript, ensuring clarity and correct attribution in interviews, panels, or meetings.
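In practice, this can be as simple as asking for labeled turns in the transcription prompt. A sketch, with a hypothetical recording and an invented labeling convention:

```python
from google import genai

client = genai.Client()
meeting = client.files.upload(file="panel_discussion.mp3")  # hypothetical

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Transcribe this recording. Start each speaker turn on a new line "
        "as 'Speaker N [mm:ss]: ...', keeping labels consistent throughout.",
        meeting,
    ],
)
print(response.text)
```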

[Diagram: an audio waveform with arrows pointing to icons labeled "Laughter", "Sighs", and "Whisper", illustrating detection of emotional context beyond the spoken text.]

Understand the moment

Capture more than just the words. Gather the sentiment, the style of speaking, and all the bits that make speech human – like laughter.
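The same prompt-driven approach extends to non-verbal cues. A sketch that asks the model to annotate them inline (the bracket notation is just one invented convention):

```python
from google import genai

client = genai.Client()
note = client.files.upload(file="voice_note.mp3")  # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Transcribe this audio. Mark non-verbal cues inline in brackets, "
        "e.g. [laughs], [sighs], [whispering], and finish with one line "
        "describing the speaker's overall sentiment.",
        note,
    ],
)
print(response.text)
```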


Safety

We’ve proactively assessed potential risks during every stage of the development process for these native audio features, using what we’ve learned to inform our mitigation strategies. We validate these measures through rigorous internal and external safety evaluations, including comprehensive red teaming for responsible deployment.

All audio outputs from our models are marked with SynthID, our advanced watermarking technology, allowing you to detect whether an audio track has been created or edited using Google AI.


Try Gemini Audio