Built from the ground up to be natively multimodal.


Live voice agents

Enable effective, real-time communication with Gemini 2.5 Flash Native Audio.

Real-time action

Uses tools and calls other functions from chat. That means it can draw on real-time information from sources like Google Search, or even custom developer-built tools, making conversations more practical.
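For developers, a live agent with tool access can be sketched in a few lines. The snippet below uses the google-genai Python SDK; the preview model name and the exact Live API config fields are assumptions to verify against current documentation, not guaranteed interfaces.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Assumed preview model name for native-audio dialog; check current docs.
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

# Give the live session access to Google Search as a tool.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[types.Tool(google_search=types.GoogleSearch())],
)

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Send one text turn; a real voice agent would stream microphone audio.
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="What's happening in the news today?")],
            )
        )
        # Collect the model's spoken reply as raw audio bytes.
        async for message in session.receive():
            content = message.server_content
            if content and content.model_turn:
                for part in content.model_turn.parts:
                    if part.inline_data:
                        print(f"got {len(part.inline_data.data)} bytes of audio")

asyncio.run(main())
```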

Conversation context awareness

Knows the difference between direct engagement and background chatter. It understands the rhythm of your speech, knowing precisely when to respond—and when to stay silent.

Robust steerability

Maintain specific personas and guidelines throughout long interactions. Even in complex, winding conversations, the model stays on track – and in character.


Expressive speech generation

Craft expressive narratives with granular control over style, tone, and performance using Gemini 2.5 Flash and Pro Text-to-Speech.
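As a sketch of how that control is exposed: with the google-genai Python SDK, style is steered through the prompt itself while a voice is chosen in the speech config. The model and voice names below come from the TTS preview and should be treated as assumptions, as should the raw 24 kHz PCM output format.

```python
import wave
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed TTS preview model name
    contents="Say in a hushed, suspenseful whisper: the door creaked open.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"  # one of the prebuilt voices
                )
            )
        ),
    ),
)

# Assumes the response carries raw 24 kHz, 16-bit mono PCM.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("whisper.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```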

Multi-speaker generation

Generate engaging two-person conversations from a single text input. Create podcasts, interviews, or interactive scenarios with distinct character voices.
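A two-person script can be rendered in a single call by mapping each speaker label in the text to its own voice. This is again a sketch against the google-genai Python SDK; the speaker names are arbitrary, and the multi-speaker config types are assumptions from the TTS preview.

```python
from google import genai
from google.genai import types

client = genai.Client()

script = """TTS the following conversation:
Host: Welcome back to the show. Today we're talking about native audio.
Guest: Thanks for having me. It's a surprisingly deep topic."""

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed preview model name
    contents=script,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Host",  # must match the label in the script
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Kore")),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Guest",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Puck")),
                    ),
                ]
            )
        ),
    ),
)
audio = response.candidates[0].content.parts[0].inline_data.data
```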


Live speech translation

Break down language barriers using Gemini’s speech-to-speech translation capabilities.

Language coverage

Gemini’s world knowledge and multilingual capabilities, combined with its native audio support, allow it to translate speech across more than 70 languages and 2,000 language pairs.

Style transfer

Gemini preserves the original speaker’s intonation, pacing, and pitch, adding depth that conveys not just what is said, but how it is said. A warm laugh in English becomes a warm laugh in Spanish.
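One plausible way to wire this up is a live native-audio session whose system instruction casts the model as an interpreter; the sketch below reuses the assumed Live API names from the earlier live-agent example.

```python
from google.genai import types

# A live session acting as an interpreter: everything heard in one
# language is spoken back in another, keeping tone and pacing.
# Field and model names are assumptions, as in the earlier live sketch.
translation_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=(
        "You are a live interpreter. Repeat everything the user says in "
        "Spanish, preserving their tone, pacing, and emotion. Do not add "
        "commentary of your own."
    ),
)
```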


Audio understanding

Unlock insights directly from audio files with Gemini’s audio capabilities.

[Diagram: a sound wave icon with an arrow pointing to a grid icon, illustrating audio converted into structured data.]

Turn audio into data

Transform unstructured audio – like voice notes, support calls, or lectures – into clean, structured text such as JSON, summaries, or action-item lists.
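A minimal sketch with the google-genai Python SDK: upload a recording through the Files API and ask for JSON back. Setting response_mime_type to application/json is a documented way to request machine-readable output; the file name and JSON keys here are invented for illustration.

```python
from google import genai

client = genai.Client()

# Upload the recording once; it can then be referenced in prompts.
call = client.files.upload(file="support_call.mp3")  # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        'Summarize this call as JSON with keys "summary", '
        '"customer_sentiment", and "action_items".',
        call,
    ],
    config={"response_mime_type": "application/json"},
)
print(response.text)  # well-formed JSON ready for downstream systems
```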

[Diagram: a single audio waveform branching into two user icons, illustrating separation of distinct voices from one audio stream.]

Precise speaker separation

Accurately distinguish and label multiple speakers within a single transcript, ensuring clarity and correct attribution in interviews, panels, or meetings.
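In practice, this can be as simple as asking for labeled turns in the transcription prompt. A sketch, with a hypothetical recording and an invented labeling convention:

```python
from google import genai

client = genai.Client()
meeting = client.files.upload(file="panel_discussion.mp3")  # hypothetical

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Transcribe this recording. Start each speaker turn on a new line "
        "as 'Speaker N [mm:ss]: ...', keeping labels consistent throughout.",
        meeting,
    ],
)
print(response.text)
```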

[Diagram: an audio waveform with arrows pointing to icons labeled "Laughter", "Sighs", and "Whisper", illustrating detection of emotional context beyond the spoken text.]

Understand the moment

Capture more than just the words. Gather the sentiment, the style of speaking, and all the bits that make speech human – like laughter.
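The same prompt-driven approach extends to non-verbal cues. A sketch that asks the model to annotate them inline (the bracket notation is just one invented convention):

```python
from google import genai

client = genai.Client()
note = client.files.upload(file="voice_note.mp3")  # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Transcribe this audio. Mark non-verbal cues inline in brackets, "
        "e.g. [laughs], [sighs], [whispering], and finish with one line "
        "describing the speaker's overall sentiment.",
        note,
    ],
)
print(response.text)
```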


Safety

We’ve proactively assessed potential risks during every stage of the development process for these native audio features, using what we’ve learned to inform our mitigation strategies. We validate these measures through rigorous internal and external safety evaluations, including comprehensive red teaming for responsible deployment.

All audio outputs from our models are marked with SynthID, our advanced watermarking technology, allowing you to detect whether an audio track has been created or edited using Google AI.


Try Gemini Audio