Gemini Audio

Speech generation

Craft anything from short snippets to long-form speeches, with granular control over style, pace, delivery and performance

Models

Build agents capable of realistic speech with professional-grade audio and low latency – ready for deployment at any scale.

3.1 Flash TTS

Best for directing intonation and inflection. Intuitive audio tags give you granular control over style, pace, and tone.

Try it in Google AI Studio

Capabilities

Generate speech with precise control to create emotive narratives – or create entire conversations featuring multiple speakers.

Granular expressive control

Use inline audio tags, such as [whispers] or [enthusiastic], to direct style, pace, and delivery at specific points in the text.
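As a rough illustration of how tagged text might be assembled before it is sent to the model, the sketch below builds a script string in Python. The helper names `tag_line` and `build_script` are hypothetical, and the bracket syntax is an assumption based on the `[enthusiastic]`, `[calm]`, and `[whispers]` examples quoted on this page; the developer docs are the authority on which tags are supported. No API call is made here.

```python
def tag_line(text: str, style: str) -> str:
    """Prefix one line of script with a bracketed audio tag, e.g. "[calm] ..."."""
    # Accept both "calm" and "[calm]" by stripping surrounding brackets.
    style = style.strip().strip("[]")
    if not style:
        raise ValueError("style must not be empty")
    return f"[{style}] {text.strip()}"


def build_script(segments: list[tuple[str, str]]) -> str:
    """Join (style, text) pairs into a single tagged script string."""
    return " ".join(tag_line(text, style) for style, text in segments)


# Example input: two segments with different delivery styles.
script = build_script([
    ("enthusiastic", "This jacket is a bold choice!"),
    ("calm", "The navy blazer is a classic fit."),
])
print(script)
```

Keeping the style and the text as separate fields until the last step makes it easy to change the delivery of one segment without touching the wording.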

Immersive storytelling

Turn static scripts into emotive conversations by matching vocal delivery to the moment. Direct the emotion, pace, and tone of the piece to transform routine updates, such as weather reports, into engaging performances.

Multi-speaker generation

Generate captivating conversations from a single text input. Create podcasts, interviews, or scenes from a play with distinct character voices.
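One way to prepare a "single text input" for a conversation is to flatten a list of speaker turns into one transcript string. The sketch below does only that; the `Turn` type, the helper names, and the `Speaker: line` layout are assumptions chosen for illustration, not a format confirmed by this page.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # hypothetical speaker label, e.g. "Host"
    text: str     # what that speaker says in this turn


def to_single_input(turns: list[Turn]) -> str:
    """Flatten turns into one transcript, one "Speaker: text" line per turn."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)


def distinct_speakers(turns: list[Turn]) -> list[str]:
    """Return speaker labels in order of first appearance, without repeats."""
    return list(dict.fromkeys(t.speaker for t in turns))


interview = [
    Turn("Host", "Welcome to the show."),
    Turn("Guest", "Thanks for having me."),
    Turn("Host", "Let's get started."),
]
transcript = to_single_input(interview)
print(transcript)
print(distinct_speakers(interview))
```

The list of distinct speakers is useful when each label needs to be mapped to its own voice.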

Showcase

At StyleUAI, we power AI-driven fashion experiences for brands like Vero Moda, Columbia, and Nike. With Gemini 3.1 Flash TTS, our AI Smart Mirror finally sounds like a real personal stylist - not a voice assistant. The inline audio tags are a game changer: we can now deliver styling recommendations that feel expressive, warm, and on-brand. [enthusiastic] when presenting a bold look, [calm] and considered for a classic fit - it’s the kind of nuance that makes shoppers actually listen.

Jay

Founder, StyleUAI

At Artlist, giving our users full creative control over every part of the creation process is essential. With Google’s latest TTS model, they can now direct not just what is said, but how it’s performed - adding style instructions and expressive tags to shape the delivery. Defining how a character should feel, and exactly when they should sound happy or sad, is now possible. It puts creative direction directly in the user’s hands and is a meaningful step toward making synthetic speech feel dynamic and production-ready.

Idan Yonas

Director of AI Content & Innovation, Artlist

Our goal is simple: for agents built on Sierra to deliver great customer experiences. Using a 'constellation of models' enables us to select the best model for each part of a conversation. In our testing, we were impressed by how natural and expressive gemini-3.1-flash-tts was, from tone and pacing to accent variation. These qualities are critical to delivering better, more human customer experiences.

Lydia Xu

Software Engineer, Sierra

With HeyGen, anyone can create an AI avatar of themselves. For our European market, natural multilingual speech has been the missing piece. Gemini TTS is changing the game—the quality of its French, German, and Portuguese is remarkable, bringing a step change improvement to avatar speech.

John Wu

Engineering Manager, HeyGen

Two things impressed me about Gemini 3.1 Flash TTS: the pacing is natural — not the rushed cadence you get from most TTS models — and it handles vocal expression instructions remarkably well. You can steer tone and delivery through prompts and get something that sounds intentional, not generated.

Shivam Rastogi

VP, Engineering, Invideo AI

We’ve been testing the model across real-world, high-concurrency voice experiences. The integrated audio tags are a key differentiator, enabling precise control over tone and pacing while leveraging dialectal prosody for personalization. This makes it possible to scale regionally without compromising user experience. The model shows strong instruction alignment and clearly outperforms other solutions we’ve evaluated.

Fernanda Bejarano

Head of Product & UX, biia

We've seen clear improvements across the board - expressivity, audio style instruction following, and multilingual performance all stand out. The model handles different dialects more consistently and delivers speech that's conversational and natural. It's been a meaningful upgrade for us.

Soami Kapadia

Co-Founder, YouLearn.AI

Compared to tools like ElevenLabs, Gemini stands out in control and expressiveness. Inline audio tags like [whispers], [shouting], etc. are a big unlock, especially powerful for long-form, agent-driven experiences.

Angel Wen

Sylph.ai

Gemini 3.1 Flash TTS made Mindlid’s structured reflections feel more human... Inline audio tags were especially valuable for shaping pauses, pacing, and subtle expressive transitions.

Ertuğrul Çavuşoğlu

CEO, Mindlid

Performance

Our speech generation models produce audio quickly without compromising vocal stability or expressive quality.

Source: Artificial Analysis Text to Speech (TTS), data as of April 15, 2026.

Artificial Analysis Text to Speech (TTS) Arena Quality Elo

Models are ranked using an Elo rating system derived from user votes in blind comparisons in the Speech Arena. Users listen to pairs of speech samples generated from the same text and choose which sounds more natural. A higher Elo score indicates that listeners prefer a model's speech more often.
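To show how pairwise votes can turn into a rating, here is a minimal sketch of the classic Elo update. The page does not say which variant Artificial Analysis uses, so the 400-point scale and the `k=32` step size are conventional defaults assumed for illustration, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind comparison."""
    actual = 1.0 if a_wins else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two models start level; listeners prefer model A in one comparison.
new_a, new_b = update(1000.0, 1000.0, a_wins=True)
print(new_a, new_b)
```

An upset (the lower-rated model winning) moves both ratings further than an expected result does, which is why ratings settle as votes accumulate.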

Model information

Name: 3.1 Flash TTS

Status: Preview

Input

Output

Input tokens: 16k

Output tokens: 32k

Knowledge cutoff: January 2025

Availability: Google AI Studio, Gemini API, Gemini Enterprise Agent Platform, Google Vids

Documentation: View developer docs

Model card: View model card

Try Speech generation

Google AI Studio

The fastest path from prompt to production 

Try in Google AI Studio

Google Vids

AI-powered video creation for work

Try in Google Vids

Gemini API

Get started building with cutting-edge AI models

Gemini Enterprise Agent Platform

Build, scale, and govern agents
