Speech generation
Craft anything from short snippets to long-form speeches, with granular control over style, pace, delivery and performance
Best for directing intonation and inflection. Intuitive audio tags give you granular, precise control over style, pace, tone, and delivery.
Turn static scripts into emotive conversations by matching vocal delivery to the moment. Direct the emotion, pace, and tone of the piece to transform routine updates – like weather reports – into engaging performances.
Generate captivating conversations from a single text input. Create podcasts, interviews, or scenes from a play with distinct character voices.
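As a rough illustration of a single text input with distinct speakers and inline audio tags, here is a minimal Python sketch. It only assembles the script text and does not call any API. The bracketed tag names are taken from the examples quoted on this page; the `Line` and `render_script` helpers and the "Speaker: text" row layout are hypothetical choices for illustration, not a documented input format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    speaker: str                 # character name shown before the colon
    text: str                    # what the character says
    tag: Optional[str] = None    # inline audio tag such as "calm"; None for no tag


def render_script(lines: List[Line]) -> str:
    """Join lines into one text input, one 'Speaker: [tag] text' row per line."""
    rows = []
    for line in lines:
        prefix = f"[{line.tag}] " if line.tag else ""
        rows.append(f"{line.speaker}: {prefix}{line.text}")
    return "\n".join(rows)


script = render_script([
    Line("Host", "Welcome back to the show.", tag="enthusiastic"),
    Line("Guest", "Thanks, it is good to be here.", tag="calm"),
    Line("Host", "Let's get started."),
])
print(script)
```

Keeping the tag as a separate field makes it easy to change the delivery of one line without touching the words themselves.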
At StyleUAI, we power AI-driven fashion experiences for brands like Vero Moda, Columbia, and Nike. With Gemini 3.1 Flash TTS, our AI Smart Mirror finally sounds like a real personal stylist - not a voice assistant. The inline audio tags are a game changer: we can now deliver styling recommendations that feel expressive, warm, and on-brand. [enthusiastic] when presenting a bold look, [calm] and considered for a classic fit - it’s the kind of nuance that makes shoppers actually listen.
At Artlist, giving our users full creative control over every part of the creation process is essential. With Google’s latest TTS model, they can now direct not just what is said, but how it’s performed - adding style instructions and expressive tags to shape the delivery. Defining how a character should feel, and exactly when they should sound happy or sad, is now possible. It puts creative direction directly in the user’s hands and is a meaningful step toward making synthetic speech feel dynamic and production-ready.
Our goal is simple: for agents built on Sierra to deliver great customer experiences. Using a 'constellation of models' enables us to select the best model for each part of a conversation. In our testing, we were impressed by how natural and expressive gemini-3.1-flash-tts was, from tone and pacing to accent variation. These qualities are critical to delivering better, more human customer experiences.
With HeyGen, anyone can create an AI avatar of themselves. For our European market, natural multilingual speech has been the missing piece. Gemini TTS is changing the game—the quality of its French, German, and Portuguese is remarkable, bringing a step change improvement to avatar speech.
Two things impressed me about Gemini 3.1 Flash TTS: the pacing is natural — not the rushed cadence you get from most TTS models — and it handles vocal expression instructions remarkably well. You can steer tone and delivery through prompts and get something that sounds intentional, not generated.
We’ve been testing the model across real-world, high-concurrency voice experiences. The integrated audio tags are a key differentiator, enabling precise control over tone and pacing while leveraging dialectal prosody for personalization. This makes it possible to scale regionally without compromising user experience. The model shows strong instruction alignment and clearly outperforms other solutions we’ve evaluated.
We've seen clear improvements across the board - expressivity, audio style instruction following, and multilingual performance all stand out. The model handles different dialects more consistently and delivers speech that's conversational and natural. It's been a meaningful upgrade for us.
Compared to tools like ElevenLabs, Gemini stands out in control and expressiveness. Inline audio tags like [whispers], [shouting], etc. are a big unlock, especially powerful for long-form, agent-driven experiences.
Gemini 3.1 Flash TTS made Mindlid’s structured reflections feel more human... Inline audio tags were especially valuable for shaping pauses, pacing, and subtle expressive transitions.
Source: Artificial Analysis Text to Speech (TTS), data as of April 15, 2026.
Models are ranked using an Elo rating system derived from user votes in blind comparisons in the Speech Arena. Users listen to pairs of speech samples generated from the same text and choose which sounds more natural. A higher Elo score indicates that listeners prefer a model's speech more often.
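To give a sense of how pairwise votes can turn into a rating, here is a minimal sketch of the classic Elo update in Python. The scale of 400 and the K-factor of 32 are conventional textbook values and are assumptions here; the page does not describe the exact procedure the leaderboard uses to fit its scores.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool,
           k: float = 32.0) -> Tuple[float, float]:
    """Return both ratings after one blind comparison (k is an assumed K-factor)."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two equally rated models: one vote for A moves each rating by k / 2.
print(update(1000.0, 1000.0, a_won=True))
```

An upset (the lower-rated model winning) moves the ratings further than an expected result, which is why a model that keeps winning against strong opponents climbs quickly.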