Our most advanced text-to-image technology
Imagen 2 is our most advanced text-to-image diffusion technology, delivering high-quality, photorealistic outputs that are closely aligned and consistent with the user’s prompt. It can generate more lifelike images by using the natural distribution of its training data, instead of adopting a pre-programmed style.
Imagen 2’s powerful text-to-image technology is available in Gemini, Search Generative Experience, and a Google Labs experiment called ImageFX, which offers an innovative interface that lets users quickly explore alternative prompts and expand the bounds of their creativity.
The Google Arts and Culture team is also deploying our Imagen 2 technology in their Cultural Icons experiment, allowing users to explore, learn and test their cultural knowledge with the help of Google AI.
Developers and Cloud customers can access it via the Imagen API in Google Cloud Vertex AI.
Improved image-caption understanding
Text-to-image models learn to generate images that match a user’s prompt from details in their training datasets’ images and captions. But the level of detail and accuracy in these pairings varies widely from pair to pair.
To help create higher-quality and more accurate images that better align to a user’s prompt, we added further descriptions to image captions in Imagen 2’s training dataset, helping Imagen 2 learn different captioning styles and generalize to better understand a broad range of user prompts.
These enhanced image-caption pairings help Imagen 2 better understand the relationship between images and words — increasing its understanding of context and nuance.
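The enrichment described above can be pictured as each training image keeping its original caption while gaining additional, more detailed descriptions. This is an illustrative sketch only (the `describe` function and data layout are hypothetical stand-ins, not Imagen 2's actual pipeline):

```python
# Illustrative sketch of caption enrichment (hypothetical, not Imagen 2's
# actual training code): every image keeps its original caption and also
# gains a more detailed synthetic variant, exposing the model to several
# captioning styles per image.

def enrich_captions(pairs, describe):
    """pairs: list of (image_id, caption); describe: fn returning extra detail."""
    enriched = []
    for image_id, caption in pairs:
        extra = describe(image_id)  # e.g. output of an automatic captioner
        enriched.append((image_id, caption))                # original caption
        enriched.append((image_id, f"{caption}. {extra}"))  # detailed variant
    return enriched

# Toy usage with a stand-in describe() function:
pairs = [("img_001", "A dog on a beach")]
described = enrich_captions(
    pairs, lambda _id: "Golden retriever, sunset light, waves in background"
)
```

The result pairs each image with both a terse and a richly detailed caption, which is one simple way to diversify captioning styles in a dataset.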
Here are examples of Imagen 2’s prompt understanding:
More realistic image generation
Imagen 2’s dataset and model advances have delivered improvements in many of the areas that text-to-image tools often struggle with, including rendering realistic hands and human faces and minimizing distracting visual artifacts.
We trained a specialized image aesthetics model based on human preferences for qualities like good lighting, framing, exposure, and sharpness. Each image was given an aesthetics score, which helped condition Imagen 2 to give more weight to images in its training dataset that align with qualities humans prefer. This technique improves Imagen 2’s ability to generate higher-quality images.
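One simple way to picture "giving more weight" to high-scoring images is score-proportional sampling during training. The sketch below is an assumption for illustration, not the actual mechanism Imagen 2 uses:

```python
import random

# Hypothetical sketch: an aesthetics score per image becomes a sampling
# weight, so higher-scoring images are drawn more often when forming
# training batches.

def weighted_batch(dataset, scores, batch_size, rng=random):
    """dataset: list of items; scores: aesthetics score per item (>= 0)."""
    return rng.choices(dataset, weights=scores, k=batch_size)

images = ["blurry.jpg", "well_lit.jpg", "overexposed.jpg"]
scores = [0.2, 0.9, 0.3]  # stand-in aesthetics-model outputs
batch = weighted_batch(images, scores, batch_size=4, rng=random.Random(0))
```

With these weights, `well_lit.jpg` is sampled roughly four times as often as `blurry.jpg`, so the model sees preferred imagery more frequently.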
Fluid style conditioning
Imagen 2’s diffusion-based techniques provide a high degree of flexibility, making it easier to control and adjust the style of an image. By providing reference style images in combination with a text prompt, we can condition Imagen 2 to generate new imagery that follows the same style.
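Conceptually, conditioning on a reference style means the model receives a combined signal from the prompt and the style image. The blend below is a toy stand-in (the embeddings and the weighting scheme are hypothetical, not how Imagen 2 combines them):

```python
# Conceptual sketch only: blend a text embedding with a reference-style
# embedding into a single conditioning vector. Both embeddings and the
# style_weight parameter are hypothetical stand-ins.

def condition(text_emb, style_emb, style_weight=0.5):
    """Linear blend of text and style embeddings."""
    return [
        (1 - style_weight) * t + style_weight * s
        for t, s in zip(text_emb, style_emb)
    ]

text_emb = [1.0, 0.0, 0.0]   # stand-in for an encoded prompt
style_emb = [0.0, 1.0, 0.0]  # stand-in for an encoded style image
cond = condition(text_emb, style_emb, style_weight=0.25)
```

Raising `style_weight` pushes generations toward the reference style; lowering it keeps them closer to the text prompt alone.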
Advanced inpainting and outpainting
Imagen 2 also enables image editing capabilities like ‘inpainting’ and ‘outpainting’. By providing a reference image and an image mask, users can generate new content directly into the original image with a technique called inpainting, or extend the original image beyond its borders with outpainting. This technology is planned to come to Google Cloud’s Vertex AI over the course of 2024.
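The role of the image mask can be shown with a toy composite: the mask marks which pixels the model may rewrite and which must be preserved. This is a minimal sketch with made-up pixel grids, not the actual editing pipeline:

```python
# Illustrative sketch of how an image mask drives inpainting. Images are
# toy 2-D grids of pixel values; mask value 1 marks pixels the model may
# regenerate, 0 marks pixels to keep from the original.

def apply_inpainting(original, mask, generated):
    """Composite: keep original where mask is 0, take generated where mask is 1."""
    return [
        [g if m else o for o, m, g in zip(orow, mrow, grow)]
        for orow, mrow, grow in zip(original, mask, generated)
    ]

original  = [[1, 1], [1, 1]]
mask      = [[0, 1], [0, 0]]   # regenerate only the top-right pixel
generated = [[9, 9], [9, 9]]   # stand-in for model output
result = apply_inpainting(original, mask, generated)  # [[1, 9], [1, 1]]
```

Outpainting follows the same idea with a mask that covers a padded border region beyond the original image’s edges.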
Responsible by design
To help mitigate the potential risks and challenges of our text-to-image generative technology, we set robust guardrails in place, from design and development to deployment in our products.
Imagen 2 is integrated with SynthID, our cutting-edge toolkit for watermarking and identifying AI-generated content, enabling allowlisted Google Cloud customers to add an imperceptible digital watermark directly into the pixels of the image, without compromising image quality. This allows the watermark to remain detectable by SynthID, even after applying modifications like filters, cropping, or saving with lossy compression schemes.
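As a rough intuition for pixel-level watermarking, consider a toy scheme that hides bits in the least-significant bit of each pixel. This is emphatically not SynthID’s method (which, unlike this toy, survives filters, crops, and lossy compression); it only illustrates the idea of an imperceptible in-pixel mark:

```python
# Toy least-significant-bit watermark (NOT SynthID): embed one bit per
# 8-bit pixel value, changing each pixel by at most 1, then detect it.

def embed_bits(pixels, bits):
    """Set the LSB of each pixel value to the corresponding watermark bit."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def extract_bits(pixels):
    """Read the watermark back out of the LSBs."""
    return [p & 1 for p in pixels]

pixels = [200, 37, 118, 255]
bits = [1, 0, 1, 1]
marked = embed_bits(pixels, bits)          # [201, 36, 119, 255]
recovered = extract_bits(marked)           # == bits
```

Each pixel shifts by at most one intensity level, which is visually imperceptible; a production watermark like SynthID additionally has to remain detectable after the image is transformed.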
Before we release capabilities to users, we conduct robust safety testing to minimize the risk of harm. From the outset, we invested in training data safety for Imagen 2 and added technical guardrails to limit problematic outputs like violent, offensive, or sexually explicit content. We apply safety checks to training data, input prompts, and system-generated outputs at generation time. For example, we apply comprehensive safety filters to avoid generating potentially problematic content, such as images of named individuals. As we expand Imagen 2’s capabilities and launches, we continuously evaluate them for safety.
This work was made possible by key research and engineering contributions from:
Aäron van den Oord, Ali Razavi, Benigno Uria, Çağlar Ünlü, Charlie Nash, Chris Wolff, Conor Durkan, David Ding, Dawid Górny, Evgeny Gladchenko, Felix Riedel, Hang Qi, Jacob Kelly, Jakob Bauer, Jeff Donahue, Junlin Zhang, Mateusz Malinowski, Mikołaj Bińkowski, Pauline Luc, Robert Riachi, Robin Strudel, Sander Dieleman, Tobenna Peter Igwe, Yaroslav Ganin, Zach Eaton-Rosen.
Thanks to: Ben Bariach, Dawn Bloxwich, Ed Hirst, Elspeth White, Gemma Jennings, Jenny Brennan, Komal Singh, Luis C. Cobo, Miaosen Wang, Nick Pezzotti, Nicole Brichtova, Nidhi Vyas, Nina Anderson, Norman Casagrande, Sasha Brown, Sven Gowal, Tulsee Doshi, Will Hawkins, Yelin Kim, Zahra Ahmed for driving delivery; Douglas Eck, Nando de Freitas, Oriol Vinyals, Eli Collins, Demis Hassabis for their advice.
Thanks also to many others who contributed across Google DeepMind, including our partners in Google.