Natively multimodal
Understands different modalities and interleaved inputs, eliminating the need for separate embedding models and reducing pipeline complexity.
Maps text, images, videos, audio, and documents into a single, unified embedding space to capture the semantic relationships across data types.
Anchors agents and applications by providing the deep context necessary for accurate, grounded retrieval.
Powers multimodal search and clustering without requiring users to have cross-modality aligned data.
Supports long input lengths and uses Matryoshka Representation Learning for flexible output dimensions that maintain accuracy at smaller sizes (see the truncation sketch after this list).
Captures conceptual meaning in over 100 languages, resulting in more consistent representations for cross-lingual tasks.
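To make the Matryoshka property concrete, here is a minimal sketch: an MRL-trained embedding packs its most informative coordinates first, so a shorter vector can be obtained by keeping a leading prefix and re-normalizing. The `full` placeholder vector and the 3072/256 dimension sizes are illustrative assumptions, not documented properties of Gemini Embedding 2.

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Shrink an MRL-trained embedding by keeping its leading dimensions.

    Matryoshka Representation Learning concentrates the most important
    information in the first coordinates, so a prefix of the full vector
    remains a usable embedding after L2 re-normalization.
    """
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)

# `full` stands in for a full-size vector returned by an embedding API call
# (the actual output dimension of Gemini Embedding 2 is an assumption here).
full = np.random.default_rng(0).normal(size=3072)
small = truncate_mrl(full, 256)   # 256-dim vector, ready for similarity search
print(small.shape)                # (256,)
```

Because the most informative coordinates come first, the truncated vector trades a small amount of accuracy for roughly 12x less storage and faster nearest-neighbor search.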
| Metric type | Metric name | Gemini Embedding 2 | gemini-embedding-001 (legacy text-only Google model) | multimodalembedding@001 (legacy multimodal Google model) | Amazon Nova 2 Multimodal Embeddings | Voyage Multimodal 3.5 |
|---|---|---|---|---|---|---|
| Text-Text | MTEB (Multilingual) Mean (Task) | 69.9 | 68.4 | — | 63.8** | 58.5*** |
| Text-Text | MTEB (Code) Mean (Task) | 84.0 | 76.0 | — | * | * |
| Text-Image | TextCaps recall@1 | 89.6 | — | 74.0 | 76.0 | 79.4 |
| Text-Image | Docci recall@1 | 93.4 | — | — | 84.0 | 83.8 |
| Image-Text | TextCaps recall@1 | 97.4 | — | 88.1 | 88.9 | 88.6 |
| Image-Text | Docci recall@1 | 91.3 | — | — | 76.5 | 77.4 |
| Text-Document | ViDoRe v2 ndcg@10 | 64.9 | — | 28.9 | 60.6 | 65.5** |
| Text-Video | Vatex ndcg@10 | 68.8 | — | 54.9 | 60.3 | 55.2 |
| Text-Video | MSR-VTT ndcg@10 | 68.0 | — | 57.9 | 67.0 | 63.0** |
| Text-Video | Youcook2 ndcg@10 | 52.5 | — | 34.9 | 34.7 | 31.4** |
| Speech-Text | MSEB mrr@10 | 73.9 | — | — | * | — |
| Speech-Text | MSEB (ASR)**** mrr@10 | 70.4 | — | — | * | — |
* score not available
** self-reported
*** score reported for voyage-3.5
**** ASR model converts audio queries to text
Surface the most relevant matches across modalities by calculating semantic similarity.
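As a sketch of that retrieval step: because all modalities share one embedding space, a text query vector can be scored directly against image or video vectors with cosine similarity. The `embed_text` / `embed_image` names in the comments are hypothetical stand-ins for API calls, and the random vectors simply make the example runnable.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Semantic similarity of two vectors from the shared embedding space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_matches(query_vec, candidate_vecs, k=3):
    """Rank candidate embeddings (any modality) against one query embedding."""
    scores = [cosine_similarity(query_vec, v) for v in candidate_vecs]
    return sorted(enumerate(scores), key=lambda s: s[1], reverse=True)[:k]

# Hypothetical usage: `query` would come from something like
# embed_text("sunset over a harbor"), and `library` from embed_image /
# embed_video calls; in a unified space the outputs are directly comparable.
rng = np.random.default_rng(1)
query = rng.normal(size=256)
library = [rng.normal(size=256) for _ in range(100)]
for idx, score in top_matches(query, library):
    print(f"asset {idx}: similarity {score:.3f}")
```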
“Empowering our teams to seamlessly search past and present content has increasingly driven us to vector search. While initially seeing great results with traditional large text embeddings (3,072 dim), crowding in vector space quickly took over; the right results couldn't reliably surface their way up from the noise. Gemini's new Embedding 2 model completely changed the game. Text queries can now pinpoint untranscribed micro-expressions, and we can even leverage existing media, such as a photo or B-roll clip, as the search input to instantly retrieve matching video assets. This propelled our text-to-video Recall@1 rate to 85.3%.”
Seth Georgion
VP Technology Innovation, Paramount Skydance
“We chose Gemini Embedding 2 to help legal professionals find critical information during the discovery process in litigation – a highly technical challenge in a high-stakes setting, and one Gemini excels at. In our most recent tests, Gemini's multi-modal embedding model improves precision and recall across millions of records, while unlocking powerful new search functionality for images and videos. For legal professionals, these new capabilities open up entirely novel ways to quickly understand case materials in even the largest matters.”
Max Christoff
CTO, Everlaw
“Gemini Embedding 2 is the foundation for Sparkonomy’s Creative Economic Equality Engine. Its native multi-modality slashes our latency by up to 70% by removing LLM inference and nearly doubles semantic similarity scores for text-image and text-video pairs, leaping from 0.4 to 0.8. This powers our proprietary Creator Genome to index millions of minutes of video, alongside images and text, with unprecedented precision, unlocking unbiased brand collaborations and democratizing economic success for every creator.”
Guneet Singh
Co-founder, Sparkonomy
“The API continuity is excellent. Gemini Embedding 2 drops right into our existing workflow with minimal changes. We're testing new ways to embed text-based conversational memories together with audio and visual embeddings, especially assistant question-and-answer pairs, and seeing a 20% lift in top-1 recall for our personal wellness app.”
Ertuğrul Çavuşoğlu
Co-founder, Mindlid