Gemini Embedding 2

State-of-the-art multimodal embedding model

Maps text, images, videos, audio, and documents into a single, unified embedding space, capturing semantic relationships across data types

Performance

State-of-the-art results on a range of cross-modal benchmarks

| Metric type | Metric name | Gemini Embedding 2 | gemini-embedding-001 (legacy text-only Google model) | multimodalembedding@001 (legacy multimodal Google model) | Amazon Nova 2 Multimodal Embeddings | Voyage Multimodal 3.5 |
|---|---|---|---|---|---|---|
| Text-Text | MTEB (Multilingual) Mean (Task) | 69.9 | 68.4 | * | 63.8** | 58.5*** |
| Text-Text | MTEB (Code) Mean (Task) | 84.0 | * | * | 76.0** | * |
| Text-Image | TextCaps recall@1 | 89.6 | * | 74.0 | 76.0 | 79.4 |
| Text-Image | Docci recall@1 | 93.4 | * | 84.0 | 83.8 | * |
| Image-Text | TextCaps recall@1 | 97.4 | * | 88.1 | 88.9 | 88.6 |
| Image-Text | Docci recall@1 | 91.3 | * | 76.5 | 77.4 | * |
| Text-Document | ViDoRe v2 ndcg@10 | 64.9 | * | 28.9 | 60.6 | 65.5** |
| Text-Video | Vatex ndcg@10 | 68.8 | * | 54.9 | 60.3 | 55.2 |
| Text-Video | MSR-VTT ndcg@10 | 68.0 | * | 57.9 | 67.0 | 63.0** |
| Text-Video | YouCook2 ndcg@10 | 52.5 | * | 34.9 | 34.7 | 31.4** |
| Speech-Text | MSEB mrr@10 | 73.9 | * | * | * | * |
| Speech-Text | MSEB (ASR)**** mrr@10 | 70.4 | * | * | * | * |

* score not available
** self-reported
*** score measured with the voyage-3.5 text model
**** ASR model converts audio queries to text
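The benchmarks above report three standard retrieval metrics: recall@k (did a relevant item land in the top k results), mrr@k (reciprocal rank of the first relevant hit), and ndcg@k (rank-discounted gain relative to the ideal ranking). As a reference, here is a minimal sketch of each with binary relevance; the function names and signatures are illustrative, not part of any benchmark's official tooling.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of relevant items that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: discounted gain of the ranking over the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, if the only relevant document is ranked second, recall@1 is 0 while mrr@10 is 0.5.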

Hands-on

Generate embeddings and explore how you can use them

Multimodal Search with Gemini Embedding 2

Surface the most relevant matches across modalities by calculating semantic similarity.
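Because all modalities share one embedding space, cross-modal search reduces to nearest-neighbor ranking by vector similarity. The sketch below shows that ranking step with cosine similarity over plain NumPy arrays; in practice the vectors would come from an embeddings call to the model (e.g. via the Gemini API), and the `search` helper here is illustrative, not part of any SDK.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, corpus_embeddings, top_k=3):
    """Return corpus indices ranked by semantic similarity to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in corpus_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
```

The corpus embeddings can mix modalities (captions, images, video clips) since they live in the same space; the query embedding is compared against all of them uniformly.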


Model information

Name
Embedding 2
Status
Generally available
Input
  • Text
  • Image
  • Video
  • Audio
  • Documents
Output
Embeddings
Input tokens
8,192
Dimension sizes
128 - 3072
Availability
  • Google Cloud / Vertex AI
  • Gemini API
Documentation
View Gemini API docs
View Google Cloud docs
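The model card lists dimension sizes from 128 to 3072. Assuming smaller sizes are obtained by truncating a full-size embedding (Matryoshka-style truncation is an assumption here, not stated above), the truncated vector should be rescaled to unit length so cosine similarity remains well-behaved. A minimal sketch:

```python
import numpy as np

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components of an embedding and rescale to
    unit length, so cosine similarity stays meaningful at the reduced size.
    Assumes the model supports truncation-based dimensionality reduction."""
    v = np.asarray(embedding, dtype=float)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm else v
```

Smaller dimensions trade some retrieval quality for lower storage and faster similarity search, which matters at large corpus sizes.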