Welcome to
the Gemini era
Gemini is built from the ground up for multimodality — reasoning seamlessly across text, images, video, audio, and code.
Meet the first version of Gemini — our most capable AI model.
MMLU scores:
- Gemini Ultra: 90.0% (CoT@32*)
- Human expert: 89.8%
- Previous SOTA (GPT-4): 86.4% (5-shot*, reported)

*Note that evaluations of previous SOTA models use different prompting techniques.
Gemini is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), one of the most popular methods to test the knowledge and problem-solving abilities of AI models.
Gemini surpasses state-of-the-art performance on a range of benchmarks including text and coding.
TEXT

Higher is better. GPT-4 API numbers were calculated where reported numbers were missing.

| Capability | Benchmark | Description | Gemini Ultra | GPT-4 |
|---|---|---|---|---|
| General | MMLU | Representation of questions in 57 subjects (incl. STEM, humanities, and others) | 90.0% (CoT@32*) | 86.4% (5-shot**, reported) |
| Reasoning | Big-Bench Hard | Diverse set of challenging tasks requiring multi-step reasoning | 83.6% (3-shot) | 83.1% (3-shot, API) |
| Reasoning | DROP | Reading comprehension (F1 score) | 82.4 (variable shots) | 80.9 (3-shot, reported) |
| Reasoning | HellaSwag | Commonsense reasoning for everyday tasks | 87.8% (10-shot*) | 95.3% (10-shot*, reported) |
| Math | GSM8K | Basic arithmetic manipulations (incl. Grade School math problems) | 94.4% (maj1@32) | 92.0% (5-shot CoT, reported) |
| Math | MATH | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | 53.2% (4-shot) | 52.9% (4-shot, API) |
| Code | HumanEval | Python code generation | 74.4% (0-shot, IT*) | 67.0% (0-shot*, reported) |
| Code | Natural2Code | Python code generation on a new held-out, HumanEval-like dataset not leaked on the web | 74.9% (0-shot) | 73.9% (0-shot, API) |
*See the technical report for details on performance with other methodologies
**GPT-4 scores 87.29% with CoT@32—see the technical report for full comparison
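The prompting labels in the table (CoT@32, maj1@32, k-shot) describe how each score was measured, not the model itself. As a rough illustration of what a 32-sample chain-of-thought, majority-vote evaluation involves, here is a minimal Python sketch; the generate_cot_answer and extract_final_answer helpers are hypothetical stand-ins for a model call and answer parsing, not part of any published evaluation harness.

```python
from collections import Counter

def majority_vote_eval(question, generate_cot_answer, extract_final_answer, k=32):
    """Illustrative CoT@k / maj@k scoring: sample k chain-of-thought
    completions for one question and keep the most common final answer.

    generate_cot_answer(question) and extract_final_answer(text) are
    hypothetical helpers standing in for a model call and answer parsing.
    """
    finals = []
    for _ in range(k):
        completion = generate_cot_answer(question)       # one sampled CoT completion
        finals.append(extract_final_answer(completion))  # e.g. "(B)" or "42"
    answer, votes = Counter(finals).most_common(1)[0]
    return answer, votes / k  # consensus answer and its vote share
```

Benchmark accuracy under this kind of protocol is the fraction of questions whose consensus answer matches the reference, which is why CoT@32 scores and plain few-shot scores (see the footnotes above) are not directly comparable.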
Gemini surpasses state-of-the-art performance on a range of multimodal benchmarks.
MULTIMODAL

Higher is better unless otherwise noted. The previous SOTA model is listed when a capability is not supported in GPT-4V.

| Capability | Benchmark | Description | Gemini | GPT-4V / previous SOTA |
|---|---|---|---|---|
| Image | MMMU | Multi-discipline college-level reasoning problems | 59.4% (0-shot pass@1, Gemini Ultra, pixel only*) | 56.8% (0-shot pass@1, GPT-4V) |
| Image | VQAv2 | Natural image understanding | 77.8% (0-shot, Gemini Ultra, pixel only*) | 77.2% (0-shot, GPT-4V) |
| Image | TextVQA | OCR on natural images | 82.3% (0-shot, Gemini Ultra, pixel only*) | 78.0% (0-shot, GPT-4V) |
| Image | DocVQA | Document understanding | 90.9% (0-shot, Gemini Ultra, pixel only*) | 88.4% (0-shot, GPT-4V, pixel only) |
| Image | Infographic VQA | Infographic understanding | 80.3% (0-shot, Gemini Ultra, pixel only*) | 75.1% (0-shot, GPT-4V, pixel only) |
| Image | MathVista | Mathematical reasoning in visual contexts | 53.0% (0-shot, Gemini Ultra, pixel only*) | 49.9% (0-shot, GPT-4V) |
| Video | VATEX | English video captioning (CIDEr) | 62.7 (4-shot, Gemini Ultra) | 56.0 (4-shot, DeepMind Flamingo) |
| Video | Perception Test MCQA | Video question answering | 54.7% (0-shot, Gemini Ultra) | 46.3% (0-shot, SeViLA) |
| Audio | CoVoST 2 (21 languages) | Automatic speech translation (BLEU score) | 40.1 (Gemini Pro) | 29.1 (Whisper v2) |
| Audio | FLEURS (62 languages) | Automatic speech recognition (word error rate; lower is better) | 7.6% (Gemini Pro) | 17.6% (Whisper v3) |
*Gemini image benchmarks are pixel only—no assistance from OCR systems
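For the FLEURS row in the table above, lower is better because the reported number is a word error rate rather than an accuracy. The sketch below shows the standard WER computation (word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words); the example transcripts are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance (substitutions, insertions,
    deletions) divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up example: one substituted word in a six-word reference, WER ≈ 0.17.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```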
Gemini comes in three sizes
Ultra
Our most capable and largest model for highly complex tasks.
Pro
Our best model for scaling across a wide range of tasks.
Nano
Our most efficient model for on-device tasks.
Hands-on with Gemini
Watch highlights from our testing of Gemini’s multimodal reasoning capabilities. Curious to learn more? Explore our prompting techniques here.
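This page does not document an API, but as an illustrative sketch of the kind of multimodal prompting shown in the videos, the snippet below assumes the google-generativeai Python SDK and the Gemini Pro model names available at launch; the API key and image path are placeholders.

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Text-only prompt against the Pro model.
text_model = genai.GenerativeModel("gemini-pro")
print(text_model.generate_content("Summarize what a multimodal model is in one sentence.").text)

# Multimodal prompt: interleave an image with a text question.
vision_model = genai.GenerativeModel("gemini-pro-vision")
image = PIL.Image.open("drawing.png")  # placeholder image path
print(vision_model.generate_content([image, "What is this a drawing of?"]).text)
```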
See more of what #GeminiAI can do