Gemini
Our most intelligent AI models
Gemini 2.5 models can reason through their thoughts before responding, improving both performance and accuracy.
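For developers, this thinking capability is exposed as a configurable budget in the Gemini API. Below is a minimal sketch using the google-genai Python SDK; the prompt and budget value are illustrative, and the valid budget ranges are documented in the API reference.

```python
# Minimal sketch: calling a Gemini 2.5 model with thinking enabled
# via the google-genai SDK (pip install google-genai).
# The thinking budget below is illustrative; see the API docs for valid ranges.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Briefly explain why the sky is blue.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```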
What's new
Model family
Gemini 2.5 builds on the best of Gemini, with native multimodality and a long context window.
Hands-on with 2.5 Pro
See how Gemini 2.5 Pro uses its reasoning capabilities to create interactive simulations and do advanced coding.
Performance
Gemini 2.5 is state-of-the-art across a wide range of benchmarks.
Benchmarks
Gemini 2.5 Pro demonstrates significantly improved performance across a wide range of benchmarks.
| Capability | Benchmark | Gemini 2.5 Pro Preview 06-05 (Thinking) | OpenAI o3 (High) | OpenAI o4-mini (High) | Claude Opus 4 (32k thinking) | Grok 3 Beta (Extended thinking) | DeepSeek R1 (05-28) |
|---|---|---|---|---|---|---|---|
| Input price | $/1M tokens (no caching) | $1.25 ($2.50 for prompts >200k tokens) | $10.00 | $1.10 | $15.00 | $3.00 | $0.55 |
| Output price | $/1M tokens | $10.00 ($15.00 for prompts >200k tokens) | $40.00 | $4.40 | $75.00 | $15.00 | $2.19 |
| Reasoning & knowledge | Humanity's Last Exam (no tools) | 21.6% | 20.3% | 14.3% | 10.7% | — | 14.0%* |
| Science | GPQA diamond (single attempt) | 86.4% | 83.3% | 81.4% | 79.6% | 80.2% | 81.0% |
| Science | GPQA diamond (multiple attempts) | — | — | — | 83.3% | 84.6% | — |
| Mathematics | AIME 2025 (single attempt) | 88.0% | 88.9% | 92.7% | 75.5% | 77.3% | 87.5% |
| Mathematics | AIME 2025 (multiple attempts) | — | — | — | 90.0% | 93.3% | — |
| Code generation | LiveCodeBench (UI: 1/1/2025-5/1/2025, single attempt) | 69.0% | 72.0% | 75.8% | 51.1% | — | 70.5% |
| Code editing | Aider Polyglot | 82.2% (diff-fenced) | 79.6% (diff) | 72.0% (diff) | 72.0% (diff) | 53.3% (diff) | 71.6% |
| Agentic coding | SWE-bench Verified (single attempt) | 59.6% | — | — | 72.5% | — | — |
| Agentic coding | SWE-bench Verified (multiple attempts) | 67.2% | 69.1% | 68.1% | 79.4% | — | 57.6% |
| Factuality | SimpleQA | 54.0% | 48.6% | 19.3% | — | 43.6% | 27.8% |
| Factuality | FACTS grounding | 87.8% | 69.6% | 62.1% | 77.7% | 74.8% | — |
| Visual reasoning | MMMU (single attempt) | 82.0% | 82.9% | 81.6% | 76.5% | 76.0% | no MM support |
| Visual reasoning | MMMU (multiple attempts) | — | — | — | — | 78.0% | no MM support |
| Image understanding | Vibe-Eval (Reka) | 67.2% | — | — | — | — | no MM support |
| Video understanding | VideoMMMU | 83.6% | — | — | — | — | no MM support |
| Long context | MRCR v2 (8-needle), 128k average | 58.0% | 57.1% | 36.3% | — | 34.0% | — |
| Long context | MRCR v2 (8-needle), 1M pointwise | 16.4% | no support | no support | no support | no support | no support |
| Multilingual performance | Global MMLU (Lite) | 89.2% | — | — | — | — | — |
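To make the tiered pricing rows concrete, here is a short worked example. One assumption, flagged in the comments, is that the higher Gemini 2.5 Pro rate applies to the whole request once the prompt exceeds 200k tokens; verify the exact tier rules on the official pricing page before budgeting against these numbers.

```python
# Worked sketch of the Gemini 2.5 Pro pricing tiers from the table above.
# Assumption (tier rule, not confirmed here): once a prompt exceeds 200k
# tokens, the higher per-token rate applies to the entire request.

def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one request; rates are $ per 1M tokens."""
    long_prompt = input_tokens > 200_000
    input_rate = 2.50 if long_prompt else 1.25
    output_rate = 15.00 if long_prompt else 10.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 50k-token prompt, 2k-token answer: 50,000*1.25/1e6 + 2,000*10.00/1e6 = $0.0825
print(f"${gemini_25_pro_cost(50_000, 2_000):.4f}")
# 300k-token prompt crosses the tier: 300,000*2.50/1e6 + 2,000*15.00/1e6 = $0.78
print(f"${gemini_25_pro_cost(300_000, 2_000):.4f}")
```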