Reasoning with unprecedented depth and nuance
Smart, concise, direct responses – with genuine insight over cliché and flattery.
Our most intelligent model yet, with state-of-the-art reasoning to help you learn, build, and plan anything.
Text, images, video, audio – even code. Gemini 3 is state-of-the-art in reasoning, with unprecedented depth and nuance.
Gemini 3 brings exceptional instruction following – with meaningfully improved tool use and agentic coding.
Better tool use. Simultaneous, multi-step tasks. Gemini 3’s agentic capabilities help you build more helpful and intelligent personal AI assistants.
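As an illustration, here is a minimal sketch of tool use with the google-genai Python SDK, where the model can call a developer-supplied function during generation. The model ID `gemini-3-pro-preview` and the `get_weather` helper are placeholders for this example, not confirmed identifiers.

```python
from google import genai
from google.genai import types

def get_weather(city: str) -> str:
    """Toy stand-in for a real weather API the model may call as a tool."""
    return f"Sunny and 22°C in {city}."

client = genai.Client()  # expects GEMINI_API_KEY in the environment

# Passing a Python callable as a tool enables automatic function calling:
# the SDK runs get_weather when the model requests it, then returns the
# final, tool-informed answer.
response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model ID
    contents="What's the weather in Zurich, and should I pack a jacket?",
    config=types.GenerateContentConfig(tools=[get_weather]),
)
print(response.text)
```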
Build with our new agentic development platform
Leap from prompt to production
Get started building with cutting-edge AI models
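A first text-generation call with the google-genai Python SDK could look like the sketch below; the model ID `gemini-3-pro-preview` is an assumed placeholder and may differ from the released identifier.

```python
from google import genai

# The client reads the API key from the GEMINI_API_KEY environment variable.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model ID
    contents="Outline a three-step plan for learning linear algebra.",
)
print(response.text)
```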
Gemini 3 Pro excels at practical, front-end development – with a more intuitive interface and richer design
Gemini 3 Pro’s state-of-the-art reasoning provides unprecedented nuance and depth
Gemini 3 Pro seamlessly synthesizes information across text, images, video, audio, and even code to help you learn
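A multimodal request follows the same pattern: image bytes and a text prompt are sent together in a single call. This is a sketch assuming the google-genai Python SDK and the same placeholder model ID as above.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Read a local image and pass it alongside a text instruction.
with open("chart.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Summarize the main trend shown in this chart.",
    ],
)
print(response.text)
```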
Our most intelligent model yet sets a new bar for AI model performance.
| Benchmark | Notes | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1 |
|---|---|---|---|---|---|
| Academic reasoning (Humanity's Last Exam) | No tools | 37.5% | 21.6% | 13.7% | 26.5% |
| | With search and code execution | 45.8% | — | — | — |
| Visual reasoning puzzles (ARC-AGI-2) | ARC Prize Verified | 31.1% | 4.9% | 13.6% | 17.6% |
| Scientific knowledge (GPQA Diamond) | No tools | 91.9% | 86.4% | 83.4% | 88.1% |
| Mathematics (AIME 2025) | No tools | 95.0% | 88.0% | 87.0% | 94.0% |
| | With code execution | 100.0% | — | 100.0% | — |
| Challenging math contest problems (MathArena Apex) | | 23.4% | 0.5% | 1.6% | 1.0% |
| Multimodal understanding and reasoning (MMMU-Pro) | | 81.0% | 68.0% | 68.0% | 76.0% |
| Screen understanding (ScreenSpot-Pro) | | 72.7% | 11.4% | 36.2% | 3.5% |
| Information synthesis from complex charts (CharXiv Reasoning) | | 81.4% | 69.6% | 68.5% | 69.5% |
| OCR (OmniDocBench 1.5) | Overall edit distance, lower is better | 0.115 | 0.145 | 0.145 | 0.147 |
| Knowledge acquisition from videos (Video-MMMU) | | 87.6% | 83.6% | 77.8% | 80.4% |
| Competitive coding problems (LiveCodeBench Pro) | Elo rating, higher is better | 2,439 | 1,775 | 1,418 | 2,243 |
| Agentic terminal coding (Terminal-Bench 2.0) | Terminus-2 agent | 54.2% | 32.6% | 42.8% | 47.6% |
| Agentic coding (SWE-Bench Verified) | Single attempt | 76.2% | 59.6% | 77.2% | 76.3% |
| Agentic tool use (τ2-bench) | | 85.4% | 54.9% | 84.7% | 80.2% |
| Long-horizon agentic tasks (Vending-Bench 2) | Net worth (mean), higher is better | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 |
| FACTS Benchmark Suite | Held-out internal grounding, parametric, MM, and search retrieval benchmarks | 70.5% | 63.4% | 50.4% | 50.8% |
| Parametric knowledge (SimpleQA Verified) | | 72.1% | 54.5% | 29.3% | 34.9% |
| Multilingual Q&A (MMMLU) | | 91.8% | 89.5% | 89.1% | 91.0% |
| Commonsense reasoning across 100 languages and cultures (Global PIQA) | | 93.4% | 91.5% | 90.1% | 90.9% |
| Long-context performance (MRCR v2, 8-needle) | 128k (average) | 77.0% | 58.0% | 47.1% | 61.6% |
| | 1M (pointwise) | 26.3% | 16.4% | Not supported | Not supported |
For details on our evaluation methodology, please see deepmind.google/models/evals-methodology/gemini-3-pro