Gemini 2.5 Pro

Best for coding and highly complex tasks

Gemini 2.5 Pro is our most advanced model yet, excelling at coding and complex prompts.



Vibe-coding nature with 2.5 Pro

[Image gallery: images transformed into code-based representations of their natural behavior.]



Performance

Gemini 2.5 Pro leads common benchmarks by meaningful margins.

| Benchmark | Notes | Gemini 2.5 Pro (Thinking) | OpenAI o3 (High) | OpenAI o4-mini (High) | Claude Opus 4 (32k thinking) | Grok 3 Beta (Extended thinking) | DeepSeek R1 (05-28) |
|---|---|---|---|---|---|---|---|
| Input price | $/1M tokens (no caching) | $1.25 ($2.50 for prompts >200k tokens) | $10.00 | $1.10 | $15.00 | $3.00 | $0.55 |
| Output price | $/1M tokens | $10.00 ($15.00 for prompts >200k tokens) | $40.00 | $4.40 | $75.00 | $15.00 | $2.19 |
| Reasoning & knowledge | Humanity's Last Exam (no tools) | 21.6% | 20.3% | 14.3% | 10.7% | | 14.0%* |
| Science | GPQA diamond, single attempt | 86.4% | 83.3% | 81.4% | 79.6% | 80.2% | 81.0% |
| | GPQA diamond, multiple attempts | | | | 83.3% | 84.6% | |
| Mathematics | AIME 2025, single attempt | 88.0% | 88.9% | 92.7% | 75.5% | 77.3% | 87.5% |
| | AIME 2025, multiple attempts | | | | 90.0% | 93.3% | |
| Code generation | LiveCodeBench (UI: 1/1/2025-5/1/2025), single attempt | 69.0% | 72.0% | 75.8% | 51.1% | | 70.5% |
| Code editing | Aider Polyglot | 82.2% (diff-fenced) | 79.6% (diff) | 72.0% (diff) | 72.0% (diff) | 53.3% (diff) | 71.6% |
| Agentic coding | SWE-bench Verified, single attempt | 59.6% | 69.1% | 68.1% | 72.5% | | |
| | SWE-bench Verified, multiple attempts | 67.2% | | | 79.4% | | 57.6% |
| Factuality | SimpleQA | 54.0% | 48.6% | 19.3% | | 43.6% | 27.8% |
| Factuality | FACTS Grounding | 87.8% | 69.6% | 62.1% | 77.7% | 74.8% | |
| Visual reasoning | MMMU, single attempt | 82.0% | 82.9% | 81.6% | 76.5% | 76.0% | no MM support |
| | MMMU, multiple attempts | | | | | 78.0% | no MM support |
| Image understanding | Vibe-Eval (Reka) | 67.2% | | | | | no MM support |
| Video understanding | VideoMMMU | 83.6% | | | | | no MM support |
| Long context | MRCR v2 (8-needle), 128k (average) | 58.0% | 57.1% | 36.3% | | 34.0% | |
| | MRCR v2 (8-needle), 1M (pointwise) | 16.4% | no support | no support | no support | no support | no support |
| Multilingual performance | Global MMLU (Lite) | 89.2% | | | | | |
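
To make the tiered pricing rows concrete, here is a small, illustrative cost estimate. The request sizes are hypothetical, and the assumption that Gemini's higher rates apply to prompts over 200k tokens follows the table notes; actual billing (for example with caching) may differ.

```python
# Illustrative cost estimate for Gemini 2.5 Pro using the per-1M-token prices above.
# Token counts are hypothetical; assumes the higher tier applies to prompts >200k tokens.

def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    long_prompt = input_tokens > 200_000
    input_rate = 2.50 if long_prompt else 1.25     # $ per 1M input tokens
    output_rate = 15.00 if long_prompt else 10.00  # $ per 1M output tokens
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

print(f"${gemini_25_pro_cost(120_000, 8_000):.2f}")  # $0.23 for a 120k-token prompt
print(f"${gemini_25_pro_cost(500_000, 8_000):.2f}")  # $1.37 for a 500k-token prompt
```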

Methodology

Gemini results: All Gemini scores are pass@1. "Single attempt" settings allow no majority voting or parallel test-time compute; "multiple attempts" settings allow test-time selection of the candidate answer. All runs use the AI Studio API with the model ID gemini-2.5-pro-preview-06-05 and default sampling settings. To reduce variance, we average over multiple trials for smaller benchmarks. The Aider Polyglot score is the pass-rate average of three trials. Vibe-Eval results are reported using Gemini as a judge.
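
For reference, a minimal sketch of such a call through the AI Studio API, assuming the google-genai Python SDK; the actual evaluation harness and prompts are not published.

```python
# Minimal sketch of an AI Studio API call with default sampling settings.
# Assumes the google-genai Python SDK (pip install google-genai); the real
# evaluation harness is not published, and the prompt below is a placeholder.
from google import genai

client = genai.Client(api_key="YOUR_AI_STUDIO_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",  # model ID named in the methodology
    contents="<one benchmark problem would go here>",
)
print(response.text)
```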

Non-Gemini results: All results for non-Gemini models are sourced from the providers' self-reported numbers unless noted otherwise below.

All SWE-bench Verified numbers follow official provider reports, which use different scaffoldings and infrastructure. Google's "multiple attempts" scaffolding for SWE-bench Verified draws multiple trajectories and re-scores them using the model's own judgment.

Thinking vs. non-thinking: For Claude 4, results are reported for the reasoning model where available (HLE, LCB, Aider). For Grok 3, all results use extended reasoning except SimpleQA (per xAI's reports) and Aider. For OpenAI models, the high reasoning setting is shown where results are available (except for GPQA, AIME 2025, SWE-bench, FACTS, and MMMU).

Single attempt vs. multiple attempts: When two numbers are reported for the same eval, the higher number uses majority voting with n=64 for Grok models and internal scoring with parallel test-time compute for Anthropic models.
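
As a toy illustration of the "multiple attempts" idea (majority voting over independently sampled answers), not any provider's actual harness:

```python
# Toy illustration of "multiple attempts" via majority voting (e.g. n=64).
# This is not any provider's actual selection harness; answers are made up.
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most common final answer among the sampled attempts."""
    return Counter(final_answers).most_common(1)[0][0]

# 64 hypothetical sampled final answers to one AIME-style problem
samples = ["204"] * 41 + ["203"] * 23
print(majority_vote(samples))  # -> "204"
```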

Result sources: Where provider numbers are not available, we report numbers from leaderboards covering these benchmarks: Humanity's Last Exam results are sourced from https://agi.safe.ai/ and https://scale.com/leaderboard/humanitys_last_exam; AIME 2025 numbers are sourced from https://matharena.ai/; LiveCodeBench results are from https://livecodebench.github.io/leaderboard.html (1/1/2025 - 5/1/2025 in the UI); Aider Polyglot numbers come from https://aider.chat/docs/leaderboards/; FACTS numbers come from https://www.kaggle.com/benchmarks/google/facts-grounding. For MRCR v2, which is not publicly available yet, we include 128k results as a cumulative score so they are comparable across models, and a pointwise value at the 1M context window to show the model's capability at full length. The methodology in this table differs from previously published MRCR v2 results: going forward we focus on the harder, 8-needle version of the benchmark.

API costs are sourced from the providers' websites and are current as of June 5, 2025.

* indicates evaluated on text problems only (without images)

Input and output prices reflect text, image, and video modalities.


Model information

Name: 2.5 Pro
Status: General availability
Input:
  • Text
  • Image
  • Video
  • Audio
  • PDF
Output:
  • Text
Input tokens: 1M
Output tokens: 64k
Knowledge cutoff: January 2025
Tool use:
  • Function calling
  • Structured output (see the API sketch at the end of this section)
  • Search as a tool
  • Code execution
Best for:
  • Reasoning
  • Coding
  • Complex prompts
Availability:
  • Gemini app
  • Google AI Studio
  • Gemini API
  • Vertex AI
Documentation: View developer docs
Model card: View model card
Technical report: View technical report
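
As a concrete illustration of the tool-use surface listed above, here is a minimal structured-output sketch. It assumes the google-genai Python SDK, the pydantic library, and the gemini-2.5-pro general-availability model ID; consult the developer docs linked above for the authoritative API.

```python
# Minimal structured-output sketch for Gemini 2.5 Pro.
# Assumes the google-genai Python SDK and pydantic; the "gemini-2.5-pro" model ID
# is the assumed GA identifier -- check the developer docs before relying on it.
from google import genai
from google.genai import types
from pydantic import BaseModel

class BugReport(BaseModel):
    title: str
    severity: str
    affected_files: list[str]

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize this stack trace as a bug report: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=BugReport,  # the SDK converts the pydantic model to a JSON schema
    ),
)
print(response.text)  # JSON conforming to the BugReport schema
```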