Thinking budget
Control how much 2.5 Flash reasons to balance latency and cost.
Our powerful and most efficient workhorse model, designed for speed and low cost.
Understands input across text, audio, images and video.
Explore vast datasets with a 1-million token context window.
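As a hedged illustration of the thinking budget control described above, here is a minimal sketch using the google-genai Python SDK; the prompt and the 1,024-token budget are illustrative assumptions, not recommended settings.

```python
# Minimal sketch (not an official snippet): cap how many tokens 2.5 Flash may
# spend on reasoning via the thinking budget, trading quality for latency/cost.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents="Summarize the trade-off between latency and reasoning depth.",
    config=types.GenerateContentConfig(
        # Illustrative value: larger budgets allow deeper reasoning at higher
        # latency and cost; a budget of 0 turns thinking off.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```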
Converse in more expressive ways with native audio outputs that capture the subtle nuances of how we speak. Seamlessly switch between 24 languages, all with the same voice.
Remarkable quality with more appropriate expressivity and prosody, delivered with low latency so you can converse fluidly.
Use natural language prompts to adapt delivery within the conversation, steering it to adopt accents and produce a range of tones and expressions.
Gemini 2.5 can use tools and function calling during dialog, allowing it to incorporate real-time information or use custom developer-built tools.
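A rough sketch of exposing a custom developer-built tool through the google-genai Python SDK's automatic function calling; the get_order_status tool is hypothetical, and for brevity this uses a standard generate_content call rather than a live dialog session.

```python
# Hedged sketch: register a developer-built tool so the model can call it
# when the conversation needs real-time or application-specific data.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

def get_order_status(order_id: str) -> dict:
    """Hypothetical tool: look up an order in an internal system."""
    return {"order_id": order_id, "status": "shipped"}  # stubbed result

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents="Where is order 12345?",
    config=types.GenerateContentConfig(tools=[get_order_status]),
)
print(response.text)  # the model may call get_order_status before answering
```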
Our system is trained to discern and disregard background speech, ambient conversations and other irrelevant audio.
| Benchmark | Notes | Gemini 2.5 Flash Thinking | Gemini 2.0 Flash | OpenAI o4-mini | Claude Sonnet 3.7 (64k extended thinking) | Grok 3 Beta (extended thinking) | DeepSeek R1 |
|---|---|---|---|---|---|---|---|
| Input price | $/1M tokens | $0.30 | $0.10 | $1.10 | $3.00 | $3.00 | $0.55 |
| Output price | $/1M tokens | $2.50 | $0.40 | $4.40 | $15.00 | $15.00 | $2.19 |
| Reasoning & knowledge: Humanity's Last Exam | no tools | 11.0% | 5.1% | 14.3% | 8.9% | — | 8.6%* |
| Science: GPQA diamond | single attempt (pass@1) | 82.8% | 60.1% | 81.4% | 78.2% | 80.2% | 71.5% |
| Science: GPQA diamond | multiple attempts | — | — | — | 84.8% | 84.6% | — |
| Mathematics: AIME 2025 | single attempt (pass@1) | 72.0% | 27.5% | 92.7% | 49.5% | 77.3% | 70.0% |
| Mathematics: AIME 2025 | multiple attempts | — | — | — | — | 93.3% | — |
| Code generation: LiveCodeBench | single attempt (pass@1) | 63.9% | 34.5% | — | — | 70.6% | 64.3% |
| Code editing: Aider Polyglot | — | 61.9% whole / 56.7% diff-fenced | 22.2% whole | 68.9% whole / 58.2% diff | 64.9% diff | 53.3% diff | 56.9% diff |
| Agentic coding: SWE-bench Verified | — | 60.4% | — | 68.1% | 70.3% | — | 49.2% |
| Factuality: SimpleQA | — | 26.9% | 29.9% | — | — | 43.6% | 30.1% |
| Factuality: FACTS Grounding | — | 85.3% | 84.6% | 62.1% | 78.8% | 74.8% | 56.8% |
| Visual reasoning: MMMU | single attempt (pass@1) | 79.7% | 71.7% | 81.6% | 75.0% | 76.0% | no MM support |
| Visual reasoning: MMMU | multiple attempts | — | — | — | — | 78.0% | no MM support |
| Image understanding: Vibe-Eval (Reka) | — | 65.4% | 56.4% | — | — | — | no MM support |
| Long context: MRCR v2 | 128k (average) | 74.0% | 36.0% | 49.0% | — | 54.0% | 45.0% |
| Long context: MRCR v2 | 1M (pointwise) | 32.0% | 6.0% | — | — | — | — |
| Multilingual performance: Global MMLU (Lite) | — | 88.4% | 83.4% | — | — | — | — |
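As a worked example of the pricing rows above, a hypothetical request to Gemini 2.5 Flash Thinking with 10,000 input tokens and 2,000 output tokens would cost about $0.008; the token counts are illustrative only.

```python
# Worked cost example using the Gemini 2.5 Flash Thinking prices from the table above.
INPUT_PRICE_PER_M = 0.30   # $ per 1M input tokens
OUTPUT_PRICE_PER_M = 2.50  # $ per 1M output tokens

input_tokens = 10_000   # hypothetical request size
output_tokens = 2_000   # hypothetical response size

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
print(f"${cost:.4f}")  # $0.0080
```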
Methodology
Gemini results: All Gemini scores are pass@1 (no majority voting or parallel test-time compute unless indicated otherwise). They are run through the AI Studio API with the model IDs gemini-2.5-flash-preview-05-20 and gemini-2.0-flash, using default sampling settings. To reduce variance, we average over multiple trials for smaller benchmarks. Vibe-Eval results are reported using Gemini as a judge.
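For illustration only, the pass@1 averaging described above amounts to something like the sketch below; the generate_answer and is_correct helpers and the trial count are assumptions, not the actual evaluation harness.

```python
# Illustrative sketch of pass@1 accuracy averaged over multiple independent trials.
# `generate_answer` and `is_correct` stand in for the real harness and grader.
def pass_at_1(problems, generate_answer, is_correct, trials=4):
    scores = []
    for _ in range(trials):
        correct = sum(is_correct(p, generate_answer(p)) for p in problems)
        scores.append(correct / len(problems))
    return sum(scores) / len(scores)  # mean accuracy across trials
```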
Non-Gemini results: All results for non-Gemini models are sourced from providers' self-reported numbers unless noted otherwise below. SWE-bench Verified numbers follow official provider reports, which use different scaffoldings and infrastructure; Google's scaffolding includes drawing multiple trajectories and re-scoring them using the model's own judgement.
Thinking vs. non-thinking: For Claude 3.7 Sonnet, GPQA, AIME 2024 and MMMU results come with 64k extended thinking, Aider with 32k, and HLE with 16k; the remaining results come from the non-thinking model due to result availability. For Grok 3, all results come with extended reasoning except SimpleQA (based on xAI reports) and Aider.
Single attempt vs. multiple attempts: When two numbers are reported for the same eval, the higher number uses majority voting with n=64 for Grok models and internal scoring with parallel test-time compute for Anthropic models.
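For reference, majority voting keeps the most common final answer across n independent samples; the following is a minimal sketch in which generate_answer stands in for a real sampling call.

```python
from collections import Counter

# Minimal sketch of majority voting: sample n answers per problem and keep the
# most frequent one. Sampling and answer-extraction details are assumptions,
# not the providers' exact harnesses.
def majority_vote(problem, generate_answer, n=64):
    answers = [generate_answer(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```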
Result sources: Where provider numbers are not available, we report numbers from leaderboards that track these benchmarks: Humanity's Last Exam results are sourced from https://agi.safe.ai/ and https://scale.com/leaderboard/humanitys_last_exam; AIME 2025 numbers are sourced from https://matharena.ai/; LiveCodeBench results are from https://livecodebench.github.io/leaderboard.html (10/1/2024 - 2/1/2025 in the UI); Aider Polyglot numbers come from https://aider.chat/docs/leaderboards/; FACTS numbers come from https://www.kaggle.com/benchmarks/google/facts-grounding. For MRCR v2, which is not publicly available yet, we include 128k results as a cumulative score so they remain comparable with previous results, and a pointwise value at the 1M context window to show the capability of the model at full length.
API costs are sourced from providers' websites and are current as of May 20th, 2025.
* indicates evaluated on text problems only (without images)
Input and output prices reflect text, image and video modalities.