| Capability | Benchmark | Description | Gemini 1.5 Flash 002 | Gemini 1.5 Pro 002 | Gemini 2.0 Flash Experimental |
|---|---|---|---|---|---|
| General | MMLU-Pro | Enhanced version of the popular MMLU dataset, with higher-difficulty questions across multiple subjects | 67.3% | 75.8% | 76.4% |
| Code | Natural2Code | Code generation across Python, Java, C++, JS, Go. Held-out, HumanEval-like dataset, not leaked on the web | 79.8% | 85.4% | 92.9% |
| Code | Bird-SQL (Dev) | Benchmark evaluating conversion of natural language questions into executable SQL | 45.6% | 54.4% | 56.9% |
| Code | LiveCodeBench | Code generation in Python. Code Generation subset covering more recent examples (06/01/2024 to 10/05/2024) | 30.0% | 34.3% | 35.1% |
| Factuality | FACTS Grounding | Ability to provide factually correct responses given documents and diverse user requests. Held-out internal dataset | 82.9% | 80.0% | 83.6% |
| Math | MATH | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | 77.9% | 86.5% | 89.7% |
| Math | HiddenMath | Competition-level math problems. Held-out, AIME/AMC-like dataset, crafted by experts and not leaked on the web | 47.2% | 52.0% | 63.0% |
| Reasoning | GPQA (diamond) | Challenging dataset of questions written by domain experts in biology, physics, and chemistry | 51.0% | 59.1% | 62.1% |
| Long-context | MRCR (1M) | Novel, diagnostic long-context understanding evaluation | 71.9% | 82.6% | 69.2% |
| Image | MMMU | Multi-discipline college-level multimodal understanding and reasoning problems | 62.3% | 65.9% | 70.7% |
| Image | Vibe-Eval (Reka) | Visual understanding in chat models with challenging everyday examples. Evaluated with a Gemini Flash model as a rater | 48.9% | 53.9% | 56.3% |
| Audio | CoVoST2 (21 lang) | Automatic speech translation (BLEU score) | 37.4 | 40.1 | 39.2 |
| Video | EgoSchema (test) | Video analysis across multiple domains | 66.8% | 71.2% | 71.5% |