Evals

Explore our comprehensive evaluations across AI capabilities.

SimpleQA Verified

SimpleQA Verified is a 1,000-prompt benchmark for reliably evaluating Large Language Models (LLMs) on short-form factuality and parametric knowledge. The authors, from Google DeepMind and Google Research, address several limitations of the original SimpleQA benchmark designed by Wei et al. (2024) at OpenAI, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created to provide the research community with a more precise instrument to track genuine progress in factuality, discourage overfitting to benchmark artifacts, and ultimately foster the development of more trustworthy AI systems.
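As a rough illustration of how short-form factuality benchmarks of this kind are typically scored, the sketch below aggregates per-question grades (correct, incorrect, or not attempted, following the original SimpleQA convention) into overall accuracy, accuracy over attempted questions, and their harmonic mean. The grade labels and metric names are assumptions for illustration, not the official SimpleQA Verified harness.

```python
from collections import Counter

# Grade labels an autorater might assign to each short-form answer
# (assumed here, following the original SimpleQA convention).
GRADES = ("correct", "incorrect", "not_attempted")

def summarize(grades: list[str]) -> dict[str, float]:
    """Aggregate per-question grades into benchmark-level metrics:
    overall accuracy, accuracy over attempted questions, and their
    harmonic mean (an F1-style score that penalizes both over-hedging
    and reckless guessing)."""
    unknown = set(grades) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {unknown}")
    counts = Counter(grades)
    total = sum(counts.values())
    attempted = counts["correct"] + counts["incorrect"]
    overall_correct = counts["correct"] / total if total else 0.0
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    denom = overall_correct + correct_given_attempted
    f1 = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0
    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f1": f1,
    }

print(summarize(["correct", "incorrect", "not_attempted", "correct"]))
```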

FACTS Grounding

The FACTS Grounding benchmark evaluates the ability of Large Language Models (LLMs) to generate factually accurate responses grounded in provided long-form documents spanning a variety of domains. FACTS Grounding moves beyond simple factual question answering by assessing whether LLM responses are fully grounded in the provided context and correctly synthesize information from a long-context document. By providing a standardized evaluation framework, FACTS Grounding aims to promote the development of LLMs that are both knowledgeable and trustworthy, facilitating their responsible deployment in real-world applications.
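FACTS Grounding relies on model-based judgment of each response. The snippet below is only a schematic of how per-response verdicts from multiple judge models might be combined into a single grounding score; the field names and aggregation rule are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Verdict:
    """Hypothetical per-response verdict from one judge model."""
    judge: str            # name of the judge model
    eligible: bool        # does the response actually address the user request?
    fully_grounded: bool  # is every claim supported by the provided document?

def grounding_score(verdicts_per_response: list[list[Verdict]]) -> float:
    """Fraction of responses judged both eligible and fully grounded,
    computed per judge and then averaged to reduce single-judge bias."""
    per_judge: dict[str, list[float]] = {}
    for verdicts in verdicts_per_response:
        for v in verdicts:
            per_judge.setdefault(v.judge, []).append(
                1.0 if (v.eligible and v.fully_grounded) else 0.0
            )
    return mean(mean(scores) for scores in per_judge.values())

example = [
    [Verdict("judge_a", True, True), Verdict("judge_b", True, False)],
    [Verdict("judge_a", True, True), Verdict("judge_b", True, True)],
]
print(grounding_score(example))  # 0.75
```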

FACTS Benchmark Suite

Large language models (LLMs) are increasingly becoming a primary source of information across diverse use cases. The FACTS Benchmark Suite is the first benchmark to holistically evaluate LLM factuality across four dimensions: (1) parametric knowledge, assessing the ability of models to accurately answer factual questions without the aid of external tools; (2) search, evaluating factuality in information-seeking scenarios where the model has access to a standardized web search API; (3) multimodality, measuring the factuality of responses to image-based questions; and (4) grounding, assessing whether long-form responses are grounded in provided documents, replacing the original FACTS Grounding benchmark.
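For intuition, the toy sketch below shows how per-dimension scores could be kept side by side and combined into a single suite-level number. The placeholder values and the unweighted average are purely illustrative; the official suite defines its own per-track metrics and aggregation.

```python
from statistics import mean

# Placeholder scores on a 0-1 scale, one per FACTS dimension; the values are
# invented for illustration and do not reflect any real model.
facts_scores = {
    "parametric_knowledge": 0.71,  # answering without external tools
    "search": 0.64,                # answering with a standardized web search API
    "multimodality": 0.58,         # answering image-based questions
    "grounding": 0.83,             # long-form answers grounded in provided documents
}

# Unweighted average across the four dimensions; the official suite may weight
# or report tracks separately.
print(f"suite score (unweighted mean): {mean(facts_scores.values()):.2f}")
```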

DeepSearchQA

DeepSearchQA is a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 fields. Unlike traditional benchmarks that target single-answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, hand-crafted tasks designed to evaluate an agent’s ability to execute complex search plans to generate exhaustive answer lists. Each task is structured as a "causal chain": discovering the information needed for one step depends on successfully completing the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web and come with objectively verifiable answer sets.
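Because every task ships with an objectively verifiable answer set, a natural way to score an agent is to compare its predicted list against the gold list. The sketch below shows one plausible set-based precision/recall/F1 computation with light answer normalization; it is an illustrative stand-in, not the official DeepSearchQA scorer.

```python
def normalize(answer: str) -> str:
    """Light normalization so trivially different surface forms match.
    (Illustrative; a real scorer may use stricter or fuzzier matching.)"""
    return " ".join(answer.lower().strip().split())

def score_answer_list(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 of a predicted answer list against the
    verified gold answer set for one task."""
    pred_set = {normalize(a) for a in predicted}
    gold_set = {normalize(a) for a in gold}
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: the agent found three of the four items on the gold list.
print(score_answer_list(
    predicted=["Item A", "item b", "Item C"],
    gold=["Item A", "Item B", "Item C", "Item D"],
))
```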

Chess Text

The Chess Text Input Leaderboard provides a standardized framework for evaluating and comparing the strategic reasoning capabilities of today’s general-purpose language models. Submissions are ranked using a Bayesian skill-rating system that updates regularly, enabling rigorous long-term assessment. This framework not only offers data-driven insights into model performance but also fosters ongoing competition as new models contend for the top ranks.
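As a simplified stand-in for a Bayesian skill-rating system, the sketch below keeps a rating mean and an uncertainty per model and updates both after each game, with less certain ratings taking larger steps. It is a Glicko-flavored illustration of the idea, not the leaderboard's actual update rule.

```python
from dataclasses import dataclass

@dataclass
class Rating:
    mu: float = 1500.0    # estimated skill
    sigma: float = 350.0  # uncertainty about that estimate

def expected_score(a: Rating, b: Rating) -> float:
    """Probability that model A beats model B under a logistic skill model."""
    return 1.0 / (1.0 + 10 ** ((b.mu - a.mu) / 400.0))

def update(a: Rating, b: Rating, score_a: float) -> None:
    """Update both ratings after one game (score_a: 1 win, 0.5 draw, 0 loss).
    Less certain ratings move more; certainty grows with every game played."""
    e_a = expected_score(a, b)
    a.mu += 0.1 * a.sigma * (score_a - e_a)
    b.mu += 0.1 * b.sigma * ((1.0 - score_a) - (1.0 - e_a))
    # Shrink uncertainty toward a floor as evidence accumulates.
    a.sigma = max(50.0, a.sigma * 0.97)
    b.sigma = max(50.0, b.sigma * 0.97)

model_a, model_b = Rating(), Rating()
update(model_a, model_b, score_a=1.0)  # model A wins; its rating rises, B's falls
print(model_a, model_b)
```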

Chess Text Openings

The Chess Text Openings Leaderboard provides a standardized framework for evaluating and comparing the strategic reasoning capabilities of today’s general-purpose language models. To ensure diverse and realistic gameplay, every match begins from one of the 20 most popular positions that can be reached after two plies (one move by White and one move by Black), based on data from a randomly chosen month on Lichess. This starting-position set reduces memorization effects and highlights how models handle branching play from authentic early-game positions.
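The sketch below illustrates how such a fixed opening book might be assembled and used: count the first two plies across a month of games, keep the 20 most common pairs, and sample one to start each match. The game-record format here is hypothetical; the leaderboard derives its set from Lichess data for a randomly chosen month.

```python
import random
from collections import Counter

def build_opening_book(games: list[list[str]], size: int = 20) -> list[tuple[str, str]]:
    """Return the `size` most frequent two-ply openings as (White move, Black move) pairs.
    Each game is assumed to be a list of moves in standard algebraic notation."""
    pairs = Counter(tuple(moves[:2]) for moves in games if len(moves) >= 2)
    return [pair for pair, _count in pairs.most_common(size)]

def pick_start_position(book: list[tuple[str, str]], rng: random.Random) -> tuple[str, str]:
    """Sample one opening from the book so matches branch from varied, realistic positions."""
    return rng.choice(book)

# Toy data: two games opening 1. e4 e5, one game opening 1. d4 Nf6.
games = [["e4", "e5", "Nf3"], ["e4", "e5", "Bc4"], ["d4", "Nf6", "c4"]]
book = build_opening_book(games, size=2)
print(pick_start_position(book, random.Random(0)))
```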