Large language models (LLMs) are increasingly becoming a primary source of information across diverse use cases, so it's important that their responses are factually accurate.
To keep improving performance on this industry-wide challenge, we need to better understand the types of use cases where models struggle to provide an accurate response, and to better measure factuality performance in those areas.
Today, we're teaming up with Kaggle to introduce the FACTS Benchmark Suite. It extends our previous work developing the FACTS Grounding Benchmark with three additional factuality benchmarks: FACTS Parametric, FACTS Search, and FACTS Multimodal.
We are also updating the original FACTS Grounding benchmark with FACTS Grounding v2, an extended benchmark that tests a model's ability to provide answers grounded in the context of a given prompt.
Each benchmark was carefully curated, producing a total of 3,513 examples that we are making publicly available today. As in our previous release, we follow standard industry practice and keep an evaluation set held out as a private set. The FACTS Benchmark Suite Score (or FACTS Score) is calculated as the average accuracy across the public and private sets of the four benchmarks. Kaggle will oversee the management of the FACTS Benchmark Suite, which includes owning the private held-out sets, testing the leading LLMs on the benchmarks, and hosting the results on a public leaderboard. More details about the FACTS evaluation methodology can be found in our tech report.
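For illustration, here is a minimal sketch of how the aggregate score described above could be computed, assuming a simple unweighted mean over the public and private splits of the four benchmarks (the exact weighting and methodology are detailed in the tech report). The accuracy values below are hypothetical placeholders, not reported results.

```python
# Minimal sketch of the FACTS Score aggregation described above.
# Assumes an unweighted mean over the public and private splits of the four
# benchmarks; see the tech report for the exact methodology.

def facts_score(accuracies: dict[str, dict[str, float]]) -> float:
    """Average accuracy across the public and private sets of all benchmarks.

    `accuracies` maps benchmark name -> {"public": acc, "private": acc},
    where each accuracy is a fraction in [0, 1].
    """
    split_scores = [
        acc
        for benchmark in accuracies.values()
        for acc in benchmark.values()
    ]
    return sum(split_scores) / len(split_scores)

# Hypothetical placeholder accuracies, not reported results:
example = {
    "Grounding":  {"public": 0.80, "private": 0.78},
    "Multimodal": {"public": 0.55, "private": 0.53},
    "Parametric": {"public": 0.70, "private": 0.68},
    "Search":     {"public": 0.65, "private": 0.63},
}
print(f"FACTS Score: {facts_score(example):.1%}")  # FACTS Score: 66.5%
```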
The FACTS Parametric benchmark assesses the ability of models to accurately answer factual questions, without the aid of external tools like web search. All the questions in the benchmark are "trivia style" questions driven by user interest that can be answered via Wikipedia (a standard source for LLM pretraining). The resulting benchmark consists of a 1,052-item public set and a 1,052-item private set.
Distribution of context domain (left) and distribution of the answer type (right) as a percent of the total set of questions in the Parametric benchmark.
A typical prompt from the public set would require the model to answer a simple question on a niche topic, e.g., "Who played harmonica on 'The Rockford Files' theme song?"
By contrast, the FACTS Search benchmark evaluates a model's ability to use a web search tool to answer questions. This benchmark was designed to be challenging for LLMs even with access to the web, often requiring the retrieval of multiple facts sequentially to answer a single query. The same web search tool is made available to all models, ensuring that model capabilities are tested in isolation, without the confounding factor of custom web retrieval settings. FACTS Search consists of an 890-item public set and a 994-item private set.
Distribution of context domain (left) and distribution of the task requested by the user (right) as a percent of the total set of prompts in the Search benchmark.
The following example from the public set was included because it requires retrieving information from several web pages: "What is the sum of the birth years of the British boxer who defeated Vazik Kazarian at the 1960 Summer Olympics, the Moroccan boxer who also competed in the men's light welterweight event at those same Olympics, and the Danish boxer who competed in both the 1960 and 1964 Summer Olympics?"
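To make the multi-hop structure of such questions concrete, the sketch below shows the sequential retrieve-then-aggregate pattern they require. The `web_search` and `extract_birth_year` functions are hypothetical stand-ins, not part of the benchmark's actual evaluation harness.

```python
# Illustrative sketch only: the search tool and helper below are hypothetical
# placeholders, not components of the FACTS Search harness.

def sum_of_birth_years(web_search, extract_birth_year) -> int:
    """Answer the example question by chaining searches.

    Each sub-question must first be resolved to a person before that person's
    birth year can be retrieved, so the final sum is only correct if every
    intermediate retrieval succeeds.
    """
    sub_questions = [
        "Which British boxer defeated Vazik Kazarian at the 1960 Summer Olympics?",
        "Which Moroccan boxer competed in the men's light welterweight event "
        "at the 1960 Summer Olympics?",
        "Which Danish boxer competed in both the 1960 and 1964 Summer Olympics?",
    ]

    total = 0
    for sub_question in sub_questions:
        # Hop 1: identify the person the sub-question refers to.
        person = web_search(sub_question)
        # Hop 2: retrieve that person's birth year with a follow-up search.
        total += extract_birth_year(web_search(f"birth year of {person}"))
    return total
```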
The FACTS Multimodal benchmark evaluates the ability of models to generate factually accurate text in response to image-based questions, which is a critical capability for modern multimodal systems.
This task requires the model to integrate visual grounding, i.e., its ability to accurately interpret and connect information from visual input, with its internal or "parametric" world knowledge. The evaluation framework is designed to ensure that a response is both correct and provides all the information necessary to be complete. The benchmark consists of a 711-item public set and an 811-item private set.
Distribution of images (left) and distribution of the question categories (right) in the Multimodal benchmark.
For example, the following image from the public set of the Multimodal benchmark appeared with the prompt: "What genus does this animal belong to?"
An example of an image from the Multimodal benchmark (Photo credit: Racta apella by desertnaturalist, CC BY 4.0)
We evaluated leading LLMs on the FACTS Benchmark Suite, which includes the updated FACTS Grounding v2.
The table below lists 15 leading models and their overall FACTS Score, followed by a breakdown of scores across the four individual benchmarks: Grounding, Multimodal, Parametric, and Search.
Gemini 3 Pro leads in overall performance, with a FACTS Score of 68.8%. In particular, we saw significant improvements from Gemini 2.5 Pro to Gemini 3 Pro on the Search and Parametric slices, where the error rate was reduced by 55% on FACTS Search and 35% on FACTS Parametric. FACTS Multimodal generally saw the lowest scores. All evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress.
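For clarity, the error-rate reductions quoted above are relative: the error rate is one minus accuracy, and the reduction is measured against the earlier model's error rate. The accuracies in the snippet below are hypothetical placeholders used only to show the arithmetic, not reported benchmark scores.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Relative reduction in error rate, where error rate = 1 - accuracy."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error

# Hypothetical placeholder accuracies, not reported scores: improving from
# 60% to 82% accuracy cuts the error rate from 40% to 18%, a 55% relative
# reduction.
print(f"{relative_error_reduction(0.60, 0.82):.0%}")  # 55%
```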
Beyond the FACTS Benchmark Suite, Gemini's improvement in factuality is also reflected in another factuality benchmark, SimpleQA Verified, going from 54.5% accuracy with Gemini 2.5 Pro to 72.1% accuracy with Gemini 3 Pro. SimpleQA Verified tests LLMs' parametric knowledge on short-form responses.
While LLM factuality is still an area of ongoing research, the FACTS Benchmark Suite and the Gemini 3 Pro results reflect Google's long-term commitment to making information universally accessible and useful. We hope this work encourages deeper research into LLM factuality, leading to better, more accurate models and products for the people who rely on them.