2025-12-09 · Google

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

Source: DeepMind · Date: 2025-12-09 · URL: https://deepmind.google/blog/facts-benchmark-suite-systematically-evaluating-the-factuality-of-large-language-models/

Summary

Google DeepMind and Kaggle released the FACTS Benchmark Suite, a four-dimensional factuality evaluation framework: Parametric (2,104 examples, internal knowledge), Search (1,884 examples, multi-step retrieval), Multimodal (1,522 examples, image-grounded QA), and Grounding v2 (context-grounded). Gemini 3 Pro scored highest at 68.8%, with a 55% error rate reduction on search tasks versus Gemini 2.5 Pro. All evaluated models scored below 70%, indicating factuality remains an unsolved problem across the industry.

Implications

A best-in-industry factuality score of 68.8% is an honest disclosure that factuality is unsolved. No evaluated model cleared 70% on the aggregate across the four dimensions. By publishing a benchmark where its own best model scores 68.8%, Google is saying: we know this is a gap, here’s a rigorous way to measure it, and we’re not hiding it. That’s a different posture from launching benchmarks designed to make your models look good.

The 55% error reduction on search tasks is the grounding claim worth extracting. Search-grounded factuality, meaning retrieving information and synthesizing a factual answer from it, is the core capability for enterprise RAG applications. A 55% error reduction from Gemini 2.5 Pro to Gemini 3 Pro on that specific dimension signals that the grounding stack itself improved substantially, rather than only the aggregate benchmark score.
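A quick way to read a claim like "55% error reduction" is as a relative change in error rate, not as a gain of 55 accuracy points. The sketch below shows that arithmetic. The accuracies in it are hypothetical numbers chosen for illustration, not figures from the FACTS release, and the function name is made up for this example.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction of the old model's errors that the new model removes."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    if old_error <= 0.0:
        raise ValueError("old model has no errors to reduce")
    return (old_error - new_error) / old_error


# Hypothetical accuracies, NOT taken from the FACTS results.
old_acc, new_acc = 0.60, 0.82

# Error rate falls from 40% to 18%, so 22 of every 40 old errors disappear.
reduction = relative_error_reduction(old_acc, new_acc)
print(f"relative error reduction: {reduction:.0%}")
```

Under these made-up numbers, a 22-point accuracy gain corresponds to a 55% relative error reduction, which is why the same headline percentage can describe very different absolute improvements depending on the starting error rate.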

Four-dimensional factuality is the right framework for different failure modes. Parametric failure (wrong internal knowledge), search failure (bad retrieval or synthesis), multimodal failure (misread image), and grounding failure (hallucinating beyond context) are different bugs requiring different fixes. A single factuality score obscures which type is failing. The FACTS Suite’s structure is the contribution — splitting by failure mode is how you build targeted mitigations.
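To make the "single score obscures the failure mode" point concrete, here is a minimal sketch with two imaginary models. It assumes the headline number is an unweighted mean of the four dimension accuracies, which the summary above does not state. The model names and scores are hypothetical and do not come from the FACTS leaderboard.

```python
from statistics import mean

# Hypothetical per-dimension accuracies for two imaginary models.
scores = {
    "model_a": {"parametric": 0.70, "search": 0.70, "multimodal": 0.50, "grounding": 0.70},
    "model_b": {"parametric": 0.50, "search": 0.70, "multimodal": 0.70, "grounding": 0.70},
}


def overall(per_dimension: dict) -> float:
    """Aggregate score, ASSUMED here to be an unweighted mean of the dimensions."""
    return mean(per_dimension.values())


def weakest(per_dimension: dict) -> str:
    """Name of the dimension with the lowest accuracy."""
    return min(per_dimension, key=per_dimension.get)


for name, dims in scores.items():
    print(f"{name}: overall={overall(dims):.3f} weakest={weakest(dims)}")
```

Both imaginary models land on the same aggregate, yet one needs better image understanding and the other needs better internal knowledge. Only the per-dimension view tells you which fix to fund.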

Watch:

Adoption of FACTS Suite by independent evaluators (HELM, LMSYS, METR) as a standard factuality surface — does it become the factuality benchmark or remain Google-internal?
Progress on multimodal factuality specifically: the sub-70% overall scores do not show which dimension is weakest, but image-grounded QA is a plausible candidate, and if it is, that’s where grounding investment should flow
Whether the Kaggle public/private split holds against contamination over time — private sets that stay private are more valuable than those that leak into training data
