2024-11-19 · Anthropic

A statistical approach to model evaluations

protocols · models · research


Source: Anthropic Research · Date: 2024-11-19 · URL: https://www.anthropic.com/research/statistical-approach-to-model-evals

Summary

Benchmark score differences between frontier models are often smaller than the statistical noise in the evaluation: on popular benchmarks, clustered standard errors can be 3x larger than the naively reported ones. The post applies the Central Limit Theorem, paired-difference testing, and power analysis to AI evals. It finds that frontier models' per-question scores correlate at 0.3–0.7, which enables variance reduction through paired testing, and it calls for reporting confidence intervals as standard practice.
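To make the summary concrete, here is a minimal sketch of three of the quantities involved: a CLT-based 95% confidence interval on a mean score, a paired-difference estimate for two models scored on the same questions, and a simplified clustered standard error. The function names, the toy score lists, and the cluster IDs are illustrative assumptions, not data or code from the post. The clustered estimator is a basic one without a small-sample correction, and 1.96 assumes a normal approximation that a real eval would only justify with far more than ten questions.

```python
import math
from statistics import mean, stdev

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def mean_with_ci(scores):
    """Return (mean, standard error, 95% CI) for per-question scores, via the CLT."""
    m = mean(scores)
    se = stdev(scores) / math.sqrt(len(scores))
    return m, se, (m - Z_95 * se, m + Z_95 * se)


def paired_difference(scores_a, scores_b):
    """Mean per-question difference A - B, with its standard error and 95% CI."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean_with_ci(diffs)


def unpaired_se(scores_a, scores_b):
    """Standard error of the difference if the two runs were treated as independent."""
    _, se_a, _ = mean_with_ci(scores_a)
    _, se_b, _ = mean_with_ci(scores_b)
    return math.sqrt(se_a ** 2 + se_b ** 2)


def clustered_se(scores, cluster_ids):
    """Simplified cluster-robust standard error of the mean (no small-sample correction).

    Residuals are summed within each cluster (e.g. questions that share a passage)
    before squaring, so within-cluster correlation widens the error bar.
    """
    m = mean(scores)
    totals = {}
    for s, c in zip(scores, cluster_ids):
        totals[c] = totals.get(c, 0.0) + (s - m)
    return math.sqrt(sum(t * t for t in totals.values())) / len(scores)


# Hypothetical 0/1 scores for two models on the same ten questions.
model_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

diff, paired_se, diff_ci = paired_difference(model_a, model_b)
independent_se = unpaired_se(model_a, model_b)

# Hypothetical clustering: six questions drawn from two shared passages.
passage_scores = [1, 1, 1, 0, 0, 0]
passage_ids = [0, 0, 0, 1, 1, 1]
_, naive_se, _ = mean_with_ci(passage_scores)
cluster_se = clustered_se(passage_scores, passage_ids)
```

In this toy data the two models agree on most questions, so the paired standard error comes out smaller than the unpaired one, yet the 95% interval on the difference still spans zero. The passage example shows the opposite effect: when correct answers bunch up by passage, the clustered standard error exceeds the naive one.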

Implications

This lands squarely in the evaluation methodology thread, the “science of evals” framing Anthropic has been pushing. The practical implication: most ranking differences claimed on leaderboards are statistically indistinguishable under rigorous analysis. This feeds directly into Anthropic’s model card and eval transparency posture, and gives cover for “we can’t definitively say X beats Y” stances. Watch whether third-party eval orgs (HELM, EleutherAI) adopt this methodology, or whether a lab weaponizes it when a competitor claims superiority on a thin margin.
