2026-02-04 · HuggingFace

Community Evals: Because we're done trusting black-box leaderboards over the community



Source: HuggingFace · Date: 2026-02-04 · URL: https://huggingface.co/blog/community-evals

Summary

Platform feature release: Community Evals on the HF Hub, a decentralized, transparent system for reporting evaluation results. Model repos store eval scores in .eval_results/*.yaml; benchmark dataset repos define the eval spec in eval.yaml (Inspect AI format) and automatically aggregate the results reported against them. Any user can submit evaluation results via PR; a submitted result appears immediately, labeled “community”, before the PR is merged. Motivation: benchmark saturation (MMLU 91%+, GSM8K 94%+, HumanEval “conquered”) and conflicting scores across reporting sources. Live benchmarks: MMLU-Pro, GPQA, HLE.
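The aggregation idea can be pictured with a minimal Python sketch. It assumes each reported result has already been reduced to a record with a model, a benchmark, a source, and a score. The Report type, the find_conflicts helper, the field names, and the sample values are all hypothetical and do not reflect the actual .eval_results/*.yaml schema. The sketch groups reports by model and benchmark and flags groups whose reported scores disagree by more than a tolerance.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One reported score. Field names are illustrative, not the Hub schema."""
    model: str
    benchmark: str
    source: str   # e.g. "author" or "community"
    score: float  # accuracy in percent


def find_conflicts(reports, tolerance=1.0):
    """Return {(model, benchmark): spread} for groups whose scores differ by more than tolerance."""
    grouped = defaultdict(list)
    for r in reports:
        grouped[(r.model, r.benchmark)].append(r)

    conflicts = {}
    for key, group in grouped.items():
        scores = [r.score for r in group]
        spread = max(scores) - min(scores)
        if spread > tolerance:
            conflicts[key] = spread
    return conflicts


# Placeholder data: model names and scores are made up for illustration.
reports = [
    Report("org/model-a", "MMLU-Pro", "author", 72.0),
    Report("org/model-a", "MMLU-Pro", "community", 68.5),
    Report("org/model-b", "GPQA", "author", 55.0),
    Report("org/model-b", "GPQA", "community", 55.4),
]

conflicts = find_conflicts(reports)
for (model, benchmark), spread in conflicts.items():
    print(f"{model} on {benchmark}: sources disagree by {spread:.1f} points")
```

With the placeholder data, only the first model is flagged, because its two sources differ by more than the one-point tolerance. The tolerance is an arbitrary choice for the sketch; a real aggregator would also have to account for differences in eval settings between sources.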

Implications

HF as open-source ML hub. Community Evals is HF’s answer to the reproducibility and trust problem in model evaluation: instead of relying on self-reported scores or proprietary leaderboard runs, it routes evaluation data through a Git-tracked system that anyone in the community can submit to. Because every score change has a full Git history, benchmark gaming becomes detectable in a way that black-box leaderboards cannot match.
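To illustrate why a tracked history helps, the Python sketch below walks a list of score revisions and reports each change along with who made it. The Revision record, the score_changes helper, and the sample commits are hypothetical; on the Hub, the equivalent information would come from the repo's Git history rather than from a hand-built list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One point in a score's history. Fields are illustrative."""
    commit: str
    author: str
    score: float


def score_changes(history):
    """Return (commit, author, old_score, new_score) for each revision that changes the score.

    Expects history ordered oldest first.
    """
    changes = []
    for prev, curr in zip(history, history[1:]):
        if curr.score != prev.score:
            changes.append((curr.commit, curr.author, prev.score, curr.score))
    return changes


# Placeholder history: commit hashes, authors, and scores are made up.
history = [
    Revision("a1b2c3d", "alice", 61.0),
    Revision("d4e5f6a", "alice", 61.0),
    Revision("0f9e8d7", "bob", 66.5),
]

changes = score_changes(history)
for commit, author, old, new in changes:
    print(f"{commit} by {author}: {old} -> {new}")
```

Every change is attributable to a commit and an author, which is the property a leaderboard without a public history lacks.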

Open-weights ecosystem health. Benchmark saturation (MMLU >91%) has reached the point where traditional benchmarks can no longer differentiate models; the push toward harder evals (GPQA, HLE, and domain-specific benchmarks like AssetOpsBench) reflects this. Because the Community Evals infrastructure supports any benchmark, not just the ones HF maintains, it gives the community the tooling to establish new evaluation standards faster than HF’s team could build them.
