2025-02-14 · HuggingFace

Fixing Open LLM Leaderboard with Math-Verify

models · research · infrastructure

Source: HuggingFace · Date: 2025-02-14 · URL: https://huggingface.co/blog/math_verify_leaderboard

Summary

Hugging Face’s Math-Verify replaces the broken answer-extraction and comparison logic in the Open LLM Leaderboard’s MATH-Hard evaluation, which had been marking correct model outputs as wrong when they didn’t match the expected format or used valid but unanticipated mathematical representations. Re-evaluating all 3,751 submitted models with Math-Verify raised scores by 4.66 points on average; Qwen model families doubled their scores and DeepSeek families nearly tripled theirs. The top-20 leaderboard reshuffled entirely, with NVIDIA’s AceMath and Qwen derivatives taking dominant positions.
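The failure mode described above can be illustrated with a toy sketch. This is not Math-Verify's implementation or the leaderboard's original grader. The function names (strict_match, extract_candidate, to_fraction, value_match), the "The final answer is" prefix, and the sample outputs are all hypothetical. It uses only the Python standard library and covers just integers, decimals, and simple fractions. It contrasts a format-bound string check with a comparison of parsed values.

```python
import re
from fractions import Fraction

# Hypothetical answer prefix that a format-bound grader might require.
PREFIX = "The final answer is "

# Numeric-looking candidates: \frac{a}{b}, a/b, decimals, integers.
CANDIDATE = re.compile(r"\\frac\{-?\d+\}\{\d+\}|-?\d+(?:[./]\d+)?")


def strict_match(output: str, gold: str) -> bool:
    """Format-bound check: exact prefix, then an exact string match on the answer."""
    idx = output.rfind(PREFIX)
    if idx == -1:
        return False
    return output[idx + len(PREFIX):].strip() == gold


def extract_candidate(output: str):
    """Return the last numeric-looking token in the output, or None."""
    matches = CANDIDATE.findall(output)
    return matches[-1] if matches else None


def to_fraction(text: str):
    """Parse an integer, decimal, a/b, or \\frac{a}{b} into an exact Fraction."""
    text = text.strip().strip("$").strip()
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(\d+)\}", text)
    if m:
        text = f"{m.group(1)}/{m.group(2)}"
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def value_match(output: str, gold: str) -> bool:
    """Format-tolerant check: compare parsed values instead of raw strings."""
    candidate = extract_candidate(output)
    if candidate is None:
        return False
    got, want = to_fraction(candidate), to_fraction(gold)
    return got is not None and want is not None and got == want


gold = "\\frac{1}{2}"
samples = [
    "The final answer is \\frac{1}{2}",
    "The final answer is 0.5",
    "Therefore x = \\boxed{\\frac{1}{2}}",
    "So we get 1/2.",
]
for out in samples:
    print(f"{out!r}: strict={strict_match(out, gold)} value={value_match(out, gold)}")
```

In this sketch all four sample outputs state the same value, but only the first passes the strict check, while the value-based check accepts all four. A real verifier has to handle far more than this (symbolic expressions, sets, intervals, units), which is the gap a dedicated parser is meant to close.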

Implications

Feeds the evaluation reliability thread: a widely cited public benchmark was producing systematically wrong results for over a year. The scope of the correction (thousands of models rescored, top 20 reshuffled) illustrates how fragile evals are to implementation details that seem mundane.
Directly relevant to any decision that used MATH-Hard leaderboard rankings to select or compare models: those decisions may have been made on corrupted data.
Broader signal: as math reasoning becomes a key proxy for general capability, the correctness of math-eval tooling is load-bearing infrastructure, not a secondary concern.
