2024-11-19 · HuggingFace

Judge Arena: Benchmarking LLMs as Evaluators

models · research

Source: HuggingFace · Date: 2024-11-19 · URL: https://huggingface.co/blog/arena-atla

Summary

Benchmark release: Judge Arena, a crowdsourced, Elo-rated leaderboard for evaluating LLMs as judges. Users see two LLM judges score the same AI response and vote for the judgment they agree with. The initial leaderboard covers 18 models, including GPT-4 Turbo, Claude 3.5 Sonnet, Llama 3.1 (8B/70B/405B), and Qwen 2.5 (7B/72B). Early results: GPT-4 Turbo leads narrowly, with Llama 3.1 70B second and Llama 3.1 405B third, so the top open-weights models outperform most proprietary models. Qwen 2.5 7B and Llama 3.1 8B perform well above what their parameter counts would suggest.
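To make the ranking mechanism concrete, here is a minimal sketch of how pairwise votes can be turned into Elo ratings. It uses the standard Elo update. The K-factor of 32, the starting rating of 1000, and the judge names are illustrative assumptions, not values taken from Judge Arena, whose exact rating parameters are not given in the source.

```python
from collections import defaultdict

K_FACTOR = 32            # assumed update step size; not from the source
INITIAL_RATING = 1000.0  # assumed starting rating; not from the source


def expected_score(rating_a, rating_b):
    """Probability that A wins against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=K_FACTOR):
    """Apply one vote: the judge the user agreed with is the winner."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_judges(votes, initial=INITIAL_RATING, k=K_FACTOR):
    """Replay (winner, loser) votes in order and return judges sorted by rating."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes between placeholder judges (not real leaderboard data).
votes = [
    ("judge_a", "judge_b"),
    ("judge_a", "judge_c"),
    ("judge_b", "judge_c"),
]
leaderboard = rank_judges(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Each update is zero-sum: the winner gains exactly what the loser gives up, so the total rating stays fixed. An upset win over a higher-rated judge moves ratings more than an expected win does.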

Implications

Open-weights ecosystem health. Llama 3.1 70B ranking second on the early Judge Arena leaderboard is a strong result. LLM-as-judge use cases (automated evaluation pipelines, RLHF reward models, agentic self-evaluation) are critical infrastructure, not just a demo task. If open-weights models are competitive judges, teams can replace proprietary judge APIs with local inference.

Model release cadence. Judge Arena results feed into decisions about which models to use in evaluation pipelines. If smaller open-weights models such as Qwen 2.5 7B continue to perform competitively as judges, the cost of high-quality automated evaluation drops, which should accelerate iteration cycles for labs training and fine-tuning models in the open ecosystem.