BigCodeArena: Judging code generations end to end with code executions
Source: HuggingFace
Date: 2025-10-07
URL: https://huggingface.co/blog/bigcode/arena
Summary
Research summary and platform release for BigCodeArena, an execution-first platform for evaluating code generation. Key finding: giving judge models the execution output alongside the source code improves their accuracy on BigCodeReward. Claude-Sonnet-4 rises from 56.7% to 62.3% (+5.6 points) and GPT-4o from 54.6% to 63.8% (+9.2 points). AutoCodeArena rankings, built from 600 automated prompts, put GPT-5 at the top, Claude-Opus-4 and Sonnet-4 in the second tier, and Qwen3-Coder, Kimi-K2, and GLM-4.5 after them. Open-source models (Qwen2.5, Llama-3.3-70B) lag proprietary models in these execution-based rankings.
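To make the key finding concrete, here is a minimal sketch of what "adding execution context" to a pairwise judge prompt could look like. It is not BigCodeArena's implementation: `ExecutionResult`, `run_snippet`, `build_judge_prompt`, and the prompt wording are all hypothetical, and the sketch only handles plain Python scripts run without sandboxing, not the UI frameworks the platform supports.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    stdout: str
    stderr: str
    returncode: int


def run_snippet(code: str, timeout: float = 10.0) -> ExecutionResult:
    """Run a Python snippet in a fresh interpreter and capture its output.

    Assumption: the code is trusted. There is no sandboxing here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return ExecutionResult(stdout="", stderr="timed out", returncode=-1)
    return ExecutionResult(proc.stdout, proc.stderr, proc.returncode)


def build_judge_prompt(task: str, code_a: str, code_b: str, with_execution: bool) -> str:
    """Assemble a pairwise judging prompt, optionally with execution results."""
    sections = [f"Task:\n{task}"]
    for label, code in (("A", code_a), ("B", code_b)):
        sections.append(f"Response {label} source:\n{code}")
        if with_execution:
            # The judge sees what the code did, not only how it reads.
            result = run_snippet(code)
            sections.append(
                f"Response {label} execution:\n"
                f"exit code: {result.returncode}\n"
                f"stdout: {result.stdout.strip()}\n"
                f"stderr: {result.stderr.strip()}"
            )
    sections.append("Which response better solves the task? Answer A or B.")
    return "\n\n".join(sections)
```

With `with_execution=False` the judge sees only the two sources. With `with_execution=True` it also sees each exit code and the captured output, so a response that looks plausible but fails at runtime becomes visible to the judge.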
Implications
Thread: open-weights ecosystem health / agentic patterns. BigCodeArena’s core insight, that judges need to see execution output and not just source code, has implications beyond evaluation: it suggests code agents should also use execution feedback when assessing the quality of their own output. The diversity of execution environments (React, Streamlit, Gradio, PyGame, Mermaid, etc.) makes this a realistic evaluation surface for real-world coding-assistant tasks. The open-source model gap in execution-based rankings is worth watching: if closed models consistently outperform on code that runs, as opposed to code that merely reads well, it signals quality gaps that go beyond style adherence.
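The agent-side implication can be sketched as a small generate, run, retry loop. This is an illustration under stated assumptions, not something described in the source: `execute`, `generate_with_feedback`, and `stub_generate` are hypothetical names, the `generate` callable stands in for a real model call, and a clean exit is only a rough proxy for correctness.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def execute(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Return (succeeded, feedback) for a Python snippet run in a subprocess."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def generate_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Call `generate` until its output runs cleanly or attempts run out.

    Returns (last_code, succeeded, attempts_used).
    """
    feedback: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        # On retries, the previous error output is passed back to the generator.
        code = generate(task, feedback)
        ok, feedback = execute(code)
        if ok:
            return code, True, attempt
    return code, False, max_attempts


def stub_generate(task: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for a model: it emits a buggy call first,
    # then a corrected one once it has seen an error message.
    return "print(sum([1, 2, 3]))" if feedback else "print(sum(1, 2, 3))"
```

Calling `generate_with_feedback(stub_generate, "sum the numbers")` fails on the first attempt with a `TypeError`, passes that error back, and succeeds on the second. A real agent would also compare the output against the task, since code can exit cleanly and still be wrong.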