The Open Agent Leaderboard
Source: HuggingFace
Date: 2026-05-18
URL: https://huggingface.co/blog/ibm-research/open-agent-leaderboard
Summary
IBM Research and HuggingFace launched the Open Agent Leaderboard, a benchmark framework that evaluates full agent systems rather than isolated models. Six benchmarks span coding (SWE-Bench Verified), web research (BrowseComp+), personal task automation (AppWorld), and customer-service scenarios (tau2-Bench). There are three key findings: open-weight models trail frontier closed-source agents by 18–29 percentage points on average; agent architecture affects both performance and cost independently of the underlying model; and failed runs cost 20–54% more than successful ones, which makes failure behavior a first-class production concern alongside capability.
Implications
- Open-weight ecosystem. The 18–29 percentage-point gap between open-weight and frontier closed-source agents is the clearest quantified measure yet of where the open ecosystem stands on agentic tasks, meaning multi-step autonomous work rather than language modeling. The gap is likely to pressure open-weight model developers (DeepSeek, Mistral, Qwen) to prioritize agent-task training rather than only parity on academic evals.
- Agent-layer convergence. The finding that agent architecture independently affects outcomes (the same model with different agent scaffolding produces different results) supports the engineering investment in agent frameworks (LangGraph, CrewAI, custom orchestration). The leaderboard provides a formal basis for comparing architectures, not just models.
- Token economics. That failed runs cost 20–54% more than successful ones is a production-economics finding, and it changes how teams should budget for agent reliability: failure rate is a cost multiplier, not just a quality metric.
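
To make the cost-multiplier point concrete, here is a minimal Python sketch of the arithmetic. It rests on assumptions not in the source: that "costs 20–54% more" means a failed run costs (1 + premium) times a successful run, that runs are independent, and that the success rate and per-run cost used below are purely hypothetical. The function names are illustrative and do not come from the leaderboard.

```python
def expected_cost_per_attempt(success_rate: float,
                              cost_success: float,
                              failure_premium: float) -> float:
    """Average spend per run, mixing successful and failed runs.

    failure_premium is the fractional extra cost of a failed run
    (0.20 to 0.54 is the range reported in the source).
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    # Assumption: a failed run costs (1 + premium) times a successful one.
    cost_failure = cost_success * (1.0 + failure_premium)
    return success_rate * cost_success + (1.0 - success_rate) * cost_failure


def cost_per_solved_task(success_rate: float,
                         cost_success: float,
                         failure_premium: float) -> float:
    """Total expected spend divided by the share of runs that succeed."""
    per_attempt = expected_cost_per_attempt(
        success_rate, cost_success, failure_premium
    )
    return per_attempt / success_rate


if __name__ == "__main__":
    # Hypothetical inputs: 60% success rate, 1.00 cost unit per successful run.
    for premium in (0.20, 0.54):  # bounds of the range reported in the source
        print(premium, round(cost_per_solved_task(0.6, 1.0, premium), 3))
```

Under these hypothetical inputs, a 60% success rate with equally priced failures would cost about 1.67 units per solved task; the 20% and 54% failure premiums raise that to 1.80 and roughly 2.03. The premium compounds with the failure rate, which is why reliability shows up directly in the budget.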