ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Tags: agents, models, enterprise, research
Source: Hugging Face
Date: 2026-05-27
URL: https://huggingface.co/blog/ibm-research/itbench-aa
Summary
ITBench-AA (IBM Research + Artificial Analysis) is the first benchmark targeting agentic enterprise IT operations. Agents get shell access to sandboxed Kubernetes incident snapshots and must autonomously diagnose root causes across infrastructure, application, and network failure types. Scoring is harsh: a model must identify all root causes or it scores zero on precision. The best result is Claude Opus at 47%, and every frontier model sits below 50%. More tool-use turns correlate negatively with performance, because over-investigation adds false positives.
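To make the scoring rule concrete, here is a minimal sketch of one way to read it. This is an illustration only, not ITBench-AA's actual scorer: the function name `strict_precision` and the root-cause labels are hypothetical, and the benchmark's real metric may differ in detail. It assumes a missed root cause zeroes the score and that, once every true cause is found, each extra (false-positive) diagnosis lowers precision.

```python
def strict_precision(predicted: set[str], actual: set[str]) -> float:
    """Hypothetical all-or-nothing scorer; NOT ITBench-AA's real code."""
    if not predicted or not actual <= predicted:
        # Assumption: missing any true root cause zeroes the score.
        return 0.0
    # Every true cause was found, so only false positives reduce the score.
    return len(actual) / len(predicted)


# Hypothetical incident with two root causes.
actual = {"pod-oom", "dns-misconfig"}

partial = strict_precision({"pod-oom"}, actual)
exact = strict_precision({"pod-oom", "dns-misconfig"}, actual)
padded = strict_precision(
    {"pod-oom", "dns-misconfig", "node-disk", "net-policy"}, actual
)
print(partial, exact, padded)  # 0.0 1.0 0.5
```

Under this reading, an agent that keeps investigating and reports two extra suspects halves its score even though it found both true causes, which is consistent with the negative correlation between turns and performance described above.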
Implications
- Agent-fleet operability: the sub-50% ceiling on real IT ops tasks is the clearest current signal that agentic reliability in enterprise infrastructure is unsolved. The benchmark is far from saturated and directly measures the gap between demo and production.
- Open-weight ecosystem: Gemma 4 31B scores 37% at $0.14/task vs. Gemini 3.1 Pro’s 30% at $2.23/task. Open-weight models are already cost-competitive at the tier where frontier closed models underperform anyway.
- Governance/constraint: the “more turns = worse” finding matters for anyone designing agentic loops, since unconstrained tool-use budgets can actively degrade correctness on complex multi-root-cause incidents.
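The cost gap in the Gemma/Gemini comparison above is easier to see worked out. The sketch below uses only the four figures quoted in that bullet. The "cost per percentage point" metric and the helper name `cost_per_point` are illustrative assumptions, not something the benchmark reports.

```python
# Figures quoted above: (score in %, cost per task in USD).
results = {
    "Gemma 4 31B": (37, 0.14),
    "Gemini 3.1 Pro": (30, 2.23),
}


def cost_per_point(score_pct: float, cost_usd: float) -> float:
    """USD spent per percentage point of score (illustrative metric)."""
    return cost_usd / score_pct


for name, (score, cost) in results.items():
    print(f"{name}: ${cost_per_point(score, cost):.4f} per point")

# Raw per-task cost ratio between the two models.
cost_ratio = results["Gemini 3.1 Pro"][1] / results["Gemma 4 31B"][1]
print(f"Per-task cost ratio: {cost_ratio:.1f}x")
```

On these numbers Gemma 4 31B costs about $0.0038 per point against about $0.0743 for Gemini 3.1 Pro. The closed model is roughly 16 times more expensive per task while scoring 7 points lower.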