2026-02-18 · HuggingFace

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

agents · models · enterprise · capital · research

Source: HuggingFace Date: 2026-02-18 URL: https://huggingface.co/blog/ibm-research/itbenchandmast

Summary

IBM Research and UC Berkeley applied MAST (Multi-Agent System Failure Taxonomy) to IT-automation agent failures from three models on ITBench. Key finding: stronger models fail in more cascading ways, with GPT-OSS-120B averaging 5.3 failure modes per failed trace versus 2.6 for Gemini-3-Flash. The failure modes identified as fatal are being unaware of termination conditions (FM-1.5), premature termination (FM-3.1), and loss of conversation history (FM-1.4). GPT-OSS-120B shows reasoning-action mismatch (FM-2.6) in 94% of traces and memory loss in 24%.
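The two kinds of numbers quoted above (failure modes per failed trace, and the share of traces showing a given mode) can be sketched as simple aggregates over annotated traces. This is a minimal illustration, not the study's pipeline: the trace IDs, the annotations, and the helper names `modes_per_failed_trace` and `prevalence` are all hypothetical, and it assumes each failed trace is labeled with a set of MAST failure-mode codes.

```python
from collections import Counter

# Hypothetical toy annotations (NOT data from the study): each failed trace
# maps to the set of MAST failure-mode codes assigned to it.
failed_traces = {
    "trace-a": {"FM-1.5", "FM-2.6", "FM-3.1"},
    "trace-b": {"FM-2.6"},
    "trace-c": {"FM-1.4", "FM-2.6"},
    "trace-d": {"FM-1.5", "FM-3.1"},
}


def modes_per_failed_trace(traces):
    """Mean number of distinct failure modes per failed trace."""
    if not traces:
        raise ValueError("need at least one failed trace")
    return sum(len(modes) for modes in traces.values()) / len(traces)


def prevalence(traces):
    """Fraction of failed traces in which each failure mode appears."""
    counts = Counter(code for modes in traces.values() for code in modes)
    return {code: n / len(traces) for code, n in counts.items()}


mean_modes = modes_per_failed_trace(failed_traces)
share = prevalence(failed_traces)

print(f"{mean_modes:.1f} failure modes per failed trace")
print(f"FM-2.6 in {share['FM-2.6']:.0%} of failed traces")
```

A higher mean indicates that failures co-occur within a single trace (the cascading pattern), while the per-mode prevalence shows which categories dominate. Whether the reported percentages use failed traces or all traces as the denominator is not stated here, so the sketch's choice of failed traces is an assumption.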

Implications

Thread: agentic patterns / open-weights ecosystem health. MAST is a practical contribution as a failure taxonomy: it turns “the agent failed” into specific, actionable failure categories. The finding that more capable models fail in more complex, cascading ways (not just less often) reframes the capability-reliability relationship: better models are not automatically more reliable; they fail differently. The 94% reasoning-action mismatch rate for GPT-OSS-120B is alarming, and it points to concrete mitigations such as context hygiene and early error detection. The taxonomy applies to any multi-step agent evaluation, not just IT automation. Watch whether MAST becomes a standard diagnostic framework for agent failure analysis.
