2026-04-15 · HuggingFace

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

agents · models · research

Source: HuggingFace · Date: 2026-04-15 · URL: https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis

Summary

IBM Research published VAKRA, an executable benchmark covering 5,187 agent tasks across 8,000+ locally hosted APIs in 62 domains, requiring 3–7-step reasoning chains. Unlike isolated capability tests, VAKRA evaluates full execution traces — tool selection from large API sets (up to 328 tools per domain), argument construction, multi-hop chaining, and hybrid API+document reasoning with explicit policy constraints. The benchmark identifies four distinct failure modes: incorrect tool selection under large candidate sets, argument errors under large parameter counts, argument value errors, and response synthesis failures after correct tool calls. Models that succeed on isolated API chaining break down sharply at 3+ hops and collapse under policy constraints (e.g., “only use document retrievers for technology queries”). No current model handles all four capabilities well simultaneously.
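
The four failure modes can be read as an ordered, stage-wise check over one execution trace. The sketch below is illustrative only and is not VAKRA's scoring code: the `Call` type, the `first_failure` function, the exact-match comparisons, and the reading of "argument errors" as wrong or missing parameter names are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Failure(Enum):
    TOOL_SELECTION = "tool_selection"          # wrong or missing tool at a step
    ARGUMENT_NAMES = "argument_names"          # wrong set of parameters (assumed reading)
    ARGUMENT_VALUES = "argument_values"        # right parameters, wrong values
    RESPONSE_SYNTHESIS = "response_synthesis"  # calls correct, final answer wrong


@dataclass
class Call:
    """One tool call in a trace (hypothetical shape, not VAKRA's schema)."""
    tool: str
    args: dict[str, Any]


def first_failure(
    expected: list[Call],
    actual: list[Call],
    expected_answer: str,
    actual_answer: str,
) -> Optional[Failure]:
    """Return the earliest failure stage in a trace, or None if it passes.

    Assumes exact-match comparison against a gold trace; a real harness
    would likely be more lenient. Extra calls after the gold trace are ignored.
    """
    for i, want in enumerate(expected):
        # A missing step is counted as a selection failure in this sketch.
        if i >= len(actual) or actual[i].tool != want.tool:
            return Failure.TOOL_SELECTION
        got = actual[i]
        if set(got.args) != set(want.args):
            return Failure.ARGUMENT_NAMES
        if got.args != want.args:
            return Failure.ARGUMENT_VALUES
    if actual_answer.strip() != expected_answer.strip():
        return Failure.RESPONSE_SYNTHESIS
    return None
```

Reporting only the first failing stage keeps the categories mutually exclusive per trace, which is what makes per-stage error rates comparable across models.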

Implications

Agent layer → lifecycle → orchestration thread. VAKRA’s failure taxonomy is a practical design checklist for agentic systems. The tool-selection / argument-error tradeoff (few tools → argument errors; many tools → selection errors) suggests that per-domain tool namespacing — limiting the candidate set at query time rather than presenting all 116+ tools — may matter more than model capability. Policy-constrained multi-source tasks are the hardest category and closest to real enterprise deployments, where routing rules (“use the internal retriever for regulated data”) are ubiquitous.
Enterprise deployment battleground thread. VAKRA formalizes the gap between “can call an API” and “can orchestrate a workflow.” Enterprise buyers asking agents to automate multi-step business processes (credit memos, KYC, compliance checks) face exactly the 3–7-hop multi-source scenarios where current models degrade most sharply. The benchmark gives procurement teams a concrete test class rather than relying on vendor-supplied benchmarks.
Watch: whether orchestration platforms (Symphony, Anthropic Managed Agents) adopt VAKRA-style stage-wise error categorization for their own eval suites, and whether the IBM leaderboard produces model-specific guidance for tool-set sizing.
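
Per-domain namespacing and policy routing, as discussed above, amount to filtering the tool registry before the model sees it. A minimal sketch, assuming the query's domain is already known (for example from an upstream classifier, which is not shown) and that each tool carries a domain and a kind. The `Tool` type, `candidate_tools`, the `POLICY` table, and the tool names are hypothetical and do not come from VAKRA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tool:
    """Registry entry (hypothetical shape)."""
    name: str
    domain: str
    kind: str  # "api" or "retriever"


# Policy table: domain -> tool kinds allowed for that domain.
# Mirrors a rule like "only use document retrievers for technology queries".
POLICY: dict[str, set[str]] = {"technology": {"retriever"}}


def candidate_tools(
    registry: list[Tool],
    domain: str,
    policy: Optional[dict[str, set[str]]] = None,
) -> list[Tool]:
    """Return only the tools the model should see for a query in `domain`.

    Step 1 narrows the registry to the query's domain (namespacing).
    Step 2 drops tool kinds the policy forbids for that domain.
    Domains without a policy entry are left unrestricted.
    """
    allowed = (policy or {}).get(domain)
    return [
        t for t in registry
        if t.domain == domain and (allowed is None or t.kind in allowed)
    ]


registry = [
    Tool("get_repo_stats", "technology", "api"),
    Tool("search_tech_docs", "technology", "retriever"),
    Tool("get_loan_status", "finance", "api"),
]

tech = candidate_tools(registry, "technology", POLICY)
print([t.name for t in tech])  # ['search_tech_docs']
```

Enforcing the policy in the filter rather than in the prompt means a forbidden tool cannot be selected at all, so the constraint no longer depends on the model following instructions.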
