2026-02-12 · HuggingFace

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

Tags: protocols, agents, research

Source: HuggingFace · Date: 2026-02-12 · URL: https://huggingface.co/blog/openenv-turing

Summary

Research and framework tutorial from Meta, Hugging Face, and Turing. OpenEnv is a framework for evaluating agents against real systems rather than simulations, using MCP tool interfaces, with Calendar Gym as a concrete testbed. Key finding: agent success drops from ~90% when tasks supply explicit IDs to ~40% when targets are described in natural language, which makes ambiguity the dominant failure mode. Over 50% of failures are execution-quality issues (malformed arguments, wrong ordering) rather than wrong tool selection. Three bottlenecks stand out: multi-step chaining, ambiguity handling, and argument formatting.
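To make the split between tool selection and execution quality concrete, here is a minimal sketch of how one episode might be labeled. It is not the OpenEnv or Calendar Gym scoring code; the `ToolCall` type, the `classify_episode` function, and the label names are all hypothetical, and the rule of comparing an agent's calls against a single reference trace is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation: a tool name plus its keyword arguments (hypothetical type)."""
    name: str
    arguments: dict[str, Any]


def classify_episode(expected: list[ToolCall], actual: list[ToolCall]) -> str:
    """Return a coarse label for one episode, using an assumed taxonomy."""
    if actual == expected:
        return "success"

    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]

    # Different set of tools used: a selection failure.
    if sorted(actual_names) != sorted(expected_names):
        return "wrong_tool_selection"

    # Same tools, different sequence: an execution-quality failure.
    if actual_names != expected_names:
        return "wrong_ordering"

    # Same tools in the same order, so the arguments must differ.
    return "malformed_arguments"
```

Under this scheme, `wrong_ordering` and `malformed_arguments` together correspond to the execution-quality bucket described above, while `wrong_tool_selection` is the separate, smaller bucket.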

Implications

Thread: agentic patterns / open-weights ecosystem health. The gap between explicit IDs and natural-language references (~90% → ~40%) is a crucial calibration point: agents work well when inputs are machine-readable but fail when they must resolve references the way humans do. This is a real production blocker, not a benchmark artifact. The OpenEnv + MCP interface design is also significant: because MCP is the standard tool interface, evaluation environments built for OpenEnv are directly usable in production MCP setups, with no sim-to-real gap. Watch whether Calendar Gym becomes a reference benchmark for agentic tool-use evaluation alongside WebArena and SWE-bench.
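The explicit-ID versus natural-language gap can be illustrated with a small sketch. An explicit ID resolves to exactly one event, while a description may match several, and the agent then has to ask for clarification or guess. Everything here is an assumption for illustration: the `Event` type, the sample events, the naive keyword matching, and the `update_event` tool name with its argument fields are hypothetical and not taken from Calendar Gym. Only the outer `tools/call` request shape follows the MCP JSON-RPC convention.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    event_id: str
    title: str


# Hypothetical calendar contents.
EVENTS = [
    Event("evt_1", "Team sync"),
    Event("evt_2", "Design sync"),
    Event("evt_3", "Lunch"),
]


def resolve_reference(events: list[Event], reference: str) -> list[Event]:
    """Return candidate events for an explicit ID or a free-text description."""
    by_id = [e for e in events if e.event_id == reference]
    if by_id:
        return by_id
    # Naive keyword overlap; a real agent would need richer matching.
    words = set(reference.lower().split())
    return [e for e in events if words & set(e.title.lower().split())]


def plan_update(events: list[Event], reference: str, new_title: str) -> dict[str, Any]:
    """Build an MCP-style tool call only when the reference is unambiguous."""
    candidates = resolve_reference(events, reference)
    if len(candidates) != 1:
        # Zero or several matches: surface the ambiguity instead of guessing.
        return {"action": "clarify", "candidates": [e.event_id for e in candidates]}
    return {
        "action": "call",
        "request": {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/call",
            "params": {
                "name": "update_event",  # hypothetical tool name
                "arguments": {"event_id": candidates[0].event_id, "title": new_title},
            },
        },
    }
```

Calling `plan_update(EVENTS, "evt_2", "Design review")` yields a single well-formed call, whereas `plan_update(EVENTS, "the sync meeting", "Design review")` matches two events and returns a clarification request. That second path is the one where the reported success rate falls.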
