2026-06-04 · HuggingFace

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

models · enterprise · research · commentary

Source: HuggingFace
Date: 2026-06-04
URL: https://huggingface.co/blog/ServiceNow-AI/eva-bench-data

Summary

ServiceNow-AI released EVA-Bench Data 2.0, an open-source (MIT) benchmark for enterprise voice agents covering 213 evaluation scenarios across 121 tools in three domains: Airline Customer Service Management (50 scenarios), Enterprise IT Service Management (80), and Healthcare HR Service Delivery (83). The dataset roughly quadruples the original release’s scenario coverage and is published via HuggingFace.

Implications

Agentic engineering / eval infrastructure: One of the more rigorous publicly available enterprise-domain agent benchmarks. Multi-tool, multi-domain coverage across realistic service workflows is more useful than single-domain evals for validating production agent pipelines.
Dev tooling / agent frameworks: The benchmark explicitly frames failures as domain-specific (vocabulary, workflow complexity), and its design acknowledges that a single aggregate score is misleading. Methodologically relevant for anyone building eval harnesses.
Worth tracking: as voice-agent and task-agent boundaries blur, this dataset likely becomes a useful prior for tool-call accuracy evaluation beyond strictly voice use cases.
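
To make the per-domain point concrete, here is a minimal Python sketch of reporting pass rates by domain next to the aggregate. The scenario counts come from the summary above. The domain keys, the `pass_rates` helper, and the pass counts are hypothetical and invented for illustration only; they are not EVA-Bench results and not part of its tooling.

```python
from collections import defaultdict

# Scenario counts per domain, as reported in the summary above.
# The short domain keys are hypothetical labels, not dataset identifiers.
SCENARIOS_PER_DOMAIN = {
    "airline_csm": 50,
    "enterprise_itsm": 80,
    "healthcare_hrsd": 83,
}

# Hypothetical pass counts, invented purely for illustration
# (NOT actual benchmark results).
HYPOTHETICAL_PASSES = {
    "airline_csm": 45,
    "enterprise_itsm": 64,
    "healthcare_hrsd": 41,
}


def pass_rates(results):
    """Take (domain, passed) pairs; return (per-domain rates, aggregate rate)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for domain, passed in results:
        totals[domain] += 1
        passes[domain] += int(passed)
    per_domain = {d: passes[d] / totals[d] for d in totals}
    aggregate = sum(passes.values()) / sum(totals.values())
    return per_domain, aggregate


# Expand the counts into one (domain, passed) record per scenario.
results = [
    (domain, i < HYPOTHETICAL_PASSES[domain])
    for domain, n in SCENARIOS_PER_DOMAIN.items()
    for i in range(n)
]

per_domain, aggregate = pass_rates(results)
print(f"aggregate: {aggregate:.3f}")
for domain, rate in per_domain.items():
    print(f"{domain}: {rate:.3f}")
```

With these made-up numbers the aggregate is about 0.70, while the per-domain rates range from 0.90 down to under 0.50, which is the kind of spread a single score would hide.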
