Gaia2 and ARE: Empowering the community to study agents
Source: HuggingFace Date: 2025-09-22 URL: https://huggingface.co/blog/gaia2
Summary
Research benchmark release: Meta and HF launch GAIA2, an upgraded agentic evaluation benchmark with 1,000+ human-created scenarios covering read-write tasks, multi-step tool use, ambiguity handling, temporal reasoning, and agent-to-agent collaboration. It is paired with ARE (Agent Research Environments), an open-source framework with a smartphone mock-up environment and structured trace recording. Results: GPT-5 with high reasoning leads; Kimi K2 is the best open-source model; time-sensitive reasoning remains the hardest category. The dataset is licensed under CC BY 4.0 and the framework under MIT.
Implications
Open-weights ecosystem health. Kimi K2 ranking as the top open-source performer on GAIA2 is a concrete benchmark anchor: a reference point that separates genuinely capable open-weights agents from those that only pass the simpler tasks of the original GAIA. Expect future open-weights agent releases to cite GAIA2 as a target.
Model release cadence: agent thread. ARE's trace recording and customizable scenario infrastructure make agent evaluation reproducible and forkable, closing a feedback loop that previously required expensive human evaluation. This lowers the cost of iterating on open-weights agent training, which should accelerate the cadence of agent model releases.
HF as open-source ML hub. By co-hosting the GAIA2 dataset (CC BY 4.0) and the ARE framework (MIT) alongside the evaluation tooling, HF reinforces its position as the substrate for open agentic research, not just for model weights.