2025-02-04 · HuggingFace

DABStep: Data Agent Benchmark for Multi-step Reasoning

agents · models

Source: HuggingFace
Date: 2025-02-04
URL: https://huggingface.co/blog/dabstep

Summary

Benchmark release from Adyen and Hugging Face: DABStep, 450+ real-world data analysis tasks drawn from Adyen payment workloads, each requiring multi-step agentic reasoning (no single-shot solutions). Best results on the hard set: o3-mini at 16% accuracy ($0.198/task), DeepSeek R1 at 13% ($0.007/task), Claude 3.5 Sonnet at 12%, and DeepSeek V3 at 6%. All of these models underperform badly on real financial data analysis tasks despite strong general benchmark scores.

Implications

Thread: open-weights ecosystem health / agentic patterns. The 16% ceiling for the best model on a real-world financial data analysis benchmark is a sobering calibration. It directly challenges “AI agents can do data analyst work” claims: the gap between benchmark scores and production task completion remains large. On cost, R1 at $0.007/task versus o3-mini at $0.198/task (roughly 28x cheaper per attempt) makes DeepSeek R1 look compelling even at three points lower accuracy. The instruction-following failures of reasoning models in agentic contexts are a known issue worth watching: raw reasoning capability doesn’t automatically translate to reliable tool use.
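One way to make the cost/accuracy trade-off concrete is to divide cost per task by accuracy, giving an expected spend per correctly solved hard task. The sketch below is a back-of-envelope calculation and is not part of the DABStep methodology. It uses only the two models for which both figures are quoted above, and it assumes every attempt costs the same whether or not it succeeds. The function name `cost_per_solved_task` is made up for illustration.

```python
# Figures quoted in the summary: (hard-set accuracy, USD per task).
# Only models with both numbers reported are included.
reported = {
    "o3-mini": (0.16, 0.198),
    "DeepSeek R1": (0.13, 0.007),
}


def cost_per_solved_task(accuracy: float, cost_per_task: float) -> float:
    """Expected spend per correct answer (assumption: flat cost per attempt)."""
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return cost_per_task / accuracy


per_solved = {
    name: cost_per_solved_task(acc, cost)
    for name, (acc, cost) in reported.items()
}

for name, usd in per_solved.items():
    print(f"{name}: ${usd:.2f} per solved hard task")

# How many times more o3-mini costs per solved task than DeepSeek R1.
ratio = per_solved["o3-mini"] / per_solved["DeepSeek R1"]
print(f"ratio: about {ratio:.0f}x")
```

Under these assumptions the gap narrows slightly from the raw per-attempt price difference (about 28x) to about 23x per solved task, roughly $1.24 versus $0.05. This says nothing about which tasks each model solves, only about average spend per success.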
