2025-06-13 · Anthropic

How we built our multi-agent research system

pricing · agents · models · research

Source: Anthropic Engineering
Date: 2025-06-13
URL: https://www.anthropic.com/engineering/multi-agent-research-system

Summary

Anthropic built an orchestrator-worker research system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents working in parallel. On Anthropic’s internal research eval it outperformed single-agent Opus 4 by 90.2%, while multi-agent runs used roughly 15x the tokens of a chat interaction. Tool description quality was a primary failure mode: poor descriptions sent agents “down completely wrong paths.” Because agent runs are stateful and minor failures compound unpredictably, the team required distributed checkpointing and graceful error handling.
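
The orchestrator-worker shape described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `plan_subtasks`, `run_subagent`, and `orchestrate` are hypothetical names, and the worker is a stub standing in for a real model call.

```python
import asyncio


def plan_subtasks(query: str) -> list[str]:
    """Hypothetical lead-agent planning step: split a query into subtasks."""
    return [f"{query} / aspect {i}" for i in range(1, 4)]


async def run_subagent(subtask: str) -> str:
    """Hypothetical worker: a stub standing in for one subagent's model call."""
    await asyncio.sleep(0)  # placeholder for network and model latency
    return f"findings for: {subtask}"


async def orchestrate(query: str) -> str:
    subtasks = plan_subtasks(query)
    # Fan out: subagents run concurrently; gather returns results in input order.
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # Fan in: the lead agent would synthesize here; this sketch just joins lines.
    return "\n".join(results)


if __name__ == "__main__":
    print(asyncio.run(orchestrate("battery recycling")))
```

The point of the structure is that the lead only plans and synthesizes, while the slow, token-heavy exploration happens in parallel workers.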

Implications

The parallel-agent harness thread. The 90.2% improvement figure is the strongest published claim for multi-agent over single-agent performance on a research task. The 15x token cost makes this a premium configuration: justified for high-value research tasks, not for general deployment. This pairs with the building-c-compiler post as a second large-scale multi-agent success story.
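
The "premium configuration" judgment can be written as a rough break-even check. In this sketch only the ~15x multiplier comes from the summary; the baseline token count, the blended price, and the task value are hypothetical inputs, and a single price per million tokens ignores the usual split between input and output pricing.

```python
# ~15x comes from the summary above; every other number is supplied by the caller.
MULTI_AGENT_TOKEN_MULTIPLIER = 15


def multi_agent_cost_usd(baseline_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimated spend if a task that needs `baseline_tokens` is run multi-agent."""
    tokens = baseline_tokens * MULTI_AGENT_TOKEN_MULTIPLIER
    return tokens / 1_000_000 * usd_per_million_tokens


def worth_running_multi_agent(
    task_value_usd: float, baseline_tokens: int, usd_per_million_tokens: float
) -> bool:
    """Crude routing rule: only pay the multiplier when the task value covers it."""
    return task_value_usd > multi_agent_cost_usd(baseline_tokens, usd_per_million_tokens)
```

A real router would also weigh latency and how parallelizable the task is; this only captures the cost side of the argument.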

Tool ergonomics as a primary failure mode. The “agent-tool interfaces are as critical as human-computer interfaces” framing, backed by evidence that tool descriptions caused catastrophic path errors, is the highest-stakes validation of the ACI-over-prompting principle (ACI: agent-computer interface). Poor tool descriptions don’t just reduce accuracy; they can invalidate entire agent runs.
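
To make the contrast concrete, here are two hypothetical tool definitions in the common name / description / input_schema layout, plus a small lint pass. The tool names, the 12-word threshold, and the `description_problems` helper are all assumptions for illustration; the post does not prescribe any such check.

```python
# A vague definition: the agent cannot tell when, or on what, to use it.
VAGUE_TOOL = {
    "name": "search",
    "description": "Searches stuff.",
    "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}},
}

# A clearer definition: states scope, when not to use it, and what comes back.
CLEAR_TOOL = {
    "name": "search_internal_wiki",
    "description": (
        "Full-text search over the internal engineering wiki. Use for questions "
        "about company-specific systems; do not use for public web facts. "
        "Returns up to 10 page titles with snippets."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords; 2-6 words work best."}
        },
        "required": ["query"],
    },
}


def description_problems(tool: dict, min_words: int = 12) -> list[str]:
    """Heuristic lint: flag descriptions too thin to guide tool selection."""
    problems = []
    if len(tool.get("description", "").split()) < min_words:
        problems.append("description too short to say when to use the tool")
    properties = tool.get("input_schema", {}).get("properties", {})
    for name, spec in properties.items():
        if not spec.get("description"):
            problems.append(f"parameter '{name}' has no description")
    return problems
```

A word count is a weak proxy for quality, but even a check this crude catches the kind of one-line description that leaves an agent guessing.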

Compounding failures in production. “Stateful and errors compound” is the key reliability insight from this post. Unlike single-call inference failures, agentic failures cascade, so checkpointing is not optional infrastructure; it is the mechanism that keeps minor bugs from corrupting long runs. This is the production engineering lesson that eval-focused posts tend to underemphasize.
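
A minimal sketch of the checkpoint-and-resume idea, assuming a single process and a local JSON file (so not the distributed checkpointing the post refers to). `run_with_checkpoints` and the two demo steps are hypothetical; the second step fails once to simulate a transient error.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, fn) steps in order, saving state after each one.

    On a rerun, steps already recorded in the checkpoint are skipped, so a
    failure late in the run does not repeat earlier work.
    """
    state = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for name, fn in steps:
        if name in state:
            continue  # completed in a previous attempt
        state[name] = fn(state)
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)  # swap in the new checkpoint in one step
    return state


calls = {"search": 0, "summarize": 0}


def search(state):
    calls["search"] += 1
    return ["doc-a", "doc-b"]


def summarize(state):
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise RuntimeError("simulated transient failure")
    return f"{len(state['search'])} documents summarized"


steps = [("search", search), ("summarize", summarize)]
path = os.path.join(tempfile.mkdtemp(), "run.json")

try:
    run_with_checkpoints(steps, path)  # first attempt fails in summarize
except RuntimeError:
    pass
final = run_with_checkpoints(steps, path)  # resumes; search is not repeated
```

Writing to a temporary file and renaming it means a crash mid-write leaves the previous checkpoint intact, which matters when the checkpoint itself is what protects the run.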
