2026-01-09 · Anthropic

Demystifying evals for AI agents

agents, models

Source: Anthropic Engineering
Date: 2026-01-09
URL: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

Summary

Anthropic’s practical guide to building agent evals. Its advice: start with 20-50 cases drawn from real production failures; combine code-based graders (fast but brittle), model-based graders (flexible but non-deterministic), and human graders; and track both pass@k (at least one of k attempts succeeds) and pass^k (all k attempts succeed). The post also notes that SWE-Bench Verified is nearing saturation, with frontier models scoring above 80%, which is the point at which a benchmark stops providing much signal about further improvement.

Implications

The eval-reliability thread. This is Anthropic’s customer-facing eval methodology. Read together with the infrastructure-noise post (on resource-allocation confounders), it rounds out the eval guidance: instrument your environment, collect real failures, and use multiple graders. The transcript-reading recommendation (“you won’t know if your graders are working well unless you read the transcripts”) is the manual verification step that automated pipelines skip at their peril.
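One way to picture the multi-grader advice is the minimal Python sketch below. It is not code from the post: the names (GradeResult, code_grader, stub_model_grader, grade), the literal "TESTS PASSED" marker, and the rule that every grader must pass are all assumptions made for illustration, and the model-based grader is replaced by a keyword stub so the example runs offline. The point it tries to show is that grader disagreement is a cheap trigger for reading the transcript by hand.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradeResult:
    grader: str
    passed: bool
    note: str = ""


# A grader takes the agent transcript and returns a GradeResult.
Grader = Callable[[str], GradeResult]


def code_grader(transcript: str) -> GradeResult:
    """Code-based check: fast and deterministic, but brittle to phrasing."""
    # Hypothetical marker string; a real check would inspect task outputs.
    passed = "TESTS PASSED" in transcript
    return GradeResult("code", passed, "looked for literal 'TESTS PASSED'")


def stub_model_grader(transcript: str) -> GradeResult:
    """Stand-in for a model-based grader.

    A real one would send the transcript and a rubric to an LLM. This stub
    uses a keyword heuristic purely so the sketch is runnable.
    """
    lowered = transcript.lower()
    passed = "passed" in lowered and "failed" not in lowered
    return GradeResult("model_stub", passed, "keyword heuristic, not an LLM")


def grade(transcript: str, graders: list[Grader]) -> dict:
    """Run every grader; flag the transcript for human review on disagreement."""
    results = [g(transcript) for g in graders]
    verdicts = {r.passed for r in results}
    return {
        "results": results,
        # Assumed aggregation rule: the task passes only if all graders agree it did.
        "passed": all(r.passed for r in results),
        "needs_human_review": len(verdicts) > 1,
    }


if __name__ == "__main__":
    outcome = grade("All 12 tests passed.", [code_grader, stub_model_grader])
    for r in outcome["results"]:
        print(r.grader, r.passed, r.note)
    print("needs_human_review:", outcome["needs_human_review"])
```

In the usage example the transcript says "All 12 tests passed." in sentence case, so the brittle string match fails while the looser stub passes. That disagreement is what sets needs_human_review, which is the situation where reading the transcript tells you which grader is wrong.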

pass^k as a production metric. pass^k (all k trials succeed) is more relevant than pass@k for production deployments: a task that succeeds only 50% of the time is not production-ready, even though its pass@k climbs toward 100% as k grows. This is a useful reframe from benchmark-style thinking (best-of-k) to deployment thinking (consistent reliability).
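The gap between the two metrics is easy to see with a little arithmetic. The sketch below assumes each trial is independent and succeeds with the same probability p, a simplification that the post does not necessarily make. Under that assumption pass@k = 1 - (1 - p)^k and pass^k = p^k. The function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    # Compare the two metrics across per-trial success rates and trial counts.
    for p in (0.5, 0.9, 0.99):
        for k in (1, 5, 10):
            print(
                f"p={p:.2f} k={k:2d}  "
                f"pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}"
            )
```

For the 50% task mentioned above, five trials give pass@k of about 0.97 but pass^k of about 0.03: the same agent looks nearly solved by one metric and nearly unusable by the other. At k = 1 the two metrics coincide and both equal p.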

Benchmark saturation acknowledged. It is significant that Anthropic explicitly flags SWE-Bench Verified as saturating above 80%: this signals that the industry needs new evaluation surfaces, which feeds directly into the infrastructure-noise post’s argument for rigor in eval methodology.
