2026-02-05 · Anthropic

Quantifying infrastructure noise in agentic coding evals

Tags: agents, models

Source: Anthropic Engineering
Date: 2026-02-05
URL: https://www.anthropic.com/engineering/infrastructure-noise

Summary

Anthropic researchers found that infrastructure resource allocation materially shifts agentic coding benchmark scores: container headroom alone caused a 6-point swing on Terminal-Bench 2.0 (p < 0.01), and infrastructure error rates dropped from 5.8% to 0.5% as resources increased. Up to a multiplier of roughly 3x, extra resources mainly stabilize results; beyond it, they start enabling agents to solve previously unsolvable problems, which changes what the benchmark actually measures.

Implications

The eval-reliability thread. This is direct methodology work from Anthropic on their own harness — infrastructure documentation is now a prerequisite for trusting any leaderboard delta. Small score gaps between models (often 1-3 points) are within the noise floor that container config alone can produce, which puts pressure on every vendor publishing evals without specifying allocation tiers.
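
As a rough illustration of why a 1-3 point gap deserves suspicion, the sketch below checks a pass-rate difference against sampling noise alone using a pooled two-proportion z-test. This is a generic statistical check, not the method from the Anthropic post; the function name and the trial counts are hypothetical, and it assumes independent trials, which repeated runs of the same tasks only approximate. Infrastructure noise would come on top of whatever this test reports.

```python
import math


def two_proportion_p_value(pass_a: int, pass_b: int, n_a: int, n_b: int) -> float:
    """Two-sided p-value for a difference in pass rates (pooled z-test).

    Hypothetical helper: assumes each trial is an independent pass/fail draw.
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative numbers only: a 2-point gap over 500 trials per model.
p = two_proportion_p_value(pass_a=260, pass_b=250, n_a=500, n_b=500)
print(f"p-value for a 2-point gap: {p:.3f}")
```

With these made-up counts the gap is not distinguishable from sampling noise at conventional thresholds, before any container-configuration effect is considered.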

Claude Code harness design. The recommendation to specify both guaranteed allocation and hard kill thresholds separately is a concrete design decision for any shop running self-hosted Claude Code evals. The 3x multiplier finding gives a calibration starting point.
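
A minimal sketch of what specifying the two values separately could look like in a self-hosted harness. The class and field names are hypothetical and not taken from the Anthropic post or any real tool; the default headroom of 3.0 simply reuses the ~3x figure as a starting point to calibrate from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical per-task container resources for an eval harness."""

    guaranteed_cpus: float       # allocation reserved for the task
    guaranteed_mem_mib: int      # memory reserved for the task, in MiB
    headroom: float = 3.0        # hard threshold as a multiple of the guarantee (assumed default)

    def __post_init__(self) -> None:
        if self.headroom < 1.0:
            raise ValueError("headroom must be >= 1.0 (limit cannot be below the guarantee)")

    @property
    def cpu_limit(self) -> float:
        """Hard CPU ceiling derived from the guarantee."""
        return self.guaranteed_cpus * self.headroom

    @property
    def mem_limit_mib(self) -> int:
        """Memory level at which the container would be killed."""
        return int(self.guaranteed_mem_mib * self.headroom)

    def describe(self) -> str:
        """One-line summary suitable for publishing next to eval results."""
        return (
            f"guaranteed: {self.guaranteed_cpus} CPU / {self.guaranteed_mem_mib} MiB; "
            f"hard limit: {self.cpu_limit} CPU / {self.mem_limit_mib} MiB "
            f"({self.headroom}x headroom)"
        )


spec = ResourceSpec(guaranteed_cpus=2.0, guaranteed_mem_mib=4096)
print(spec.describe())
```

Keeping the guarantee and the limit as distinct fields makes the headroom multiplier explicit, so it can be reported alongside scores rather than left implicit in the container runtime's defaults.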

Demystifying-evals thread. Pairs directly with Anthropic’s “Demystifying Evals for AI Agents” post — the methodological rigor angle is consistent: Anthropic is building a public case that eval infrastructure deserves the same scrutiny as model weights.
