2025-01-06 · Anthropic

Claude SWE-Bench Performance

Source: Anthropic Engineering
Date: 2025-01-06
URL: https://www.anthropic.com/engineering/swe-bench-sonnet

Summary

Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, surpassing the prior state of the art of 45%, using a minimal two-tool harness: a Bash executor and a file editor. The team made the file editor require absolute file paths to prevent navigation errors, and let the model determine its own approach rather than constraining it to a rigid workflow. Many successful runs exceeded 100k tokens. Because the benchmark's tests are hidden from the model, some runs were false positives: the model judged its fix successful when it failed grading.
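
The absolute-path requirement can be illustrated with a small check at the tool boundary. This is a minimal sketch, not the harness's actual code: the function name `check_absolute` and the error wording are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def check_absolute(path: str) -> str:
    """Return the path unchanged if it is absolute, else raise ValueError.

    Hypothetical helper: a file-editing tool could call this before
    touching the filesystem, so a relative path fails loudly instead of
    resolving against whatever directory the agent has moved into.
    """
    if not PurePosixPath(path).is_absolute():
        raise ValueError(
            f"The path {path!r} is not an absolute path; it should start with '/'."
        )
    return path
```

Rejecting the call with a descriptive error, rather than silently resolving the path, gives the model a message it can act on in its next turn.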

Implications

The eval-reliability thread. The false-positive issue from hidden test specs is a concrete benchmark reliability problem: the model cannot see the tests it is graded against, so a run can look successful from the inside and still fail. This pairs with the infrastructure-noise post: the eval environment itself introduces confounders beyond resource allocation.

Tool interface over model capability. “Much more attention should go into designing tool interfaces for models” is SWE-bench-sourced validation of the principle from the building-effective-agents post that the agent-computer interface (ACI) deserves as much care as the prompt. The absolute-filepath requirement appears in both posts as a named example, which suggests the lesson applies widely.

Cost floor for serious coding tasks. Successful runs exceeding 100k tokens set a cost baseline for real-world software engineering tasks; this isn't a benchmark quirk. Any shop deploying Claude Code for non-trivial issue resolution should budget for high-token runs as the norm, not the exception.
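
The budgeting arithmetic can be sketched as follows. This is a rough illustration under stated assumptions: the function name `run_cost_usd` is hypothetical, the per-million-token prices in the usage lines are placeholders rather than real rates, and the model ignores caching and any other pricing detail.

```python
def run_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Estimate the cost of one agent run from its token totals.

    Prices are expressed in USD per million tokens and must be supplied
    by the caller; nothing here encodes a real price list.
    """
    total = input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    return total / 1_000_000


# Placeholder prices, chosen only to make the arithmetic easy to follow.
per_run = run_cost_usd(100_000, 10_000, usd_per_mtok_in=1.0, usd_per_mtok_out=2.0)
per_batch = per_run * 500  # e.g. a backlog of 500 issues at one run each
```

In a multi-turn agent loop the conversation history is typically resent on each turn, so billed input tokens can exceed the size of the final transcript; the token totals passed in should come from measured usage, not from transcript length.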