Why we no longer evaluate SWE-bench Verified
Source: OpenAI
Date: 2026-02-23
URL: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
Summary
OpenAI’s February 2026 explanation for discontinuing SWE-bench Verified as an evaluation benchmark for its coding models. SWE-bench Verified, which tested models on real GitHub issues requiring code fixes, had been the primary public benchmark for coding AI performance through 2024-2025. OpenAI’s decision to stop evaluating against it reflected a known problem in competitive AI benchmarking: once a benchmark becomes the primary public comparison metric, labs optimize against it, and the benchmark loses its ability to distinguish genuine capability from benchmark-specific tuning.
Implications
Benchmark saturation and Goodhart’s law. When SWE-bench Verified scores became the headline metric for coding model announcements, labs had an incentive to favor training approaches that maximized SWE-bench performance specifically, and those gains may or may not generalize to real-world software engineering tasks. OpenAI’s public acknowledgment of this problem was notable: admitting a benchmark is compromised is a credibility move, but it also leaves open the question of what evaluation comes next.
Thread: AI evaluation methodology. Sits alongside the SimpleQA evaluation introduction (October 2024), the Codex security evaluation questions, and the broader discussion of how to evaluate agentic coding capabilities that had grown through 2025. The SWE-bench departure is part of a larger reckoning with how to measure coding AI performance honestly.
Watch: What OpenAI proposed as a replacement evaluation methodology for coding capabilities, and whether the broader industry followed in moving away from SWE-bench Verified or continued using it despite the known limitations.