2024-08-13 · OpenAI

Introducing SWE-bench Verified

Source: OpenAI
Date: 2024-08-13
URL: https://openai.com/index/introducing-swe-bench-verified

Summary

OpenAI introduces SWE-bench Verified, a human-validated subset of the SWE-bench software engineering benchmark. Human annotators screened each of its 500 samples to confirm that the issue description is well specified and that the unit tests fairly judge a correct solution. The original SWE-bench included tasks with ambiguous issue descriptions or overly specific or broken tests, which distorted results; the Verified subset offers a cleaner evaluation surface for comparing AI coding agents on real GitHub issue resolution tasks.

Implications

The coding benchmark credibility thread. SWE-bench Verified is OpenAI’s attempt to own the credible coding benchmark surface just as the coding agent race heats up (Devin launch, SWE-agent, Copilot Workspace). By publishing the Verified split, OpenAI defines what a legitimate SWE-bench result looks like, which gives it implicit authority over how competitors’ coding claims are evaluated.

Benchmark governance as a competitive tool. Whoever publishes the benchmark controls the narrative. Because SWE-bench Verified arrived in August 2024, any model claiming SWE-bench leadership after that date is expected to report on the Verified split or risk being dismissed as cherry-picking. This shapes how Anthropic (Claude coding), Google (Gemini Code), and agent frameworks (SWE-agent, Aider) publish their numbers through the rest of 2024 and into 2025.
