2025-04-02 · OpenAI

PaperBench: Evaluating AI’s Ability to Replicate AI Research

models · research

Source: OpenAI · Date: 2025-04-02 · URL: https://openai.com/index/paperbench

Summary

OpenAI introduced PaperBench, an evaluation that tests whether AI systems can replicate published AI research papers end to end: reading the paper, implementing the experiments, and reproducing the results. Because the benchmark measures the full replication pipeline rather than individual tasks, it serves as a system-level proxy for AI-assisted research capability.

Implications

Research/eval thread. PaperBench is a significant capability evaluation because research replication is a complex, multi-step task that requires code understanding, experimental design, and debugging. If AI systems can reliably replicate AI research, they are approaching the ability to conduct AI research autonomously, which is the recursive self-improvement threshold that safety researchers treat as a critical capability milestone. That OpenAI published this benchmark may indicate that its internal models were already achieving meaningful replication rates and that it wanted a public metric for the field.
