2025-04-10 · OpenAI

BrowseComp: a benchmark for browsing agents

Source: OpenAI · Date: 2025-04-10 · URL: https://openai.com/index/browsecomp

Summary

OpenAI research paper from April 2025 introducing BrowseComp, a benchmark that evaluates whether agents can browse the web to answer questions requiring multi-step retrieval across multiple sources, as opposed to single-hop lookups. BrowseComp questions are constructed to be unsolvable without genuine web navigation: answering them means finding specific facts scattered across multiple pages, following links, and synthesizing the retrieved content. The benchmark was designed around o3 and GPT-4o with browsing tools.

Implications

Benchmark design as competitive positioning. BrowseComp was released alongside o3’s browsing capabilities and showed o3 performing significantly better than GPT-4o on the benchmark. As with AIME for reasoning, OpenAI is demonstrating leadership on a benchmark that plays to its models’ strengths, and in this case the benchmark is one it designed itself. For that reason, third-party evaluation should carry more weight than vendor-released benchmarks.

Browsing agents as the practical frontier. The gap between “language model” and “agent that can actually do research on the web” is large. BrowseComp quantifies part of that gap, specifically the multi-hop retrieval challenge. Models that score well here are measurably better at the kind of online research that makes an AI agent practically useful for knowledge work.

Thread: agent evaluation. BrowseComp sits alongside MLE-Bench (October 2024), GPQA, and SWE-bench among the benchmarks defining what “capable agent” means in the 2025 era. The fact that OpenAI stopped evaluating on SWE-bench Verified (February 2026) suggests that BrowseComp and similar custom evals are replacing the community benchmarks.

Watch: Whether BrowseComp gets adopted by third parties as a standard or remains primarily an OpenAI internal comparison tool.
