2026-05-07 · Anthropic

Anthropic Automated Weak-to-Strong Researcher

models · research

Source: Anthropic Alignment Science · Date: 2026-05-07 · URL: https://alignment.anthropic.com/2026/automated-w2s-researcher/

Summary

Anthropic’s alignment team published an Automated Alignment Researcher (AAR) that runs parallel teams of Claude instances in independent sandboxes. Each team proposes ideas, runs experiments, analyzes results, and shares findings. The system was evaluated on the weak-to-strong supervision problem: recovering a strong model’s ground-truth performance given only a weak supervisor. Success is measured by Performance Gap Recovered (PGR) on a 0 to 1 scale. The results suggest that automated research on outcome-gradable problems is already practical. Code: github.com/safety-research/automated-w2s-research.
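The summary names PGR without spelling it out. A minimal sketch follows, assuming the AAR post uses the standard definition from the weak-to-strong generalization literature: the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model recovers. The function and parameter names are illustrative and are not taken from the released code.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weakly supervised strong model.

    Assumed definition (standard in weak-to-strong work, not confirmed by the post):
        PGR = (weak_to_strong - weak) / (strong_ceiling - weak)

    weak            -- accuracy of the weak supervisor on held-out data
    weak_to_strong  -- accuracy of the strong model trained on weak labels
    strong_ceiling  -- accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # Without a positive gap there is nothing to recover and PGR is undefined.
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Illustrative numbers only: weak 0.60, weak-to-strong 0.70, ceiling 0.80.
pgr = performance_gap_recovered(0.60, 0.70, 0.80)
```

With these made-up numbers the student recovers about half of the gap. A PGR of 0 means the strong model does no better than its weak supervisor, and 1 means it matches the ground-truth ceiling. Values outside 0 to 1 are arithmetically possible when the student falls below the supervisor or exceeds the ceiling; this sketch does not clamp them.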

Implications

  • Self-improvement thread: connects to Dreaming (announced at the Code with Claude conference). Dreaming improves agents between sessions; the AAR does alignment research between human interventions. The pattern: delegation of the oversight function itself.
  • Compute → alignment: the AAR turns compute into alignment progress directly. This changes the economics of alignment research from “hire more researchers” to “allocate more compute.”
  • Agentic research pattern: parallel sandboxed agents doing independent research with shared findings is the same architecture as Symphony and Codex multi-agent. The alignment team is using the same orchestration patterns as the product team.
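The orchestration pattern in the last bullet (parallel teams, independent work, shared findings) can be sketched as below. This is a toy illustration under stated assumptions, not the AAR implementation: threads stand in for sandboxes, `grade` is a hypothetical outcome grader supplied by the caller, and `Finding`, `FindingsBoard`, `run_team`, and `run_research` are invented names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Finding:
    team: int
    idea: str
    score: float  # outcome-graded metric, e.g. PGR


class FindingsBoard:
    """Thread-safe shared log that every team can post to and read from."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._findings: List[Finding] = []

    def post(self, finding: Finding) -> None:
        with self._lock:
            self._findings.append(finding)

    def snapshot(self) -> List[Finding]:
        with self._lock:
            return list(self._findings)

    def best(self) -> Optional[Finding]:
        return max(self.snapshot(), key=lambda f: f.score, default=None)


def run_team(team_id: int, ideas: List[str], grade: Callable[[str], float],
             board: FindingsBoard) -> None:
    # Each team works through its own ideas independently and shares every result.
    for idea in ideas:
        board.post(Finding(team_id, idea, grade(idea)))


def run_research(team_ideas: List[List[str]], grade: Callable[[str], float],
                 max_workers: int = 4) -> FindingsBoard:
    board = FindingsBoard()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_team, i, ideas, grade, board)
                   for i, ideas in enumerate(team_ideas)]
        for future in futures:
            future.result()  # re-raise any exception from a team
    return board
```

A caller would pass one list of candidate ideas per team plus a grading function, then read `board.best()` once all teams finish. The pattern depends on the problem being outcome-gradable: because `grade` returns a number, findings from different teams can be compared without a human in the loop.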
