Anthropic Automated Weak-to-Strong Researcher
Source: Anthropic Alignment Science
Date: 2026-05-07
URL: https://alignment.anthropic.com/2026/automated-w2s-researcher/
Summary
Anthropic’s alignment team published an Automated Alignment Researcher (AAR) that runs parallel teams of Claude instances in independent sandboxes. Each team proposes ideas, runs experiments, analyzes results, and shares findings. The system is evaluated on the weak-to-strong supervision problem: recovering the performance a strong model would reach with ground-truth supervision when only a weak supervisor is available. Success is measured by Performance Gap Recovered (PGR), on a 0 to 1 scale. The results suggest that automated research on outcome-gradable problems is already practical. Code: github.com/safety-research/automated-w2s-research.
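To make the metric concrete, here is a minimal sketch of how PGR is usually computed in the weak-to-strong generalization literature. It assumes the standard definition (the fraction of the gap between the weak supervisor and the strong ceiling that the weakly supervised strong model closes). The source only states the 0 to 1 scale, so the function name, argument names, and the example numbers below are illustrative assumptions, not values from the post.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Assumed standard definition: PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    weak            -- performance of the weak supervisor itself
    weak_to_strong  -- performance of the strong model trained on weak supervision
    strong_ceiling  -- performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak-supervisor performance")
    return (weak_to_strong - weak) / gap


# Made-up accuracies for illustration only; these are not results from the source.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80)
print(round(pgr, 2))
```

Under this definition, a PGR of 0 means the strong model did no better than its weak supervisor, and a PGR of 1 means it matched the ground-truth-trained ceiling. Measured values can fall slightly outside that range when a method underperforms the weak supervisor or overshoots the ceiling.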
Implications
- Self-improvement thread: connects to Dreaming (announced at the Code with Claude conference). Dreaming improves agents between sessions; the AAR does alignment research between human interventions. The common pattern is delegation of the oversight function itself.
- Compute → alignment: the AAR turns compute directly into alignment progress, at least on outcome-gradable problems. If that holds, the economics of alignment research shift from “hire more researchers” to “allocate more compute.”
- Agentic research pattern: parallel sandboxed agents doing independent research with shared findings is the same architecture as Symphony and Codex multi-agent. The alignment team is using the same orchestration patterns as the product team.
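The orchestration pattern named in the last bullet (parallel teams, isolated work, shared findings) can be sketched in a few lines. This is a toy illustration, not the AAR's implementation: `Finding`, `FindingsBoard`, `run_team`, and `toy_experiment` are hypothetical names, threads stand in for real sandboxes, and the scores are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from threading import Lock
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    team: str
    idea: str
    score: float  # outcome grade for the idea, e.g. a PGR-like number


class FindingsBoard:
    """Hypothetical shared store: the only state the teams have in common."""

    def __init__(self) -> None:
        self._lock = Lock()
        self._findings: List[Finding] = []

    def publish(self, finding: Finding) -> None:
        with self._lock:
            self._findings.append(finding)

    def best(self) -> Optional[Finding]:
        with self._lock:
            return max(self._findings, key=lambda f: f.score, default=None)


def toy_experiment(idea: str) -> float:
    # Placeholder grader: a made-up deterministic score, not a real experiment.
    return (len(idea) % 10) / 10


def run_team(team: str, idea: str, board: FindingsBoard) -> Finding:
    # Stand-in for one team's loop: propose -> run -> analyze -> share.
    finding = Finding(team=team, idea=idea, score=toy_experiment(idea))
    board.publish(finding)
    return finding


board = FindingsBoard()
proposals = {"team-1": "idea-a", "team-2": "idea-bb", "team-3": "idea-ccc"}

# Each team runs concurrently and touches no shared state except the board.
with ThreadPoolExecutor(max_workers=len(proposals)) as pool:
    futures = [pool.submit(run_team, t, i, board) for t, i in proposals.items()]
    findings = [f.result() for f in futures]

best = board.best()
print(best.team, best.idea)
```

The design point is that the work can be compared across teams without a human in the loop only because every idea reduces to a gradable outcome score, which is what makes outcome-gradable problems a natural fit for this pattern.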