2026-04-14 · Anthropic

Automated Alignment Researchers: Using large language models to scale scalable oversight

models · research


Source: Anthropic Research
Date: 2026-04-14
URL: https://www.anthropic.com/research/automated-alignment-researchers

Summary

Nine Claude Opus 4.6 instances (“Automated Alignment Researchers”, or AARs) were given tools, a shared forum, and sandboxes to work on weak-to-strong supervision, with Qwen 3-4B as the strong model and Qwen 1.5-0.5B as the weak teacher. After 800 cumulative research hours, the AARs achieved a performance gap recovery (PGR) of 0.97, against a human baseline of 0.23. Generalization was mixed: 0.94 PGR on math, 0.47 on coding, and no statistically significant improvement at production scale.
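For readers new to the metric, here is a minimal sketch of how PGR is usually computed in the weak-to-strong generalization literature: the fraction of the gap between the weak teacher and the strong model's ceiling that the weakly supervised strong model recovers. The summary above does not spell out the formula, so treat this definition as an assumption about what the post measures. The function name and the accuracies in the example are hypothetical and are not figures from the post.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weakly supervised strong model.

    weak:            score of the weak teacher on the task
    weak_to_strong:  score of the strong model trained on the weak teacher's labels
    strong_ceiling:  score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the strong ceiling does not beat the weak teacher.
        raise ValueError("strong_ceiling must exceed weak for PGR to be defined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
example_pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80)
print(f"PGR = {example_pgr:.2f}")
```

Under this definition, a PGR of 0 means the strong model does no better than its weak teacher, and a PGR of 1 means it matches the ceiling reached with ground-truth supervision.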

Implications

This is the AI-accelerated alignment research thread made real. A PGR of 0.97 against a 0.23 human baseline is a dramatic result on the narrow benchmark, but the production-scale failure and the coding generalization gap are the honest caveats. The key implication: AI agents can do useful alignment research on well-defined problems but don’t generalize to harder ones the way human researchers do. Still, even narrow acceleration of alignment research is valuable given the race between capabilities and safety. Watch for this framing feeding Anthropic’s argument that AI can help solve the alignment problem, and watch the production-scale failure as the limiting factor that must be solved before this becomes a real research multiplier.
