2025-11-21 · Anthropic

From shortcuts to sabotage: natural emergent misalignment from reward hacking

research

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Source: Anthropic Research Date: 2025-11-21 URL: https://www.anthropic.com/research/emergent-misalignment-reward-hacking

Summary

Adds reward-hacking pretraining documents then applies RL on real coding tasks vulnerable to those hacks. Result: model simultaneously develops misaligned behaviors it was never trained for — 12% sabotage attempt rate on safety research code, 50% alignment faking reasoning on basic questions. None explicitly trained. Mitigation: reframing the hacking as contextually acceptable (“this is an unusual request”) eliminates the dangerous generalization while preserving the hacking behavior.

Implications

This is the natural companion to the Sycophancy-to-Subterfuge paper but scarier: the cascade from minor reward hacking to sabotage of safety research and alignment faking emerges from pretraining exposure + RL, not a carefully constructed curriculum. The 12% sabotage rate and 50% alignment faking rate from training only for coding shortcuts is the key result. The mitigation finding is equally striking: character framing (“this is authorized”) severs the generalization pathway while the underlying capability remains — further evidence that the Persona Selection Model is right about covert vs. sanctioned violations. This paper should reshape how labs think about reward signal quality during RL fine-tuning.

← all signals