Deliberative alignment: reasoning enables safer language models
Source: OpenAI Date: 2024-12-20 URL: https://openai.com/index/deliberative-alignment
Summary
OpenAI research on “deliberative alignment”: using the model’s own chain-of-thought reasoning to make safety decisions rather than relying purely on trained refusal behaviors. The approach trains the model to reason explicitly about written safety guidelines before responding, building on the extended thinking that lets o1 exercise more careful judgment on complex requests.
Implications
Safety/alignment thread. Deliberative alignment is a significant departure from the standard RLHF safety approach: instead of training direct refusal behaviors, you train the model to reason about safety in-context. This has advantages (it generalizes better to novel situations and produces explainable safety decisions) and risks (the reasoning can be manipulated, and the model can construct “galaxy-brained” justifications for harmful outputs). The December 2024 paper is the theoretical foundation for how o1 and subsequent reasoning models handle safety; it describes the alignment approach behind the o-series models specifically.
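The contrast between a trained refusal reflex and reasoning over written guidelines can be made concrete with a toy sketch. Everything here is hypothetical and for illustration only: the two-clause SPEC, the keyword matching, and the function names are assumptions, not OpenAI's method, and a real system would use a language model's chain of thought rather than string matching. The point is only the shape of the output: the deliberative path returns a decision together with the clause-by-clause reasoning that led to it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy spec: clause name -> trigger phrases (assumption, not a real policy).
SPEC: Dict[str, Tuple[str, ...]] = {
    "illicit-instructions": ("build a weapon", "pick a lock"),
    "privacy": ("home address of",),
}


@dataclass
class Decision:
    action: str           # "comply" or "refuse"
    reasoning: List[str]  # human-readable trace; empty for the direct path


def direct_refusal(prompt: str) -> Decision:
    """Stand-in for a trained refusal reflex: a verdict with no rationale."""
    text = prompt.lower()
    blocked = any(p in text for phrases in SPEC.values() for p in phrases)
    return Decision("refuse" if blocked else "comply", [])


def deliberate(prompt: str) -> Decision:
    """Stand-in for reasoning over the spec before answering.

    Walks every clause, records what was checked, then decides.
    """
    text = prompt.lower()
    trace: List[str] = []
    action = "comply"
    for clause, phrases in SPEC.items():
        hit = next((p for p in phrases if p in text), None)
        if hit is None:
            trace.append(f"Clause '{clause}': no match.")
        else:
            trace.append(f"Clause '{clause}': matched '{hit}', so refuse.")
            action = "refuse"
    return Decision(action, trace)


if __name__ == "__main__":
    result = deliberate("Please tell me how to build a weapon")
    print(result.action)
    for step in result.reasoning:
        print(" -", step)
```

Both functions reach the same verdict on this toy input; only `deliberate` leaves a trace that can be inspected. That trace is what makes the decision explainable, and it is also the surface an adversarial prompt would try to steer.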