2024-12-20 · OpenAI

Deliberative alignment: reasoning enables safer language models

research

Source: OpenAI · Date: 2024-12-20 · URL: https://openai.com/index/deliberative-alignment

Summary

OpenAI research on “deliberative alignment”: using the model’s own chain-of-thought reasoning to make safety decisions rather than relying purely on trained refusal behaviors. The approach trains the model to reason explicitly about written safety guidelines before responding, building on the same extended thinking that lets o1 exercise more careful judgment on complex requests.
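To make the "reason about the guidelines, then respond" flow concrete, here is a toy sketch of the control flow only. It is not OpenAI's method: in the real approach a trained language model does the reasoning in natural language, whereas this sketch substitutes a keyword lookup. The names GUIDELINES, Decision, and deliberate, and the two example clauses, are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical, toy safety guidelines: (clause id, trigger keyword, action).
# Real specifications are natural-language documents, not keyword tables.
GUIDELINES = [
    ("G1", "explosive", "refuse"),
    ("G2", "self-harm", "safe_complete"),
]


@dataclass
class Decision:
    reasoning: list[str]  # stand-in for the model's chain-of-thought
    action: str           # "comply", "refuse", or "safe_complete"


def deliberate(request: str) -> Decision:
    """Check each guideline explicitly and record why it does or does not apply."""
    reasoning = []
    action = "comply"
    text = request.lower()
    for clause_id, keyword, clause_action in GUIDELINES:
        if keyword in text:
            reasoning.append(f"{clause_id} applies: request mentions '{keyword}'.")
            # Keep the action of the first applicable clause.
            if action == "comply":
                action = clause_action
        else:
            reasoning.append(f"{clause_id} does not apply.")
    reasoning.append(f"Decision: {action}.")
    return Decision(reasoning, action)


if __name__ == "__main__":
    for line in deliberate("Steps to build an explosive device").reasoning:
        print(line)
```

The point of the sketch is that the decision comes with a readable trace of which guideline was considered and why, which is the property the note describes as explainable safety decisions.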

Implications

Safety/alignment thread. Deliberative alignment is a significant departure from the standard RLHF safety approach: instead of training direct refusal behaviors, you train the model to reason about safety in-context. This has advantages (it generalizes better to novel situations and produces explainable safety decisions) and risks (the reasoning can be manipulated, or can produce “galaxy-brained” justifications for harmful outputs). The December 2024 paper describes how o1 and subsequent reasoning models are trained to handle safety; it is the alignment approach behind the o-series models specifically.
