2024-07-24 · OpenAI

Improving Model Safety Behavior with Rule-Based Rewards

research

Source: OpenAI · Date: 2024-07-24 · URL: https://openai.com/index/improving-model-safety-behavior-with-rule-based-rewards

Summary

OpenAI published research on rule-based rewards (RBRs) for improving model safety behavior. In this technique, explicit signals of rule satisfaction are added to the reinforcement learning reward rather than relying solely on human preference data. The aim is to make safety-relevant behaviors more robust and consistent than RLHF alone achieves.
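As a rough illustration of how a rule signal can be folded into the reward, the sketch below adds a weighted sum of rule scores to a preference-model score. It is a minimal sketch under stated assumptions, not OpenAI's implementation: the `Rule` type, the keyword checks, the weights, and the example score are all hypothetical, and a real system would likely score rules with a grader model rather than with string matching.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    """One explicit rule (hypothetical structure, for illustration only)."""
    name: str
    check: Callable[[str, str], float]  # (prompt, completion) -> score in [0, 1]
    weight: float                       # positive rewards the trait, negative penalizes it


def has_brief_apology(prompt: str, completion: str) -> float:
    # Keyword stand-in for a grader that detects a short apology.
    return 1.0 if "sorry" in completion.lower() else 0.0


def is_judgmental(prompt: str, completion: str) -> float:
    # Keyword stand-in for a grader that detects shaming language.
    return 1.0 if "ashamed" in completion.lower() else 0.0


# Hypothetical rule set and weights; real weights would need to be tuned.
RULES = (
    Rule("brief_apology", has_brief_apology, 1.0),
    Rule("judgmental", is_judgmental, -2.0),
)


def combined_reward(
    prompt: str,
    completion: str,
    preference_score: float,
    rules: Sequence[Rule] = RULES,
) -> float:
    """Add the weighted rule scores to the preference-model score."""
    rule_term = sum(r.weight * r.check(prompt, completion) for r in rules)
    return preference_score + rule_term


PROMPT = "an unsafe request"
GOOD = "I'm sorry, but I can't help with that."
BAD = "You should be ashamed for asking. I won't help."

# Both completions get the same assumed preference score, so any
# difference in the total comes from the rule term alone.
good_reward = combined_reward(PROMPT, GOOD, preference_score=0.25)
bad_reward = combined_reward(PROMPT, BAD, preference_score=0.25)
```

With identical preference scores, the polite refusal ends up with the higher total reward, which is the effect the rule term is meant to produce when preference data alone does not separate the two responses.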

Implications

Safety/alignment thread. Rule-based rewards sit between constitution-style approaches (written principles applied through AI-generated feedback) and pure RLHF (preference learning without explicit rules). The core problem is that RLHF-trained safety behaviors can be brittle: models learn to satisfy the shape of human preferences rather than the underlying rules, and adversarial prompting can expose the gap. RBR responds to that brittleness at the level of the training signal, not the model architecture. This research feeds directly into the alignment approaches visible in GPT-4o and o-series models; the technique matters more than the specific publication.
