Learning to reason with LLMs
Source: OpenAI Date: 2024-09-12 URL: https://openai.com/index/learning-to-reason-with-llms
Summary
OpenAI’s technical companion post to the o1 launch, published the same day as the o1 introduction (September 12, 2024) but more technically substantive, with more detail on the training methodology for the o-series reasoning models. The core contribution: using reinforcement learning on chain-of-thought reasoning traces to improve the quality of extended thinking. The model learns to reason better by being rewarded for correct reasoning steps, not just correct final answers.
Implications
RL on reasoning as the key insight. The o1 approach — RL training on the reasoning process itself, not just the output — is the technical breakthrough that distinguishes the o-series from prior chain-of-thought prompting approaches. Prompting a model to think step-by-step elicits existing reasoning capabilities; training on the quality of reasoning steps teaches the model to reason better. That’s a fundamentally different intervention.
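The difference between rewarding only the final answer and rewarding the reasoning steps can be made concrete with a toy sketch. This is not OpenAI's reward design, which the source does not disclose; the function names, the arithmetic "steps", and the hand-written `toy_step_scorer` (standing in for a learned step-level reward model) are all assumptions made for illustration.

```python
from typing import Callable, List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only signal: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Step-level signal: mean score over all reasoning steps.

    step_scorer is a hypothetical stand-in for a learned process reward model.
    """
    if not steps:
        return 0.0
    return sum(step_scorer(s) for s in steps) / len(steps)


def toy_step_scorer(step: str) -> float:
    """Hypothetical scorer: checks steps of the form 'a + b = c'."""
    try:
        lhs, rhs = step.split("=")
        a, b = lhs.split("+")
        return 1.0 if int(a) + int(b) == int(rhs) else 0.0
    except ValueError:
        # Step is not in the expected 'a + b = c' form.
        return 0.0


# Task: compute 2 + 3 + 5 (reference answer "10").
sound_trace = ["2 + 3 = 5", "5 + 5 = 10"]
lucky_trace = ["2 + 3 = 6", "6 + 5 = 10"]  # two errors that cancel out

for name, trace in [("sound", sound_trace), ("lucky", lucky_trace)]:
    final = trace[-1].split("=")[1]
    print(name, outcome_reward(final, "10"), process_reward(trace, toy_step_scorer))
```

Both traces earn the same outcome reward, but only the sound trace earns a step-level reward. An outcome-only signal cannot tell them apart, which is the gap that training on the quality of reasoning steps is meant to close.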
The training recipe question. This post revealed enough to confirm the RL-on-CoT approach but left the specifics of reward construction, including any process reward model (PRM), opaque. The subsequent DeepSeek R1 paper (January 2025) was partly valuable because it published implementation details that OpenAI’s post omitted.
Thread: reasoning model training. This is the foundational technical document for understanding why the o-series models behave differently from GPT-4o. Read alongside the chain-of-thought monitoring paper (March 2025) and the DeepSeek R1 technical report for a complete picture of the reasoning model training landscape.
Watch: Whether the RL-on-reasoning approach requires proprietary infrastructure or whether the core technique can be replicated with open models. DeepSeek R1’s success suggests the latter.