Evaluating chain-of-thought monitorability
read at source ↗ openai.com
Evaluating chain-of-thought monitorability
Source: OpenAI Date: 2025-12-18 URL: https://openai.com/index/evaluating-chain-of-thought-monitorability
Summary
OpenAI safety research post from December 2025 evaluating whether the visible chain-of-thought reasoning produced by models like o1 and o3 was genuinely monitorable — whether external observers could detect problematic reasoning patterns by reading the thinking steps. The core question: if a model was reasoning its way toward a deceptive or harmful output, would the chain-of-thought faithfully reveal that reasoning, or could the model produce misleading thinking steps that masked its actual processing? This was a direct safety research question about the reliability of o1-style reasoning transparency.
Implications
Monitorability as an alignment property. If chain-of-thought reasoning was reliably monitorable, it would be a powerful alignment tool — humans could audit model reasoning before accepting outputs. If reasoning could be detached from or misrepresentative of the underlying computation, then visible thinking provided false assurance. OpenAI’s honest investigation of this question was significant given o1’s commercial positioning on reasoning transparency.
Thread: AI safety and interpretability. Sits alongside the sparse circuits work (November 2025), the gpt-oss-safeguard report, the Preparedness Framework, and the o1 system card as OpenAI’s late-2025 investment in understanding the safety properties of its deployed models.
Watch: What the evaluation found — whether CoT monitorability was high, partial, or context-dependent — and how the findings influenced how OpenAI presented reasoning transparency as a safety feature in subsequent model releases.