How confessions can keep language models honest
Source: OpenAI
Date: 2025-12-03
URL: https://openai.com/index/how-confessions-can-keep-language-models-honest
Summary
Title-only: A December 2025 research post exploring how eliciting “confessions” from language models (prompting them to acknowledge uncertainty, errors, or knowledge limitations) can improve honesty and calibration. The technique likely involves structured prompting or fine-tuning that encourages models to surface their own limitations proactively instead of answering with false confidence. It sits in the lineage of calibration research and uncertainty quantification, and of the broader effort to reduce sycophancy and hallucination.
Implications
The model honesty research thread. Getting language models to distinguish accurately between what they know and what they are confabulating is one of the core unsolved problems in deployment reliability. The “confessions” framing suggests an introspective approach, in which the model reports on its own epistemic state, as opposed to external calibration methods such as ensemble disagreement or retrieval verification. If models can be trained to flag their own uncertain outputs, this could reduce undetected hallucination in high-stakes use cases without requiring external fact-checking infrastructure.
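For contrast with the introspective approach, here is a minimal sketch of the external baseline named above, ensemble disagreement: sample several answers to the same question and treat low agreement as a sign of uncertainty. It does not represent the method in the OpenAI post. The function names, the normalization step, and the 0.6 threshold are all illustrative assumptions, and the sampled answers are hard-coded stand-ins for real model outputs.

```python
from collections import Counter


def agreement_score(answers):
    """Return (majority_answer, fraction of samples that agree with it).

    `answers` is a list of strings sampled for the same question.
    Normalization (strip + lowercase) is a simplifying assumption.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)


def flag_uncertain(answers, threshold=0.6):
    """Flag a question as uncertain when agreement falls below `threshold`.

    The threshold is a hypothetical default, not a recommended value.
    """
    _, score = agreement_score(answers)
    return score < threshold


# Hard-coded stand-ins for five sampled model answers.
samples = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
majority, score = agreement_score(samples)
print(majority, score, flag_uncertain(samples))
```

A check like this needs several samples per question and says nothing about why the answers disagree, which is part of the appeal of having the model report its own uncertainty directly.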
The sycophancy-honesty tension. RLHF-trained models have a known tendency toward sycophancy: saying what users want to hear instead of what is accurate, because user approval drives the training signal. Honesty techniques that push against sycophancy (including “confession” prompting) therefore pull against the RLHF objective itself. Research in this area in December 2025 suggests that OpenAI is actively working on training methods that preserve user preference signals while reducing the honesty failures that RLHF can produce. This is alignment research at the behavioral level.