2024-07-17 · OpenAI

Prover-Verifier Games improve legibility of language model outputs

research

Source: OpenAI · Date: 2024-07-17 · URL: https://openai.com/index/prover-verifier-games-improve-legibility

Summary

OpenAI published research on Prover-Verifier Games, a technique for improving the legibility of language model outputs. In this game-theoretic setup, a prover model is trained to produce solutions that a verifier model can check, which incentivizes the prover to generate transparent, auditable reasoning rather than opaque outputs that merely happen to be correct.

Implications

Safety/alignment thread. Legibility of model outputs is a scalable oversight problem: if models reach correct answers through reasoning humans can’t follow, human verification breaks down as models become more capable. Prover-Verifier Games are a principled way to incentivize checkable outputs: the prover is rewarded only when the verifier can confirm correctness, which creates pressure toward transparent reasoning. This connects directly to chain-of-thought transparency research and to concerns about hidden reasoning in o-series models. The game-theoretic framing is elegant because it doesn’t require humans to specify what “legible” means; the verification dynamic shapes it emergently.
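
To make the incentive concrete, here is a toy sketch of the reward rule described above: the prover earns reward only when its answer is correct and the verifier accepts it. This is an illustration under loud assumptions, not OpenAI's training code. The names Solution, toy_verifier_score, and prover_reward are hypothetical, and the "verifier" is a stand-in heuristic (it counts lines containing an explicit "=" step) rather than a learned model.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    text: str
    is_correct: bool  # assumption: a ground-truth label is available in training


def toy_verifier_score(solution: Solution) -> float:
    """Hypothetical stand-in for a verifier model.

    Returns the fraction of non-empty lines that contain an explicit,
    checkable step (here: any line with "="). A real verifier would be
    a trained model, not a string heuristic.
    """
    lines = [ln for ln in solution.text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    checkable = sum(1 for ln in lines if "=" in ln)
    return checkable / len(lines)


def prover_reward(solution: Solution, threshold: float = 0.5) -> float:
    """Reward 1.0 only if the solution is correct AND the verifier accepts it."""
    accepted = toy_verifier_score(solution) >= threshold
    return 1.0 if (solution.is_correct and accepted) else 0.0


legible = Solution("3 + 4 = 7\n7 * 2 = 14", is_correct=True)
opaque = Solution("The answer is 14.", is_correct=True)
wrong = Solution("3 + 4 = 8\n8 * 2 = 16", is_correct=False)

for name, sol in [("legible", legible), ("opaque", opaque), ("wrong", wrong)]:
    print(name, prover_reward(sol))
```

Under this rule, the correct but opaque answer earns nothing, so a prover optimizing the reward is pushed toward showing steps the verifier can check. The acceptance threshold of 0.5 is an arbitrary illustrative choice.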
