2025-07-10 · HuggingFace

Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models

modelsresearch

Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models

Source: HuggingFace Date: 2025-07-10 URL: https://huggingface.co/blog/AI-MO/kimina-prover

Summary

Model release: Kimina-Prover-72B, 8B, and 1.7B from Moonshot AI/Project Numina — formal theorem provers built on Qwen with test-time RL search (TTRL) for lemma discovery. Error-fixing capability: model reads Lean 4 error messages and proposes targeted fixes. 16+16 attempt-and-fix strategy (35.6% pass) vs 32×1 brute-force (28.8%) on equal budgets. Benchmarks on miniF2F-test: 72B pass@1 63.9%, pass@32 84.0%, pass@1024 87.7%. With full TTRL search framework: 92.2% pass rate. Beats DeepSeek-Prover-V2-671B (61.9% pass@1, 86.6% pass@1024) with 9x fewer parameters at the 72B level.

Implications

Model release cadence (reasoning). Kimina-Prover-72B outperforming DeepSeek-Prover-V2-671B on miniF2F at 1/9 the parameter count is a significant efficiency claim for formal reasoning. The error-fixing capability — treating Lean 4 error messages as feedback for the next generation attempt — is a qualitatively different approach from pure sampling: it uses the verifier’s error signal to guide search rather than just increasing sample count.

Open-weights ecosystem health. TTRL as a test-time RL framework for lemma-enabled proof search is a generalizable technique beyond Lean 4 — the pattern of discovering, combining, and applying sub-problems recursively at inference time could apply to any domain with verifiable correctness. Project Numina (HF collaboration for math reasoning) releasing Kimina on HF continues the math reasoning open-source trajectory that started with NuminaMath.

← all signals