How NuminaMath Won the 1st AIMO Progress Prize
Source: HuggingFace Date: 2024-07-11 URL: https://huggingface.co/blog/winning-aimo-progress-prize
Summary
Research summary and model release: NuminaMath (Numina and Hugging Face) won the 1st AIMO Progress Prize by solving 29 of 50 problems on the private test set. Base model: DeepSeekMath-Base 7B, fine-tuned in two stages: first chain-of-thought (CoT) training on ~500K math problems, then Tool-Integrated Reasoning (TIR) training on ~60K problems with code execution. Inference: the SC-TIR algorithm (self-consistency with TIR) samples 48 candidate solutions, gives each up to 4 rounds of code generation and execution feedback, and takes a majority vote over the final answers. MATH benchmark: NuminaMath-7B-TIR 68.2% vs GPT-4o 76.6% and DeepSeekMath-7B-RL 58.8%. Training: 8xH100, 10 hours per run.
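The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual implementation: the `generate` callable is a hypothetical stand-in for the model, the convention that final answers are integers wrapped in `\boxed{}` is an assumption of this sketch, and `run_python` uses an unsandboxed `exec` that would not be safe for real model output.

```python
import contextlib
import io
import re
import traceback
from collections import Counter
from typing import Callable, List, Optional

FENCE = "`" * 3  # built indirectly so this listing can itself sit inside a fence
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
ANSWER_RE = re.compile(r"\\boxed\{(-?\d+)\}")  # assumed answer format


def run_python(code: str) -> str:
    """Run a code string and return its stdout, or the traceback if it fails."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # illustration only: no sandbox, no timeout
    except Exception:
        return traceback.format_exc()
    return buf.getvalue()


def last_code_block(text: str) -> Optional[str]:
    """Return the last fenced Python block in text, if any."""
    blocks = CODE_RE.findall(text)
    return blocks[-1] if blocks else None


def sc_tir(problem: str, generate: Callable[[str], str],
           n: int = 48, depth: int = 4) -> Optional[int]:
    """Sample n trajectories, run up to `depth` code-feedback rounds, then vote.

    `generate` (hypothetical) maps the text so far to the next completion,
    assumed to stop after one code block or after a final boxed answer.
    """
    trajectories: List[str] = [problem] * n
    answers: List[int] = []
    for _ in range(depth):
        still_running: List[str] = []
        for traj in trajectories:
            completion = generate(traj)
            traj += completion
            boxed = ANSWER_RE.findall(completion)
            if boxed:  # trajectory finished with a final answer
                answers.append(int(boxed[-1]))
                continue
            code = last_code_block(completion)
            if code is None:  # neither code nor answer: prune this sample
                continue
            # Feed the execution output (or traceback) back for self-correction.
            traj += "\n" + FENCE + "output\n" + run_python(code) + FENCE + "\n"
            still_running.append(traj)
        trajectories = still_running
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]  # majority vote
```

Because tracebacks are appended to the trajectory just like normal output, a later round can see the error and try a corrected program. The majority vote then only counts trajectories that reached a final answer within the allowed depth.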
Implications
Model release cadence (reasoning). NuminaMath-7B-TIR scores 68.2% on MATH, 9.4 points above DeepSeekMath-7B-RL (58.8%), which is built on the same base model. The gap suggests that inference-time tool use (Python code execution with self-correction loops) adds substantial reasoning capability beyond fine-tuning alone. The SC-TIR algorithm (majority vote over 48 code-feedback trajectories) is the key inference technique.
Open-weights ecosystem health. A 7B model that comes within 8.4 points of GPT-4o on MATH (68.2% vs 76.6%) using open-source training infrastructure (DeepSeekMath base, HF training stack, 8xH100) is a compelling demonstration that frontier math capabilities are not gated by scale alone. The code execution feedback loop also makes the model’s intermediate computations checkable, a structural advantage for math that pure text generation cannot match.