2025-01-31 · HuggingFace

Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial

models · research

Source: HuggingFace · Date: 2025-01-31 · URL: https://huggingface.co/blog/open-r1/mini-r1-contdown-game

Summary

Integration tutorial and research reproduction: Mini-R1 reproduces DeepSeek R1’s emergent reasoning behavior by applying GRPO (Group Relative Policy Optimization) to Qwen2.5-3B-Instruct on the Countdown arithmetic puzzle task (combine a given set of numbers with basic arithmetic to reach a target). Training ran on 4x H100 for about 6 hours and 450 steps. Progression: at step 50 the model learns the correct <think>...</think><answer>...</answer> format; at step 200 it shifts to trial-and-error reasoning; at step 450 it reaches about 50% success. Two reward functions drive training: a Format Reward and an Accuracy Reward. The checkpoint is released publicly.
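The two reward signals named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tutorial’s actual code: the function names `format_reward` and `accuracy_reward`, the 0/1 reward values, and the rule that each given number must be used exactly once are all assumptions made for the example.

```python
import ast
import operator
import re

# Hypothetical patterns and names for illustration; the tutorial's own
# reward functions may be structured differently.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _safe_eval(node):
    """Evaluate an arithmetic AST limited to +, -, *, / on numeric literals."""
    if isinstance(node, ast.Expression):
        return _safe_eval(node.body)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    raise ValueError("unsupported expression")


def format_reward(completion: str) -> float:
    """1.0 if the completion is <think>...</think><answer>...</answer>, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, numbers: list[int], target: int) -> float:
    """1.0 if the answer uses each given number once and equals the target."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Assumption: only digits, + - * /, parentheses and spaces are allowed.
    if not re.fullmatch(r"[\d+\-*/() ]+", equation):
        return 0.0
    # Assumption: every provided number must appear exactly once.
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.0
    try:
        value = _safe_eval(ast.parse(equation, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-5 else 0.0
```

Under these assumptions, a completion such as `<think>try 3*4</think><answer>(3 * 4) + 5</answer>` with numbers [3, 4, 5] and target 17 scores 1.0 on both checks. Parsing the equation with `ast` instead of calling `eval` keeps untrusted model output from executing arbitrary code.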

Implications

Model release cadence (reasoning). Reproducing DeepSeek R1’s “aha moment” (reasoning behavior that emerges during RL training) in about 6 hours on 4x H100 with a 3B model suggests that the core insight does not depend on scale. The Countdown task is a clean, verifiable domain where reward signals are unambiguous: an answer either reaches the target or it does not. Starting from an instruction-tuned model and applying GRPO with format and accuracy rewards is now a standard starting point for reasoning model development.
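The “group relative” part of GRPO can be illustrated with a small sketch: several completions are sampled for the same prompt, and each completion’s reward is compared to the group mean and scaled by the group’s standard deviation. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this example; real trainers may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example group: summed format + accuracy rewards for four sampled completions.
group = [2.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(group)
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason unambiguous, verifiable rewards matter.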

Open-weights ecosystem health. A fully reproducible open tutorial (training scripts, config files, reward functions, checkpoints) for the technique behind DeepSeek R1’s reasoning is a significant ecosystem contribution. Teams that want to apply GRPO-style RL to their own verifiable domains (math, code, tool use) now have a reference implementation that needs only 4x H100 and about 6 hours to confirm that it works before they scale up.
