2025-02-02 · HuggingFace

Open-R1: Update #1

Source: HuggingFace · Date: 2025-02-02 · URL: https://huggingface.co/blog/open-r1/update-1

Summary

Project progress update. The Open-R1 team reproduces DeepSeek's reported MATH-500 scores for the R1 distilled models to within about one point (R1-Distill-Qwen-7B: 91.8 vs. reported 92.8; R1-Distill-Qwen-32B: 95.0 vs. reported 94.3). GRPO lands in TRL 0.14 with DeepSpeed ZeRO support, vLLM integration, and a clean training API. Synthetic data generation scaled to 4 nodes of 8×H100 (32 GPUs) using streaming requests. Community datasets released: OpenThoughts-114k, Bespoke-Stratos-17k, Magpie-Reasoning-V2-250K. Key finding: DeepSeek-R1 responses average about 6,000 tokens, with some exceeding 20,000, which makes efficient generation and training a significant infrastructure challenge.

Implications

Open-weights ecosystem health. GRPO in TRL 0.14 is the most consequential part of this update: it makes the reinforcement learning technique used to train DeepSeek-R1 available in the standard HF training stack through a simple Python API. Combined with the community reasoning datasets (roughly 380k samples in total, going by the sizes in the dataset names), the main ingredients of an R1-style training pipeline are now within reach of any team with access to multi-GPU compute, not just the original lab.
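
To give a sense of what the "simple Python API" looks like, here is a minimal sketch modeled on the TRL GRPO quickstart. The model name, dataset name, output directory, and the toy length-based reward are illustrative assumptions, not the setup used by DeepSeek or the Open-R1 team; the TRL and datasets imports sit inside build_trainer so the reward function can be inspected without either library installed.

```python
def reward_len(completions, **kwargs):
    """Toy reward (assumption): prefer completions close to 20 characters."""
    return [float(-abs(20 - len(c))) for c in completions]


def build_trainer():
    """Assemble a GRPO trainer; requires `trl>=0.14` and `datasets`."""
    # Imported lazily so the reward function above works without TRL installed.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    # Illustrative dataset and model choices, not taken from the update itself.
    dataset = load_dataset("trl-lib/tldr", split="train")
    args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
    return GRPOTrainer(
        model="Qwen/Qwen2-0.5B-Instruct",
        reward_funcs=reward_len,  # one callable or a list of callables
        args=args,
        train_dataset=dataset,
    )
```

Calling build_trainer().train() would start training. A reasoning-focused run would swap reward_len for a reward that checks answer correctness or output format.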

Model release cadence: reasoning thread. The 6,000-token average response length is a practical constraint that shapes infrastructure requirements for reasoning-model training. It motivated the streaming inference optimization (which avoids GPU KV-cache preemption) that enabled scale-out to 32 GPUs. Teams attempting to reproduce or extend R1-style training should size their inference compute around responses of this length.
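
A back-of-the-envelope sizing sketch shows how quickly these lengths add up. The 6,000-token average and 20,000-token upper figure come from the update; the prompt count and per-GPU throughput below are hypothetical placeholders that should be replaced with measured values.

```python
def generation_budget(num_prompts, samples_per_prompt,
                      avg_tokens=6_000, long_tokens=20_000):
    """Return (expected_total_tokens, long_response_total_tokens)."""
    n = num_prompts * samples_per_prompt
    return n * avg_tokens, n * long_tokens


def gpu_hours(total_tokens, tokens_per_sec_per_gpu):
    """GPU-hours needed to generate total_tokens at a given throughput."""
    return total_tokens / tokens_per_sec_per_gpu / 3600


# Hypothetical workload: 100,000 prompts, one sample each.
expected, pessimistic = generation_budget(100_000, 1)

# Hypothetical throughput of 1,000 generated tokens/s per GPU (an assumption).
hours = gpu_hours(expected, tokens_per_sec_per_gpu=1_000)
wall_clock_on_32_gpus = hours / 32
```

Under these assumptions the expected budget is 600 million generated tokens, about 167 GPU-hours, or a little over five hours of wall-clock time on 32 GPUs if scaling were perfectly linear.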
