Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models
Source: HuggingFace Date: 2025-11-19 URL: https://huggingface.co/blog/ServiceNow-AI/apriel-h1
Summary
Model release and research summary: ServiceNow distills a 15B full-attention reasoning model into Apriel-H1, a Mamba hybrid, achieving 2.1x throughput with minimal quality loss. Key finding: distillation works on high-quality reasoning SFT traces, not on pretraining data. Reasoning patterns are fragile and transfer only when the training data contains explicit, correct examples. Attention layers are converted in stages (25→40 Mamba layers) using leave-one-out (LOO) analysis and reverse KL distillation. Reported scores, Apriel-H1 vs teacher: MATH500 0.92 vs 0.90; AIME24 0.65 vs 0.70. Benchmarks include MATH500, GSM8k, GPQA, and MTBench.
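To make the reverse KL objective concrete, here is a minimal sketch in plain Python. It is not the Apriel-H1 training code: the function names and the toy logits are assumptions for illustration, and a real pipeline would compute this per token over the full vocabulary in a tensor library. Reverse KL is KL(student || teacher), so the expectation is taken under the student's distribution, which penalizes the student for putting mass where the teacher has little.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * (math.log(pi) - math.log(max(qi, eps)))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): expectation under the student."""
    return kl_divergence(softmax(student_logits), softmax(teacher_logits))

def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student): expectation under the teacher."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# Toy example (hypothetical logits): the teacher splits its mass over two
# tokens, while the student commits to only one of them.
teacher = [5.0, 5.0, -5.0]
student = [5.0, -5.0, -5.0]

print("reverse KL:", reverse_kl(student, teacher))
print("forward KL:", forward_kl(student, teacher))
```

In this toy case the reverse KL is much smaller than the forward KL: a student that commits to one of the teacher's modes is penalized lightly under reverse KL but heavily under forward KL, which is why reverse KL is usually described as mode-seeking.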
Implications
Thread: open-weights ecosystem health / model release cadence. The distillation-on-SFT-traces insight is the genuinely surprising finding: it inverts the intuition that distillation benefits from broad pretraining data. Reasoning capability is concentrated in specific attention patterns that only transfer when the training signal explicitly demonstrates them. The 2.1x throughput gain from attention-to-Mamba conversion is large enough to matter in production. vLLM integration is blocked on legal review, a recurring friction point for deploying hybrid architectures: the ecosystem is not yet ready to treat SSM hybrids as a first-class inference target.