2025-09-10 · HuggingFace

Jupyter Agents: training LLMs to reason with notebooks

agents · models · research

Source: HuggingFace · Date: 2025-09-10 · URL: https://huggingface.co/blog/jupyter-agent-2

Summary

Research + model release from HF: Jupyter Agent, a fine-tuned Qwen3-4B specialized for data science notebook tasks. A 7-stage synthetic data pipeline turns 2TB of Kaggle notebooks into 51k curated examples (0.2B tokens), using Qwen3-32B for educational scoring and Qwen3-Coder-480B for trace generation. Results on the DABStep easy split: base Qwen3-4B at 38.67% → fine-tuned at 75%. Simplified scaffolding alone (without training) moves the baseline from 38.67% to 52.78%. Models, dataset, and code are all released.
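The three accuracy figures above can be split into a scaffolding gain and a fine-tuning gain. The following is a minimal sketch of that arithmetic. It assumes the 75% figure was measured with the same simplified scaffolding, which the summary does not state, and the variable and function names are illustrative only.

```python
# Accuracy on the DABStep easy split, as quoted in the summary (percent).
BASE = 38.67          # base Qwen3-4B
SCAFFOLDED = 52.78    # base model with simplified scaffolding, no training
FINE_TUNED = 75.0     # fine-tuned model (assumed to use the same scaffolding)


def gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


gains = {
    "scaffolding_only": gain(BASE, SCAFFOLDED),
    "fine_tuning_over_scaffolding": gain(SCAFFOLDED, FINE_TUNED),
    "total": gain(BASE, FINE_TUNED),
}

for name, points in gains.items():
    print(f"{name}: {points:+.2f} points")
```

Under that assumption, roughly 14 of the 36 points come from scaffolding and roughly 22 from fine-tuning.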

Implications

Thread: transformers library trajectory / open-weights ecosystem health. Fine-tuning on task-specific synthetic data lifts the easy-split score by about 36 points over the base model (38.67% → 75%), or about 22 points over the base model with simplified scaffolding (52.78% → 75%). This reinforces a pattern seen across this batch: domain fine-tuning at small scale beats larger general models. The 7-stage pipeline, particularly the finding that deduplication removes close to 90% of the raw data (2TB → 250GB), is production data engineering knowledge applicable to any synthetic notebook-based training. The scaffolding-alone improvement (38.67% → 52.78%) matters for practitioners who can’t fine-tune: good scaffolding can recover substantial capability from a base model without any training. Watch whether jupyter-agent becomes a template for task-specific coding agents.
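To make the deduplication figure concrete, here is a minimal sketch of exact-duplicate removal by content hash, followed by the reduction implied by the quoted sizes. The hashing approach is an assumption for illustration, not necessarily the method the pipeline uses, and the toy corpus is hypothetical.

```python
import hashlib


def dedup_exact(docs):
    """Keep the first occurrence of each distinct document (exact content match)."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def reduction(before: float, after: float) -> float:
    """Fraction of the data removed."""
    return 1 - after / before


# Hypothetical toy corpus standing in for notebook contents.
notebooks = ["import pandas as pd", "import pandas as pd", "df.head()", "import pandas as pd"]
unique = dedup_exact(notebooks)

# Sizes quoted in the text, taking 2 TB as 2000 GB.
quoted = reduction(2000, 250)
print(len(unique), f"{quoted:.1%}")
```

With 2 TB taken as 2000 GB, the quoted sizes correspond to an 87.5% reduction, which the text rounds to 90%.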
