Smol2Operator: Post-Training GUI Agents for Computer Use
Source: HuggingFace Date: 2025-09-23 URL: https://huggingface.co/blog/smol2operator
Summary
Model release and research summary: Smol2Operator is SmolVLM2-2.2B fine-tuned for GUI automation via a two-phase pipeline. Phase 1: SFT on 459K GUI grounding samples raises ScreenSpot-v2 accuracy from 0.47% to 41.27% (+40.80 percentage points). Phase 2: SFT on 784K agentic reasoning samples raises it to 61.71% (+20.44 points). Key finding: 1152px image resolution with normalized (0-1) coordinates performed best, ahead of pixel coordinates. The action space is unified across mobile and desktop datasets. Bonus: a 460M nanoVLM reaches ~58% on ScreenSpot-v2 with the same methodology. All training data, model, and code are released.
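To make the normalized-coordinate idea concrete, here is a minimal sketch of converting between pixel and 0-1 coordinates. The `Point` type and the `normalize` / `denormalize` helpers are illustrative names, not taken from the Smol2Operator code, and the screen sizes are arbitrary examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


def normalize(px: Point, width: int, height: int) -> Point:
    """Map a pixel coordinate to the [0, 1] range relative to the image size."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return Point(px.x / width, px.y / height)


def denormalize(norm: Point, width: int, height: int) -> Point:
    """Map a normalized coordinate back to whole pixels for a given screen size."""
    return Point(round(norm.x * width), round(norm.y * height))


# The same normalized target lands on the same relative spot at any resolution.
click = normalize(Point(960, 540), width=1920, height=1080)
print(click)                          # Point(x=0.5, y=0.5)
print(denormalize(click, 1152, 648))  # Point(x=576, y=324)
```

One plausible reason this helps is that a normalized target does not change when the screenshot is resized, whereas a pixel target has to be rescaled for every resolution.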
Implications
Model release cadence (agent reasoning). Going from 0.47% to 61.71% on GUI grounding (ScreenSpot-v2) through SFT alone, with no RL and no novel architecture, indicates that training data quality and action space normalization are primary levers for GUI agent capability. The two-phase structure (grounding first, then agentic reasoning) is a reproducible recipe that other teams can apply to other VLM bases.
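As an illustration of what action space normalization can look like in practice, the sketch below maps dataset-specific action records onto one shared vocabulary. The alias table, the record fields, and the `unify` function are all hypothetical and are not the converter released with Smol2Operator.

```python
# Hypothetical per-dataset action names mapped to one shared vocabulary.
ALIASES = {
    "tap": "click",
    "left_click": "click",
    "input_text": "type",
    "write": "type",
}


def unify(action: dict, width: int, height: int) -> dict:
    """Rename the action and express any coordinates relative to screen size."""
    out = {"name": ALIASES.get(action["name"], action["name"])}
    if "x" in action and "y" in action:
        out["x"] = round(action["x"] / width, 4)
        out["y"] = round(action["y"] / height, 4)
    if "text" in action:
        out["text"] = action["text"]
    return out


# A mobile tap and a desktop click on the screen centre become the same record.
mobile = unify({"name": "tap", "x": 540, "y": 1200}, width=1080, height=2400)
desktop = unify({"name": "left_click", "x": 960, "y": 540}, width=1920, height=1080)
print(mobile == desktop)  # True
```

Collapsing equivalent actions this way means samples from different sources reinforce the same output format instead of competing with each other.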
Open-weights ecosystem health. A 460M model reaching ~58% on ScreenSpot-v2 with the same methodology as the 2.2B model is the headline for hardware-constrained deployment: viable GUI agents at 460M parameters open computer-use automation to devices that couldn't run a 2B+ model. The complete release of the dataset and training code makes this the reference implementation for open-weights GUI agents.