SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data
Source: HuggingFace Date: 2025-06-03 URL: https://huggingface.co/blog/smolvla
Summary
Model release: SmolVLA-450M, a lightweight vision-language-action (VLA) model for robotics trained on 487 community LeRobot datasets (~10M frames, fewer than 30K episodes). Architecture: SmolVLM2-500M backbone, of which only half the layers are used, plus a 100M flow-matching action expert. Achieves 78.3% success on real-world SO100 tasks. Asynchronous inference: about 30% faster task completion (9.7s vs 13.75s) and roughly 2x task throughput in a fixed time window. Runs on consumer hardware, including a MacBook CPU, and can be fine-tuned on a single consumer GPU.
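As a quick sanity check on the timing figures above, the short calculation below works out what 9.7s versus 13.75s amounts to. It suggests that this pair of numbers corresponds to the roughly 30% improvement in completion time rather than to the 2x throughput figure, which is a separate measurement. The variable names are illustrative.

```python
# Completion times quoted in the summary (seconds).
sync_time_s = 13.75   # synchronous inference
async_time_s = 9.7    # asynchronous inference

# Fraction of the completion time saved by going asynchronous.
time_reduction = (sync_time_s - async_time_s) / sync_time_s

# Equivalent speedup factor on a single task.
speedup = sync_time_s / async_time_s

print(f"time reduction: {time_reduction:.1%}")  # about 29.5%
print(f"speedup: {speedup:.2f}x")               # about 1.42x
```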
Implications
Thread: open-weights ecosystem health. SmolVLA demonstrates that community-curated data at modest scale (fewer than 30K episodes) is sufficient to train a competitive robotics VLA. Cutting the visual input to 64 tokens per frame and using only half of the VLM's layers are aggressive choices, but apparently effective ones: the architecture deliberately trades model capacity for lower compute cost and broader accessibility. The asynchronous inference pattern (action prediction decoupled from the control loop), with its 2x throughput gain, is a key practical contribution: robot control is a real-time setting where latency directly limits task performance. SmolVLA, SO100, and the LeRobot dataset pipeline are becoming a coherent open robotics research stack.
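One way to picture why decoupling action prediction from the control loop raises throughput is the toy discrete-time simulation below. It is a minimal sketch, not SmolVLA's or LeRobot's implementation: the function `run`, the queue-threshold trigger, and all numbers (chunk size, latency, tick counts) are hypothetical and chosen only for illustration. With a threshold of zero the robot waits for each new action chunk once its queue is empty; with a positive threshold the next chunk is requested early, so inference overlaps with execution.

```python
def run(total_ticks, chunk_size, latency_ticks, threshold):
    """Simulate a control loop and return the number of actions executed.

    threshold = 0 models synchronous inference: a new chunk is requested
    only once the action queue is empty. A positive threshold models
    asynchronous inference: the request is issued while actions remain.
    All parameters are hypothetical, not measured SmolVLA values.
    """
    queue = 0        # actions ready to execute
    pending = None   # ticks left until the requested chunk arrives
    executed = 0
    for _ in range(total_ticks):
        # Request a new chunk when the queue is at or below the threshold.
        if pending is None and queue <= threshold:
            pending = latency_ticks
        # Execute one action per tick if available; otherwise the robot idles.
        if queue > 0:
            queue -= 1
            executed += 1
        # Advance the in-flight inference request.
        if pending is not None:
            pending -= 1
            if pending == 0:
                # Simplification: the new chunk is appended. A real system
                # would reconcile it with any actions still in the queue.
                queue += chunk_size
                pending = None
    return executed


sync_actions = run(total_ticks=60, chunk_size=10, latency_ticks=5, threshold=0)
async_actions = run(total_ticks=60, chunk_size=10, latency_ticks=5, threshold=5)
print(sync_actions, async_actions)  # 40 55
```

In this toy setup the synchronous loop idles for five ticks in every fifteen, while the asynchronous loop idles only during the initial warm-up. The size of the gain depends entirely on the assumed latency and chunk size, so the ratio here should not be read as a reproduction of the reported 2x figure.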