SmolVLM - small yet mighty Vision Language Model
Source: HuggingFace Date: 2024-11-26 URL: https://huggingface.co/blog/smolvlm
Summary
Model release: SmolVLM, a 2B-parameter VLM family (Base, Synthetic, Instruct) designed for on-device and consumer-GPU deployment, released under Apache 2.0. Architecture: SmolLM2 1.7B language backbone, SigLIP vision encoder, 9x visual compression, 16k context. Peak VRAM: 5.02 GB (vs 13.7 GB for Qwen2-VL 2B). Benchmarks: MMMU 38.8, DocVQA 81.6, TextVQA 72.7. Accuracy is below Qwen2-VL 2B, but generation throughput is 7.5-16x higher and prefill is 3.3-4.5x faster.
Implications
Open-weights ecosystem health. SmolVLM makes vision-language capability deployable on a single consumer GPU for the first time in the HuggingFace family: a peak of about 5 GB of VRAM fits on an RTX 3060 12GB or an M2 MacBook Pro. The accuracy-vs-speed trade-off (behind Qwen2-VL 2B on benchmarks, but up to 16x faster at generation) is the right one for latency-sensitive applications such as agents and interactive demos.
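As a rough illustration of the fit claim, the sketch below checks the peak-VRAM figures quoted in the Summary against a device's memory. The `fits` helper and its `headroom_gb` margin are hypothetical, introduced only for this example; real usage will vary with batch size, number of images, and numeric precision.

```python
# Peak VRAM figures as quoted in the Summary (GB).
PEAK_VRAM_GB = {"SmolVLM": 5.02, "Qwen2-VL 2B": 13.7}


def fits(model: str, device_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the model's peak VRAM plus a safety margin fits on the device.

    headroom_gb is an assumed margin for the display, other processes, etc.
    """
    return PEAK_VRAM_GB[model] + headroom_gb <= device_gb


def vram_ratio(a: str, b: str) -> float:
    """How many times more peak VRAM model `a` needs than model `b`."""
    return PEAK_VRAM_GB[a] / PEAK_VRAM_GB[b]


if __name__ == "__main__":
    for name in PEAK_VRAM_GB:
        print(name, "fits in 12 GB:", fits(name, 12.0))
    print("ratio:", round(vram_ratio("Qwen2-VL 2B", "SmolVLM"), 2))
```

On a 12 GB card with the assumed 1 GB margin, only the SmolVLM figure passes the check; the quoted Qwen2-VL 2B peak is about 2.7x larger.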
Model release cadence. The aggressive visual compression (9x, vs 4x in Idefics3) is the architectural choice that enables the small footprint: in the source post's test, a prompt with a single image encodes to about 1.2k tokens, against 16k for Qwen2-VL. This is the design point to watch as the field moves toward multi-image and video inputs on resource-constrained deployment targets: when tokens are the binding constraint, compression rate matters more than absolute accuracy.
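To make the token-budget argument concrete, here is a minimal sketch of the arithmetic. It assumes a 384x384 image tile and a 14-pixel patch size for the vision encoder (neither is stated in this summary), takes "16k context" to mean 16,384 tokens, and ignores any separator or special tokens. The function names are hypothetical.

```python
def visual_tokens_per_tile(tile_px: int = 384, patch_px: int = 14,
                           compression: int = 9) -> int:
    """Tokens one square tile costs after the encoder output is compressed.

    tile_px and patch_px are assumed values, not taken from the summary.
    """
    patches_per_side = tile_px // patch_px   # 384 // 14 = 27
    raw_tokens = patches_per_side ** 2       # 27 * 27 = 729
    return raw_tokens // compression         # integer division as a simplification


def max_tiles_in_context(context_tokens: int, compression: int,
                         text_tokens: int = 0) -> int:
    """How many tiles fit in the context after reserving room for text."""
    per_tile = visual_tokens_per_tile(compression=compression)
    return (context_tokens - text_tokens) // per_tile


if __name__ == "__main__":
    for rate in (9, 4):  # 9x (SmolVLM) vs 4x (Idefics3), as quoted above
        print(f"{rate}x: {visual_tokens_per_tile(compression=rate)} tokens/tile, "
              f"{max_tiles_in_context(16_384, rate)} tiles in a 16,384-token context")
```

Under these assumptions, 9x compression yields 81 tokens per tile versus 182 at 4x, so roughly twice as many tiles fit in the same context. A real image may be split into several tiles, so the per-image cost is a multiple of the per-tile figure.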