2025-01-23 · HuggingFace

SmolVLM Grows Smaller – Introducing the 256M & 500M Models!

Source: HuggingFace Date: 2025-01-23 URL: https://huggingface.co/blog/smolervlm

Summary

Model release: SmolVLM-256M and SmolVLM-500M, two new vision-language models (VLMs) far smaller than the earlier 2B SmolVLM. Architecture changes vs the 2B model: a smaller SigLIP vision encoder (93M-parameter base patch-16/512, replacing the 400M SO variant), a higher input image resolution, and a more aggressive pixel shuffle that encodes 4096 pixels per visual token (vs 1820). Hugging Face reports that the 256M model outperforms Idefics 80B, released 17 months earlier. Both sizes ship in base and instruction-tuned variants and are compatible with Transformers, MLX, and ONNX. The release also includes ColSmolVLM variants for multimodal retrieval and WebGPU demos for in-browser inference.
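The pixels-per-token figures can be reproduced with simple arithmetic. The sketch below is illustrative only: the 512-pixel side and 16-pixel patch come from the encoder name in the summary, while the 2B model's 384-pixel side, 14-pixel patch, and the shuffle ratios of 4 and 3 are assumptions not stated here. The helper names are hypothetical.

```python
def visual_tokens(image_side: int, patch_side: int, shuffle_ratio: int) -> int:
    """Tokens left after patchifying a square image and merging r x r patch blocks."""
    patches_per_side = image_side // patch_side
    return (patches_per_side // shuffle_ratio) ** 2


def pixels_per_token(image_side: int, patch_side: int, shuffle_ratio: int) -> float:
    """Image pixels represented by each visual token passed to the language model."""
    return image_side ** 2 / visual_tokens(image_side, patch_side, shuffle_ratio)


# Small models: 512 px side and patch 16 (from the summary); ratio 4 is an assumption.
print(visual_tokens(512, 16, 4), pixels_per_token(512, 16, 4))  # 64 4096.0

# 2B model: 384 px side, patch 14, ratio 3 are all assumptions.
print(visual_tokens(384, 14, 3), round(pixels_per_token(384, 14, 3)))  # 81 1820
```

Under these assumptions a 512x512 crop costs 64 visual tokens in the small models, versus 81 tokens for a smaller 384x384 crop in the 2B model.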

Implications

Open-weights ecosystem health. A 256M VLM reportedly outperforming an 80B model released 17 months earlier, with roughly 300 times fewer parameters, illustrates how quickly efficiency is improving in open-weights vision-language models. The WebGPU demo path means SmolVLM-256M can run entirely in-browser, a capability threshold that enables client-side vision processing without a server-side inference backend.

Model release cadence. Encoding more pixels into each visual token (4096 vs 1820) while using a smaller vision encoder is the key compression trade-off behind the 256M and 500M footprints: fewer visual tokens reach the language model per image, at the cost of packing more image content into each one. This design pattern, more aggressive visual tokenization with a lighter backbone, appears to be where the field is heading for edge and mobile deployment, and SmolVLM is an open reference implementation of it.
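For readers unfamiliar with pixel shuffle, here is a minimal pure-Python sketch of the general idea (a space-to-depth rearrangement): each r x r block of patch features is merged into one longer feature vector, so the token count drops by r² while the feature width grows by r². It is not SmolVLM's actual implementation; the function name, the 2-dimensional toy features, and the ratio of 4 are assumptions chosen for illustration.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def pixel_shuffle(grid: Grid, ratio: int) -> Grid:
    """Merge each ratio x ratio block of feature vectors into one longer vector."""
    rows, cols = len(grid), len(grid[0])
    if rows % ratio or cols % ratio:
        raise ValueError("grid sides must be divisible by ratio")
    merged: Grid = []
    for r in range(0, rows, ratio):
        merged_row = []
        for c in range(0, cols, ratio):
            vec: List[float] = []
            # Concatenate the features of every patch in this block.
            for dr in range(ratio):
                for dc in range(ratio):
                    vec.extend(grid[r + dr][c + dc])
            merged_row.append(vec)
        merged.append(merged_row)
    return merged


# Toy 32 x 32 patch grid (512 / 16 per side) with hypothetical 2-dim features.
grid = [[[float(r), float(c)] for c in range(32)] for r in range(32)]
out = pixel_shuffle(grid, 4)
print(len(grid) * len(grid[0]), "->", len(out) * len(out[0]))  # 1024 -> 64
print(len(grid[0][0]), "->", len(out[0][0]))  # 2 -> 32
```

No information is discarded by the rearrangement itself; the compression comes from the language model seeing 16 times fewer, wider tokens.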
