2025-05-12 · HuggingFace

Vision Language Models (Better, faster, stronger)

Source: HuggingFace
Date: 2025-05-12
URL: https://huggingface.co/blog/vlms-2025

Summary

A comprehensive HF survey of the VLM landscape from April 2024 to May 2025. It catalogs new architectures (any-to-any models, reasoning VLMs, sub-2B small models, MoE decoders, vision-language-action models for robotics), specialized capabilities (object detection, multimodal safety, multimodal RAG), and training methods, including DPO for VLMs with TRL. Highlights include Qwen 2.5-VL’s object grounding, SmolVLM2’s small footprint, GR00T N1 for robot policies, and smolagents gaining vision support.
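As a rough illustration of the DPO objective mentioned above, the sketch below computes the standard DPO loss for a single preference pair. It is not TRL's API and is not taken from the post. The function name `dpo_loss`, its parameters, and the example log-probabilities are all hypothetical. For a VLM, the prompt would include both the image and the text, and each log-probability would be the summed token log-likelihood of a response under the policy or the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is a sequence log-probability log p(y | x); for a VLM,
    x is assumed to be the image plus the text prompt (illustrative only).
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Hypothetical log-probabilities, chosen only to show the loss direction.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy shifted toward chosen
print(baseline, improved)
```

When the policy matches the reference, the loss equals log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one, with `beta` controlling how strongly deviations from the reference are weighted.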

Implications

Open-weights ecosystem health. The breadth of the recap, from sub-2B efficiency models to 72B reasoning VLMs, suggests the open-weights tier has substantially narrowed the gap with proprietary models on vision tasks over the year. The robotics VLA category (π0, GR00T N1) is now a live branch of open-weights development.

Transformers library trajectory. Multiple practitioner workflows (DPO training for VLMs with TRL, vision support in smolagents, multimodal RAG patterns) are now documented with code in the HF ecosystem, which suggests the libraries around Transformers are absorbing these workflows faster than they did a year ago.

HF as open-source ML hub. A post of this scope, covering one year of VLM development in one place with runnable code, signals HF’s editorial function: the hub as curriculum, not just registry.
