2025-01-24 · HuggingFace

We now support VLMs in smolagents!

agents · models

Source: HuggingFace · Date: 2025-01-24 · URL: https://huggingface.co/blog/smolagents-can-see

Summary

Library update: smolagents adds vision-language model (VLM) support, letting agents process visual information in two modes: static (images passed once at task start) and dynamic (images injected at each ReAct loop step via callbacks). The post demonstrates a web-browsing agent built on Qwen2-VL-72B (served via Fireworks) with Selenium/Helium; the agent navigates GitHub Trending and extracts metrics. VLM support works with the TransformersModel, OpenAIServerModel, and LiteLLMModel backends. The post gives no benchmark numbers and notes that success is model-dependent: Qwen2-VL-72B and GPT-4o work, and SmolVLM can be used locally.

Implications

Model release cadence (agent reasoning). The dynamic callback pattern, in which a screenshot is injected into the agent's observation at each step, is the right architecture for computer-use agents. Implemented through step_callbacks in the ReAct loop, it is minimal and composable; teams building browser or GUI agents should adopt this pattern rather than roll custom image-to-text preprocessing.
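
The two modes can be illustrated with a small standard-library sketch. This is not the smolagents API: ToyAgent, Step, fake_screenshot, and attach_screenshot are hypothetical names invented for illustration, and the rule of dropping screenshots older than the previous step is an assumption about how one might keep the context small. Only the overall shape (images passed once at run time, plus callbacks invoked after each step) follows the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One loop iteration: what the agent did and which images it observed."""
    number: int
    action: str
    observation_images: Optional[List[bytes]] = None


@dataclass
class ToyAgent:
    """Hypothetical stand-in for a ReAct-style agent (not the smolagents class)."""
    step_callbacks: List[Callable[[Step, "ToyAgent"], None]] = field(default_factory=list)
    steps: List[Step] = field(default_factory=list)

    def run(self, task: str, images: Optional[List[bytes]] = None,
            max_steps: int = 3) -> List[Step]:
        # Static mode: images are supplied once and attached to the task step.
        self.steps = [Step(number=0, action=f"task: {task}", observation_images=images)]
        for n in range(1, max_steps + 1):
            step = Step(number=n, action=f"action {n}")
            self.steps.append(step)
            # Dynamic mode: each callback runs after the step and may attach images.
            for callback in self.step_callbacks:
                callback(step, self)
        return self.steps


def fake_screenshot(step_number: int) -> bytes:
    """Placeholder for a real browser screenshot (e.g. one taken via Selenium)."""
    return f"png-bytes-for-step-{step_number}".encode()


def attach_screenshot(step: Step, agent: ToyAgent) -> None:
    # Assumption: drop screenshots from steps at least two behind the current one.
    for old in agent.steps:
        if 0 < old.number <= step.number - 2:
            old.observation_images = None
    step.observation_images = [fake_screenshot(step.number)]


agent = ToyAgent(step_callbacks=[attach_screenshot])
history = agent.run("find the top trending repo", images=[b"initial-image"])
```

After the run, the task step still carries the statically supplied image, while only the two most recent action steps keep their screenshots. Because the callback is an ordinary function, swapping the screenshot source or the pruning rule does not require touching the loop itself, which is the composability the paragraph above refers to.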

HF as open-source ML hub. With VLM support, smolagents becomes a serious alternative to LangChain/LangGraph for teams building vision-capable agents on open-weights models. The local SmolVLM option means the full browser-agent stack can run without any external API, a meaningful capability for teams that need air-gapped or cost-controlled agent deployments.