2025-08-07 · HuggingFace

Vision Language Model Alignment in TRL ⚡️

Source: HuggingFace · Date: 2025-08-07 · URL: https://huggingface.co/blog/trl-vlm-alignment

Summary

HuggingFace’s TRL library now supports alignment techniques for Vision Language Models beyond basic SFT and DPO. The post walks through Mixed Preference Optimization (MPO), GRPO, and GSPO. MPO combines multiple loss signals in a single objective; GRPO scores each sampled completion relative to the other completions drawn for the same prompt; GSPO applies the importance ratio at the sequence level rather than per token. The stated aim is to address distribution shift and reward noise in multimodal reasoning. Reported results: MPO is credited with a +6.2-point gain on MathVista over its baseline, and GRPO and GSPO are said to do better still on geometric reasoning tasks, though no figures are given for them.
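The three ideas named above can be illustrated with a minimal, library-free sketch. This is not TRL's implementation: the function names are hypothetical, the loss weights are illustrative placeholders, and using the population standard deviation for the group normalization is an assumption made for simplicity.

```python
import math
from statistics import mean, pstdev


def mpo_style_loss(preference, quality, generation, weights=(0.8, 0.2, 1.0)):
    """Weighted sum of three already-computed loss terms.

    The weights are illustrative placeholders, not values taken from this note.
    """
    w_pref, w_qual, w_gen = weights
    return w_pref * preference + w_qual * quality + w_gen * generation


def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: normalize each reward against its own group.

    `rewards` holds one scalar per completion sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_importance_ratio(new_logps, old_logps):
    """GSPO-style ratio: a single length-normalized ratio per sequence.

    Inputs are per-token log-probabilities of the same completion under the
    current policy and the policy that generated it.
    """
    if len(new_logps) != len(old_logps) or not new_logps:
        raise ValueError("expected two non-empty lists of equal length")
    log_ratio = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(log_ratio / len(new_logps))


if __name__ == "__main__":
    # Four completions for one prompt, scored by a binary reward.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Identical policies give a ratio of exactly 1.
    print(sequence_importance_ratio([-1.0, -2.0], [-1.0, -2.0]))
```

The group normalization means a completion is rewarded only for beating its siblings on the same prompt, and the sequence-level ratio replaces the per-token ratios with one number per completion.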

Implications

Multimodal fine-tuning is converging with the RL-from-feedback stack: techniques pioneered on text (GRPO from DeepSeek, GSPO from Qwen) are now first-class in TRL, the standard open-source trainer.
The vLLM colocate/server integration lowers the infrastructure bar for online preference optimization, making production-grade VLM alignment accessible without custom orchestration.
Feeds the local-inference and custom-model threads: reproducible fine-tuning recipes for mid-size VLMs (e.g., Qwen2.5-VL-3B) are now as composable as text-only pipelines.
