2024-07-10 · HuggingFace

Preference Optimization for Vision Language Models

models · research

Source: HuggingFace · Date: 2024-07-10 · URL: https://huggingface.co/blog/dpo_vlm

Summary

Library update plus tutorial: TRL adds DPO (Direct Preference Optimization) support for vision language models (Idefics2, LLaVA 1.5, PaliGemma). The tutorial covers dataset preparation with RLAIF-V-Dataset (83k+ annotated rows), memory optimization via bfloat16 and LoRA (memory requirement drops from 160GB to 32.3GB; trainable parameters drop from 8B to 55M), and a full training script. On the AMBER hallucination benchmark, DPO-fine-tuned Idefics2 scores 85.9 accuracy / 89.4 F1 versus 85.8 / 89.1 for the baseline, a gain of 0.1 accuracy points and 0.3 F1 points: modest but measurable.
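A back-of-the-envelope count shows how figures like 160GB and 32.3GB can arise. The sketch below is an assumed accounting, not one stated in this summary: it counts a policy model and a frozen reference model both held in memory, gradients for trainable parameters only, two Adam moment tensors per trainable parameter, and decimal gigabytes. It ignores activations. The function name dpo_memory_gb is hypothetical.

```python
GB = 1e9  # decimal gigabytes (assumption)


def dpo_memory_gb(total_params, trainable_params, bytes_per_param):
    """Rough DPO training memory in GB, ignoring activations.

    Assumed accounting: policy weights + reference weights
    + gradients + two Adam moment tensors for trainable parameters.
    """
    policy = total_params * bytes_per_param
    reference = total_params * bytes_per_param
    gradients = trainable_params * bytes_per_param
    optimizer = 2 * trainable_params * bytes_per_param
    return (policy + reference + gradients + optimizer) / GB


# Full fine-tune in float32: all 8B parameters trainable, 4 bytes each.
full_fp32 = dpo_memory_gb(8e9, 8e9, 4)

# bfloat16 + LoRA: 55M trainable parameters, 2 bytes each.
bf16_lora = dpo_memory_gb(8e9, 55e6, 2)

print(f"float32, full fine-tune: {full_fp32:.1f} GB")
print(f"bfloat16 + LoRA:         {bf16_lora:.1f} GB")
```

Under these assumptions the float32 case comes to 160GB and the bfloat16 plus LoRA case to about 32.3GB, consistent with the figures in the summary. Nearly all of the remaining memory is the two 16GB copies of the model weights.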

Implications

Transformers library trajectory. DPO for VLMs in TRL means the preference-optimization workflow that improved text LLMs is now available for multimodal models through a standard library. The 32.3GB memory footprint for an 8B VLM fine-tune fits on a single 40GB or 80GB cloud GPU, though it exceeds the 24GB of most consumer cards, so hallucination-reduction fine-tuning no longer requires a multi-GPU setup.
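For readers unfamiliar with the objective behind this workflow, the per-example DPO loss can be sketched in plain Python. This is the standard DPO formulation written out for illustration, not TRL's implementation; the function name dpo_loss and the default beta of 0.1 are assumptions. For a VLM, each log-probability is that of an answer conditioned on both the image and the prompt.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full answer
    (chosen or rejected) under the policy or the frozen reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is zero, loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Policy favors the chosen answer more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the chosen answer's likelihood relative to the rejected one, measured against the reference model. This is why training needs a reference copy of the model, or a way to recover its outputs, alongside the policy.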

Open-weights ecosystem health. Hallucination is one of the most common practical complaints about open-weights multimodal models in production. A reproducible DPO recipe that measurably reduces it, even if only marginally, on accessible hardware narrows a gap that previously took proprietary RLHF infrastructure to address.
