2024-07-25 · HuggingFace

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

research

Source: HuggingFace · Date: 2024-07-25 · URL: https://huggingface.co/blog/zero-shot-vqa-docmatix

Summary

Research summary applying LAVE (LLM-Assisted VQA Evaluation), an LLM-as-judge metric for visual question answering that tracks human preference better than string-overlap metrics (CIDEr, BLEU, ANLS). On Docmatix zero-shot evaluation, LAVE scores 0.58 vs ANLS at 0.002 for the same model, which suggests that traditional metrics systematically undervalue semantically correct but lexically different answers. Raises the question of whether fine-tuning for metric scores is the right optimization target.
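To make the gap concrete, here is a minimal sketch of the standard per-question ANLS computation (normalized Levenshtein similarity with a 0.5 threshold). It is not code from the source post; the function names and the example question are illustrative assumptions. It shows how a correct but verbose answer is scored as a complete miss.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls_single(prediction: str, references: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best similarity over references, zeroed past tau."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gt = ref.strip().lower()
        denom = max(len(pred), len(gt))
        nl = levenshtein(pred, gt) / denom if denom else 0.0
        # Answers whose normalized distance reaches the threshold score zero.
        score = 1.0 - nl if nl < tau else 0.0
        best = max(best, score)
    return best


# Hypothetical example: the reference answer is the bare year.
terse = anls_single("2019", ["2019"])
verbose = anls_single("The report was published in 2019.", ["2019"])
near_miss = anls_single("2O19", ["2019"])  # OCR-style typo, letter O for zero
print(terse, verbose, near_miss)
```

The full-sentence answer contains the right year, yet its normalized edit distance from the reference is far above the threshold, so it receives no credit. An LLM judge that compares meaning rather than characters can rate the same answer as correct.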

Implications

Thread: open-weights ecosystem health / HF as open-source ML hub. LAVE is part of a broader movement toward LLM-as-judge evaluation that’s reshaping how VLM quality is measured. If fine-tuning to a metric optimizes for the wrong target, then the whole pipeline for document VQA benchmarking needs revisiting. This is directly relevant for anyone building RAG pipelines on document corpora, where evaluation methodology matters as much as model selection. Watch whether LAVE gets adopted upstream in lighteval or EleutherAI’s evaluation harness.
