2025-01-10 · HuggingFace

Visual Document Retrieval Goes Multilingual

Source: HuggingFace · Date: 2025-01-10 · URL: https://huggingface.co/blog/vdr-2b-multilingual

Summary

Model release: vdr-2b-multi-v1, a multilingual embedding model for visual document retrieval that supports Italian, Spanish, English, French, and German. It encodes document page screenshots directly into dense vectors, so no OCR, chunking, or text extraction is required. It is built on MrLight/dse-qwen2-2b-mrl-v1 and uses 768 image tokens per page versus 2560 for the base model, for roughly 3x faster inference. Benchmarked on ViDoRe with NDCG@5 against the base model: German +3.4%, French +2.2%, and cross-lingual retrieval (Italian queries on German documents) +2.3%.
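To make the retrieval and scoring steps concrete, here is a minimal sketch of dense retrieval followed by NDCG@5. It does not call the model: the 3-dimensional vectors are toy stand-ins for page and query embeddings, and the page ids, the relevance labels, and the helper names (cosine, rank_pages, ndcg_at_k) are hypothetical. The metric uses linear gain, which is one common NDCG convention and may differ in detail from the benchmark's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_pages(query_vec, page_vecs):
    """Return page ids sorted by descending similarity to the query."""
    scores = {pid: cosine(query_vec, vec) for pid, vec in page_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def ndcg_at_k(ranked_ids, relevance, k=5):
    """NDCG@k with linear gain: DCG of the ranking divided by the ideal DCG."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy vectors standing in for real embeddings (assumption, not model output).
pages = {
    "page_a": [0.9, 0.1, 0.0],
    "page_b": [0.1, 0.9, 0.0],
    "page_c": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]
relevance = {"page_a": 1}  # hypothetical ground truth: only page_a is relevant

ranking = rank_pages(query, pages)
score = ndcg_at_k(ranking, relevance, k=5)
print(ranking, score)
```

With real embeddings the page vectors would be computed once and stored in an index, and only the query would be embedded at search time.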

Implications

Open-weights ecosystem health. Visual document retrieval without OCR closes a meaningful capability gap: most enterprise document search still depends on extraction pipelines that fail on complex layouts, tables, and images. A 2B-parameter model that handles five languages and uses Matryoshka embeddings (1024 dimensions at 99% quality retention) is deployable on modest hardware.
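Matryoshka embeddings are trained so that a prefix of the vector remains usable on its own. A minimal sketch of the usual way to shrink one: keep the first components, then rescale to unit length so cosine and dot-product scores stay comparable. The full width of 1536 is an assumption, since this summary does not state it. The vector values are placeholders, and truncate_embedding is a hypothetical helper rather than part of any library.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim > len(vec):
        raise ValueError("dim exceeds embedding length")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

FULL_DIM = 1536  # assumption: full embedding width, not given in the summary
full = [math.sin(i + 1) for i in range(FULL_DIM)]  # placeholder values, not model output

short = truncate_embedding(full, 1024)
print(len(short))
```

Under that assumed width, storing 1024 instead of 1536 floats per page cuts index size by about a third.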

HF as open-source ML hub. The accompanying vdr-multilingual-train dataset (500k synthetic query-image pairs from ~50k public PDFs) is 10x larger than prior open-source equivalents. It was released alongside the model, which makes the full training pipeline reproducible. This is the HF ecosystem functioning as intended: model, dataset, and training methodology published together.
