2024-07-18 · HuggingFace

Docmatix - a huge dataset for Document Visual Question Answering

Source: HuggingFace · Date: 2024-07-18 · URL: https://huggingface.co/blog/docmatix

Summary

Dataset release: Docmatix contains 2.4M images and 9.5M Q&A pairs drawn from 1.3M PDF documents, making it 240x larger than prior DocVQA datasets. It was generated from PDFA (2.1M OCR’d PDFs), with Phi-3-small synthesizing the Q&A pairs; 15% of the generated pairs were filtered out as hallucinations. Key result: Florence-2 (700M) fine-tuned on 20% of the Docmatix images and 4% of the Q&A pairs scores 71.4 ANLS on DocVQA, versus 60.1 for the same model fine-tuned on the standard DocVQA dataset (an 11.3-point gain, roughly 20% relative). For comparison, Idefics2 (8B) scores 74.0 ANLS.
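The scores above are ANLS (Average Normalized Levenshtein Similarity), the usual DocVQA metric. As a minimal sketch of how such a score is computed, assuming the common definition with a 0.5 threshold and simple lowercase/whitespace normalization (the official evaluation script may normalize differently), the metric looks like this. The function names `levenshtein` and `anls` are illustrative, not taken from the source, and reported figures such as 71.4 correspond to the returned value multiplied by 100.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity in the range [0, 1].

    predictions: list of predicted answer strings, one per question.
    references:  list of lists of accepted answers, one list per question.
    threshold:   normalized distances at or above this score 0 (assumed 0.5).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)  # keep the best-matching accepted answer
        total += best
    return total / len(predictions)


# Toy example: one near-miss answer and one completely wrong answer.
toy_score = anls(["abcd", "xyz"], [["abce"], ["abc"]])
print(f"toy ANLS: {toy_score * 100:.1f}")
```

The threshold means a badly wrong answer earns no partial credit, while small OCR-style slips are only lightly penalized.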

Implications

Open-weights ecosystem health. A roughly 20% relative DocVQA improvement from training on a fraction of Docmatix (versus the full standard DocVQA dataset) suggests that, for document understanding tasks, dataset scale can outweigh dataset curation, though the evidence is one ablation on one model. The pipeline of Phi-3-small Q&A generation followed by filtering out the 15% of pairs flagged as hallucinations is a reproducible synthetic-data recipe for document QA at scale.

HF as open-source ML hub. At 240x the size of prior open DocVQA datasets, Docmatix is a meaningful contribution to the open training-data ecosystem for document VLMs; before it, document-understanding fine-tuning was limited by the scarcity of labeled DocVQA examples. Florence-2 (700M) reaching 71.4 ANLS recovers most of the difference between its 60.1 baseline and the 74.0 of Idefics2 (8B), which supports both the quality of the dataset and the value of large synthetic corpora for training small models.
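The two numerical claims in these implications (the "~20% relative improvement" and "closing most of the gap") can be checked directly from the three ANLS figures in the summary. The short sketch below does that arithmetic; the variable and function names are illustrative only. The relative gain works out to roughly 18.8%, which is what gets rounded to about 20%.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# ANLS figures quoted in the summary above.
florence2_docvqa = 60.1    # Florence-2 fine-tuned on the standard DocVQA dataset
florence2_docmatix = 71.4  # Florence-2 fine-tuned on a Docmatix subset
idefics2 = 74.0            # Idefics2 (8B) reference score

absolute_gain = florence2_docmatix - florence2_docvqa
relative_gain = relative_improvement(florence2_docvqa, florence2_docmatix)

# Share of the baseline-to-Idefics2 gap that the Docmatix run recovers.
gap_closed = absolute_gain / (idefics2 - florence2_docvqa)

print(f"absolute gain: {absolute_gain:.1f} ANLS points")
print(f"relative gain: {relative_gain:.1%}")
print(f"gap to Idefics2 closed: {gap_closed:.1%}")
```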
