2025-02-19 · HuggingFace

PaliGemma 2 Mix - New Instruction Vision Language Models by Google

Source: HuggingFace · Date: 2025-02-19 · URL: https://huggingface.co/blog/paligemma2mix

Summary

Model release: PaliGemma 2 Mix, Google’s instruction-tuned VLMs, fine-tuned on a mixture of vision-language tasks and released in 3B, 10B, and 28B sizes at 224px and 448px input resolution. Task coverage: VQA, document understanding (charts, infographics), OCR/text recognition, and localization (detection, segmentation). Qualitative results are shown for counting, text extraction, and document QA. The post reports no standard benchmark numbers; it frames these models as demonstrations of what fine-tuning PaliGemma 2’s pretrained checkpoints can achieve, not as general-purpose chat models.
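
As a rough illustration of the release grid and task coverage described above, the sketch below enumerates the size/resolution combinations and builds task-style text prompts. The sizes and resolutions come from the summary; the Hub naming pattern (`google/paligemma2-{size}-mix-{res}`) and the prompt prefixes (`caption`, `ocr`, `answer`, `detect`, `segment`) are assumptions based on PaliGemma family conventions, not something the summary states, and the helper names `checkpoint_ids` and `build_prompt` are hypothetical.

```python
from itertools import product

# Sizes and input resolutions as listed in the summary.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448)


def checkpoint_ids(org="google"):
    """Enumerate every size/resolution combination.

    ASSUMPTION: the '{org}/paligemma2-{size}-mix-{res}' naming pattern is
    inferred from PaliGemma conventions; verify against the Hub before use.
    """
    return [
        f"{org}/paligemma2-{size}-mix-{res}"
        for size, res in product(SIZES, RESOLUTIONS)
    ]


def build_prompt(task, arg="", lang="en"):
    """Build a task-prefixed text prompt (hypothetical helper).

    ASSUMPTION: the prefixes below follow PaliGemma-style task prompting;
    they are illustrative, not taken from the summarized post.
    """
    needs_arg = {"vqa", "detect", "segment"}
    templates = {
        "caption": f"caption {lang}",
        "ocr": "ocr",
        "vqa": f"answer {lang} {arg}",
        "detect": f"detect {arg}",
        "segment": f"segment {arg}",
    }
    if task not in templates:
        raise ValueError(f"unknown task: {task!r}")
    if task in needs_arg and not arg:
        raise ValueError(f"task {task!r} needs an argument")
    return templates[task]


if __name__ == "__main__":
    for ckpt in checkpoint_ids():
        print(ckpt)
    print(build_prompt("vqa", "How many invoices are on the desk?"))
    print(build_prompt("detect", "cat"))
```

In practice each prompt string would be passed to the model together with an image; picking the 224px or 448px variant is a trade-off between compute and the fine detail needed for tasks such as OCR.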

Implications

Open-weights ecosystem health. Releasing PaliGemma 2 Mix as fine-tuning showcases rather than production models is an interesting framing: it positions the pretrained checkpoints as the primary artifact and the mix models as evidence of what is achievable, inviting the community to fine-tune their own task-specific variants. This differs from the “one model does everything” approach and is more upfront about task-specific model design.

Model release cadence. The 28B model at 448px is the largest, highest-resolution configuration in the release, and plausibly the strongest option in the PaliGemma family for document understanding. With Google releasing 3B/10B/28B fine-tunes and training code available, the fine-tuning methodology should be reproducible. That is particularly relevant for teams doing document extraction, where PaliGemma’s text recognition capabilities are strong relative to general-purpose VLMs.
