2024-12-05 · HuggingFace

Welcome PaliGemma 2 – New vision language models by Google

Source: HuggingFace · Date: 2024-12-05 · URL: https://huggingface.co/blog/paligemma2

Summary

Model release: PaliGemma 2, Google’s vision-language model family pairing the SigLIP vision encoder with a Gemma 2 language decoder. Three sizes (3B, 10B, 28B) at three input resolutions (224×224, 448×448, 896×896), yielding 9 pre-trained checkpoints. DOCCI fine-tuned captioning: 3B scores NES 28.4 and 10B scores NES 20.3, where NES counts non-entailment sentences (lower means fewer factual errors). TextVQA accuracy for the 3B model at 224×224: 60.04% (bfloat16), 59.78% (8-bit), 58.72% (4-bit), so quantization costs little accuracy. Gemma license allows redistribution, commercial use, and fine-tuning.
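
The arithmetic behind these figures can be checked with a short sketch. It only uses the numbers quoted in the summary; the function names `checkpoint_grid` and `quantization_drop` are illustrative and do not come from the source.

```python
from itertools import product

# Figures quoted in the summary above.
SIZES = ["3B", "10B", "28B"]
RESOLUTIONS = [224, 448, 896]
# TextVQA accuracy (percent) for the 3B model at 224x224, per precision.
TEXTVQA_ACC = {"bfloat16": 60.04, "8-bit": 59.78, "4-bit": 58.72}


def checkpoint_grid(sizes, resolutions):
    """Return every (size, resolution) pair: one pre-trained checkpoint each."""
    return list(product(sizes, resolutions))


def quantization_drop(acc, baseline="bfloat16"):
    """Map each quantized precision to (drop in points, relative drop in %)."""
    base = acc[baseline]
    return {
        name: (round(base - value, 2), round(100 * (base - value) / base, 2))
        for name, value in acc.items()
        if name != baseline
    }


grid = checkpoint_grid(SIZES, RESOLUTIONS)
drops = quantization_drop(TEXTVQA_ACC)

print(len(grid), "pre-trained checkpoints")
for name, (points, relative) in drops.items():
    print(f"{name}: -{points} points ({relative}% relative to bfloat16)")
```

The sketch keeps the absolute drop (percentage points) separate from the relative drop because the two are easy to conflate when reading quantization results.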

Implications

Open-weights ecosystem health. Spanning 3B to 28B parameters with a Gemma 2 decoder, PaliGemma 2 is Google’s most capable open vision-language family at the time of release. The 28B size at 896×896 resolution targets document and high-resolution image understanding, a segment where open-weights models were previously weak. The small cost of 4-bit quantization (about 1.3 percentage points on TextVQA, from 60.04% to 58.72%, measured on the 3B model) makes the smallest model a practical candidate for edge deployment.
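
To give a rough sense of why lower precision matters for deployment, the sketch below estimates weight storage per model size. It is a back-of-the-envelope calculation under loud assumptions: parameter counts are the nominal values implied by the model names (3B, 10B, 28B), not exact counts, and the estimate ignores activations, the KV cache, and quantization metadata. None of these memory figures appear in the source.

```python
# Assumption: nominal parameter counts read off the model names; real counts differ.
NOMINAL_PARAMS = {"3B": 3e9, "10B": 10e9, "28B": 28e9}

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"bfloat16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gb(params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


# Nested table: size -> precision -> estimated GB for the weights alone.
table = {
    size: {prec: weight_gb(n, b) for prec, b in BYTES_PER_PARAM.items()}
    for size, n in NOMINAL_PARAMS.items()
}

for size, row in table.items():
    cells = ", ".join(f"{prec}: {gb:.1f} GB" for prec, gb in row.items())
    print(f"{size} -> {cells}")
```

Under these assumptions, 4-bit weights take a quarter of the bfloat16 footprint, which is the main reason a small accuracy drop can be an acceptable trade on memory-constrained hardware.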

HF as open-source ML hub. Google released PaliGemma 2 with same-day Transformers integration (v4.47+), fine-tuning scripts, and a live demo on Hugging Face (HF). This cements HF as the distribution layer for Google’s open research releases, following the same pattern as Gemma 2 and subsequent Gemma releases.
