2026-04-16 · HuggingFace

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

infrastructure

Source: HuggingFace · Date: 2026-04-16 · URL: https://huggingface.co/blog/train-multimodal-sentence-transformers

Summary

Library tutorial for training and finetuning multimodal embedding and reranker models in Sentence Transformers. Key result: finetuning Qwen3-VL-Embedding-2B on Visual Document Retrieval data improved NDCG@10 from 0.888 to 0.947, beating the 4x larger Qwen3-VL-Embedding-8B (0.923). With Matryoshka training, embeddings truncated to 512 dimensions (4x smaller) retain 99.7% of full-dimension performance, and embeddings truncated to 64 dimensions (32x smaller) retain 92.4%. The post includes a full training script built on SentenceTransformerTrainer.
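For readers less familiar with the metric behind the key result, here is a minimal sketch of how NDCG@10 is computed for a single query. It is not taken from the post: the function names are illustrative, it assumes linear gain (identical to the exponential form for binary relevance labels), and it assumes the input list covers every relevant document for the query so that the ideal ranking can be derived by sorting.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels, in ranked order."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    Assumption: ranked_relevances contains the labels of all candidate
    documents for the query, so sorting it yields the ideal ordering.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with one relevant document.
print(ndcg_at_k([1, 0, 0]))  # relevant document ranked first
print(ndcg_at_k([0, 1, 0]))  # relevant document ranked second
```

A benchmark score such as 0.947 is the mean of this per-query value over all evaluation queries, so a move from 0.888 to 0.947 means relevant pages land noticeably closer to the top of the ranking on average.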

Implications

Thread: transformers library trajectory / open-weights ecosystem health. A finetuned 2B model beating the base 8B model on a domain task is the key signal here: multimodal embedding is becoming a finetunable discipline, not just a foundation-model selection problem. Matryoshka multimodal embeddings are particularly relevant for production: a 4x cut in dimensions at roughly 0.3% quality loss changes the storage and index economics of multimodal RAG. This is the companion training guide to the earlier inference-focused multimodal Sentence Transformers post; together they cover a full pipeline for custom multimodal retrieval.
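The storage argument can be made concrete with a rough back-of-the-envelope sketch. Everything here is illustrative rather than taken from the post: the full dimension of 2048 is inferred from the summary's "512 dimensions (4x smaller)", the corpus size of one million vectors is hypothetical, and float32 storage with no index overhead is assumed. The helper names are made up for this example.

```python
import math


def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` values of a Matryoshka embedding and re-normalize to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def index_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw vector storage in bytes (float32 by default), ignoring index overhead."""
    return n_vectors * dims * bytes_per_value


N_VECTORS = 1_000_000  # hypothetical corpus size
for dims in (2048, 512, 64):  # 2048 is inferred, not stated in the source
    size_gb = index_bytes(N_VECTORS, dims) / 1e9
    print(f"{dims:>4} dims: {size_gb:.3f} GB")
# Prints 8.192 GB, 2.048 GB, and 0.256 GB respectively.
```

Truncation only works this cheaply because Matryoshka training concentrates the most useful information in the leading dimensions. In Sentence Transformers the same effect is typically obtained by passing `truncate_dim` when loading a `SentenceTransformer` model, rather than slicing vectors by hand.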
