2025-07-01 · HuggingFace

Training and Finetuning Sparse Embedding Models with Sentence Transformers

Source: HuggingFace Date: 2025-07-01 URL: https://huggingface.co/blog/train-sparse-encoder

Summary

Integration tutorial: a full training pipeline for sparse embedding models in Sentence Transformers. It covers the SPLADE, inference-free SPLADE, and CSR (Contrastive Sparse Representation) architectures, all trained through the SparseEncoderTrainer API. The worked example fine-tunes an inference-free SPLADE model on Natural Questions (100k pairs). NanoMSMARCO benchmark results (NDCG@10): sparse-only 52.41, dense-only 55.40, hybrid sparse+dense 66.31, a reported 12-19% improvement for hybrid over either alone. The trained model reaches 99.4% sparsity, with ~184 active dimensions per document.
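As a quick sanity check on the sparsity figure, the sketch below relates the ~184 active dimensions to the total embedding width. The vocabulary size of 30,522 (the bert-base-uncased WordPiece vocabulary) is an assumption; the summary does not state the output dimensionality of the trained model.

```python
# Assumed output width: the bert-base-uncased vocabulary size.
# This value is NOT stated in the summary; it is used only for illustration.
VOCAB_SIZE = 30_522

# Approximate number of non-zero dimensions per document, from the summary.
ACTIVE_DIMS = 184


def sparsity(active_dims: int, total_dims: int) -> float:
    """Return the fraction of dimensions that are zero."""
    return 1.0 - active_dims / total_dims


print(f"{sparsity(ACTIVE_DIMS, VOCAB_SIZE):.1%}")  # prints 99.4%
```

Under that assumption the two figures in the summary agree: 184 non-zero entries out of 30,522 leaves about 99.4% of the vector empty, which is what makes inverted-index storage practical.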

Implications

Transformers library trajectory. By adding SparseEncoderTrainer alongside its dense encoder tooling, Sentence Transformers formalizes sparse retrieval as a first-class training target rather than just an inference option. The hybrid result (a reported 12-19% improvement over either retriever alone) is the key finding for RAG practitioners: teams not using hybrid retrieval are likely leaving substantial ranking quality on the table.
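One common way to combine a sparse and a dense retriever is reciprocal rank fusion (RRF). The summary does not say which fusion method produced the hybrid score, so the sketch below is an illustration of the general idea rather than the tutorial's method. The document IDs and the constant k=60 (a conventional default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs (best first) into one list.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are then sorted by their summed score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
sparse_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)  # prints ['d1', 'd3', 'd5', 'd7']
```

RRF only needs ranks, not scores, so it sidesteps the problem that sparse dot products and dense cosine similarities live on different scales. Documents found by both retrievers (d1 and d3 here) rise to the top.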

Open-weights ecosystem health. The inference-free SPLADE variant (lightweight query processing via a router) addresses the latency asymmetry of standard SPLADE, where query encoding is expensive. CSR makes sparse models trainable on top of any dense model checkpoint, which extends the technique to domain-specific embeddings: teams can add sparse retrieval to existing fine-tuned dense encoders without retraining from scratch.
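To make the latency asymmetry concrete, here is a toy sketch of the inference-free idea: the query side is a plain table lookup with no neural forward pass, while document vectors are assumed to be computed offline by the full model. All weights, tokens, and function names below are hypothetical, and whitespace tokenization stands in for a real tokenizer. This does not use the Sentence Transformers API.

```python
# Hypothetical learned per-token query weights (a stand-in for a static
# embedding table); real models learn one weight per vocabulary entry.
QUERY_TOKEN_WEIGHTS = {"sparse": 1.8, "retrieval": 1.5, "what": 0.1, "is": 0.05}


def encode_query(text):
    """Inference-free query encoding: tokenize, then look up static weights."""
    vec = {}
    for token in text.lower().split():
        weight = QUERY_TOKEN_WEIGHTS.get(token, 0.0)
        if weight > 0.0:
            vec[token] = weight  # each distinct token contributes once
    return vec


def score(query_vec, doc_vec):
    """Sparse dot product over the terms the query and document share."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


# Hypothetical document vector, as if produced offline by the full document
# encoder, including expansion terms ("search", "index") not in the text.
doc_vec = {"sparse": 1.2, "retrieval": 0.9, "search": 0.7, "index": 0.4}

query_vec = encode_query("what is sparse retrieval")
print(round(score(query_vec, doc_vec), 2))  # prints 3.51
```

The expensive work (term expansion and weighting for documents) happens at indexing time, so query-time cost is a dictionary lookup plus a sparse dot product.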
