Finally, a Replacement for BERT: Introducing ModernBERT
Source: Hugging Face
Date: 2024-12-19
URL: https://huggingface.co/blog/modernbert
Summary
Model release: ModernBERT (Answer.AI + LightOn) is an encoder-only model in two sizes, ModernBERT-Base (149M parameters) and ModernBERT-Large (395M parameters), trained on 2 trillion tokens that include code and scientific text. Its context window is 8192 tokens (vs BERT’s 512). Benchmarks: it beats DeBERTaV3 on GLUE and on code retrieval tasks, gains 9 points over DeBERTaV3 on BEIR when used for ColBERT retrieval, and runs inference 2x faster than DeBERTaV3 on GPU. Architecture improvements: rotary positional embeddings, Flash Attention 2, alternating global/local attention, and no token-type embeddings.
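The alternating global/local attention pattern can be illustrated with a small pure-Python sketch. It is not the model's actual implementation: the "every third layer is global" schedule and the 128-token window are assumptions taken as defaults here (they are not stated in this summary), and the function names are hypothetical.

```python
def is_global_layer(layer_idx, global_every=3):
    """Assumed schedule: layers 0, 3, 6, ... attend globally, the rest locally."""
    return layer_idx % global_every == 0


def attention_mask(seq_len, layer_idx, window=128, global_every=3):
    """Return a seq_len x seq_len boolean mask; True means token i may attend to token j."""
    if is_global_layer(layer_idx, global_every):
        # Global layer: every token attends to every other token.
        return [[True] * seq_len for _ in range(seq_len)]
    # Local layer: each token attends only to neighbours within window // 2 positions.
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(seq_len, layer_idx, window=128, global_every=3):
    """Count allowed (query, key) pairs, a rough proxy for attention cost in that layer."""
    mask = attention_mask(seq_len, layer_idx, window, global_every)
    return sum(sum(row) for row in mask)
```

Under these assumptions, the cost of a global layer grows quadratically with sequence length, while a local layer grows roughly linearly (sequence length times window size), which is why mixing the two helps at an 8192-token context.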
Implications
Open-weights ecosystem health. ModernBERT is a direct replacement for the BERT/DeBERTa encoder family that underlies most production classification, retrieval, and reranking pipelines. The +9pt ColBERT improvement on BEIR is the number that matters for retrieval system operators: existing pipelines that use DeBERTaV3 as a reranker can upgrade with no architectural changes and expect meaningful recall improvements. Including code in the training data makes it immediately relevant for software retrieval tasks, where BERT-class models were previously not competitive.
Transformers library trajectory. The 2x inference speedup vs DeBERTaV3 comes from architectural choices (Flash Attention 2, alternating local attention) rather than quantization, so it is available at full precision without post-training optimization steps. For teams running encoder inference at scale, such as embedding services and reranking APIs, this is a throughput improvement that arrives by swapping the checkpoint, not by retooling the stack.
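As a rough illustration of what "swapping the checkpoint" might look like, here is a minimal sketch using the Hugging Face transformers Auto classes. It assumes the Hub model IDs "microsoft/deberta-v3-base" and "answerdotai/ModernBERT-base", a transformers release that includes ModernBERT support, and an installed PyTorch. The helper names and the mean-pooling step are illustrative choices, not something the source prescribes.

```python
# Assumed Hub model IDs; only this constant changes when swapping encoders.
OLD_CHECKPOINT = "microsoft/deberta-v3-base"
ENCODER_CHECKPOINT = "answerdotai/ModernBERT-base"


def load_encoder(checkpoint=ENCODER_CHECKPOINT):
    """Load tokenizer and encoder for a checkpoint (downloads weights on first use)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()
    return tokenizer, model


def embed(texts, tokenizer, model):
    """Mean-pool the last hidden states over non-padding tokens (illustrative pooling)."""
    import torch

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    # Zero out padding positions before averaging over the sequence dimension.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

A caller would run `tokenizer, model = load_encoder()` followed by `embed(["some text"], tokenizer, model)`. Downstream code that only consumes the returned tensor should not need to change, although a fine-tuned reranking or embedding head would still have to be retrained on the new encoder.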