2025-03-04 · HuggingFace

A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality

models

Source: HuggingFace · Date: 2025-03-04 · URL: https://huggingface.co/blog/aya-vision

Summary

Model release from Cohere For AI: the Aya Vision family (8B and 32B), open-weight multilingual VLMs supporting 23 languages. Architecture: SigLIP2-patch14-384 vision encoder + pixel shuffle downsampling (merges neighboring patch embeddings to cut the number of image tokens) + a language-model backbone, Command R7B for the 8B model and Aya Expanse 32B for the 32B model. Aya Vision 8B reaches win rates of up to 70% on AyaVisionBench and up to 79% on mWildVision against comparable-size VLMs; Aya Vision 32B reaches 50-64% win rates on AyaVisionBench against models more than 2x its size. Accompanied by two new multilingual evaluation datasets (AyaVisionBench, mWildVision).
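To make the pixel shuffle step concrete, here is a minimal pure-Python sketch of the space-to-depth style merge usually meant by that term in VLM connectors. The shuffle factor `r=2`, the toy grid size, and the channel ordering are assumptions for illustration; the summary does not state the values Aya Vision uses.

```python
def pixel_shuffle_downsample(grid, r=2):
    """Merge each r x r block of patch embeddings into one longer token.

    grid: list of rows, each row a list of embedding vectors (lists of floats).
    Returns a grid with 1/r as many rows and columns, where each vector is
    r*r times longer. The factor r is an assumed value, not taken from the source.
    """
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid sides must be divisible by r")
    out = []
    for i in range(0, height, r):
        row = []
        for j in range(0, width, r):
            merged = []
            # Concatenate the r*r neighbors in row-major order.
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# Toy input: a 4 x 4 grid of 3-dimensional patch embeddings (16 tokens).
grid = [[[float(4 * i + j)] * 3 for j in range(4)] for i in range(4)]
shuffled = pixel_shuffle_downsample(grid, r=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(shuffled) * len(shuffled[0])
print(tokens_before, tokens_after, len(shuffled[0][0]))
```

The token count drops by a factor of `r * r` while each token grows by the same factor, so no information is discarded before the connector projects the tokens into the language model's embedding space.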

Implications

Thread: open-weights ecosystem health / model release cadence. Aya Vision is one of the few open-weight VLM families designed specifically for multilingual image understanding; most existing VLMs (Qwen2.5-VL, LLaVA, Pixtral) are English-primary and treat multilingual support as an afterthought. The 23-language coverage and the new multilingual VLM benchmarks fill a real gap in the evaluation landscape. The model merging technique (combining the base LM's weights with those of the fine-tuned VLM) is an interesting training-efficiency approach worth tracking. Watch whether other labs adopt AyaVisionBench as a standard for multilingual VLM evaluation.
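As a rough illustration of the model merging idea, the sketch below linearly interpolates the parameters shared by a base LM and a fine-tuned VLM. Linear interpolation, the `alpha` weight, and the parameter names are all assumptions for illustration; the summary only says that the base LM is combined with the fine-tuned VLM and does not specify the recipe.

```python
def merge_weights(base_lm, finetuned_vlm, alpha=0.5):
    """Interpolate parameters shared by a base LM and a fine-tuned VLM.

    base_lm, finetuned_vlm: dicts mapping parameter names to lists of floats.
    alpha: hypothetical weight on the fine-tuned VLM (0 = base LM only).
    Parameters that exist only in the VLM (e.g. vision encoder, connector)
    are copied through unchanged.
    """
    merged = {}
    for name, tuned in finetuned_vlm.items():
        if name in base_lm:
            base = base_lm[name]
            merged[name] = [(1 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
        else:
            merged[name] = list(tuned)
    return merged


# Hypothetical toy checkpoints with made-up parameter names.
base_lm = {"lm.layer0": [0.0, 2.0]}
finetuned_vlm = {"lm.layer0": [1.0, 4.0], "vision.proj": [0.5]}

merged = merge_weights(base_lm, finetuned_vlm, alpha=0.5)
print(merged)
```

The appeal of this kind of merge is that it needs no additional training: text-only ability from the base LM and image understanding from the fine-tuned VLM are traded off through a single scalar.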