2025-04-08 · HuggingFace

Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More

Source: HuggingFace · Date: 2025-04-08 · URL: https://huggingface.co/blog/leaderboard-3c3h-aragen-ifeval

Summary

Platform launch and benchmark release: MBZUAI and HuggingFace launch Arabic-Leaderboards, a unified hub for Arabic AI evaluations. It ships AraGen-03-25 (340 Q&A pairs across question answering, reasoning, safety, and orthography) and Arabic IFEval, the first public Arabic instruction-following benchmark, adapted from Google's IFEval with Arabic-specific linguistic constraints. Top AraGen-03-25 score: o1-2024-12-17 at 70.25%, down from 82.67% on the prior version, which points to a harder benchmark. On instruction following, all models score 5–15% lower on Arabic prompts than on English ones.
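For readers unfamiliar with IFEval-style evaluation: the idea is that each prompt carries instructions whose fulfilment can be checked programmatically, and a prompt counts as passed only if every check succeeds. The sketch below illustrates that mechanism only. The three checkers (`no_latin_letters`, `contains_diacritics`, `max_words`), the sample responses, and the `prompt_level_accuracy` helper are hypothetical and are not taken from Arabic IFEval, whose actual constraints may look quite different.

```python
import re
from typing import Callable, Dict, List

# Hypothetical checkers for illustration; not the benchmark's real constraints.
LATIN_LETTER = re.compile(r"[A-Za-z]")
TASHKEEL = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun

Check = Callable[[str], bool]


def no_latin_letters(response: str) -> bool:
    """True when the response contains no ASCII Latin letters."""
    return LATIN_LETTER.search(response) is None


def contains_diacritics(response: str) -> bool:
    """True when at least one Arabic diacritic mark is present."""
    return TASHKEEL.search(response) is not None


def max_words(limit: int) -> Check:
    """Build a checker that passes when the response has at most `limit` words."""
    def check(response: str) -> bool:
        return len(response.split()) <= limit
    return check


def follows_all(response: str, checks: List[Check]) -> bool:
    """A prompt passes only if every one of its checks passes."""
    return all(check(response) for check in checks)


def prompt_level_accuracy(samples: List[Dict]) -> float:
    """Fraction of samples whose response satisfies all attached checks."""
    passed = sum(follows_all(s["response"], s["checks"]) for s in samples)
    return passed / len(samples)


# Made-up responses paired with the checks their (imaginary) prompts require.
SAMPLES = [
    {"response": "مَرْحَبًا بِكُمْ",
     "checks": [no_latin_letters, contains_diacritics, max_words(5)]},
    {"response": "هذا نص يحتوي على كلمة API",
     "checks": [no_latin_letters]},
    {"response": "مرحبا بكم",
     "checks": [contains_diacritics]},
    {"response": "أهلا وسهلا",
     "checks": [no_latin_letters, max_words(3)]},
]

if __name__ == "__main__":
    print(f"prompt-level accuracy: {prompt_level_accuracy(SAMPLES):.2f}")
```

Because every check is a pure function of the response text, scores of this kind are reproducible without a judge model, which is part of why the format suits a public leaderboard.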

Implications

Thread: open-weights ecosystem health. Arabic-specific benchmarks are infrastructure, not research novelty: without them, Arabic-language model quality is opaque. The Arabic IFEval result (all models at roughly 60–73% on Arabic vs. 65–88% on English) quantifies a real gap that broad multilingual claims paper over. The 3C3H framework finding that conciseness is the dimension uncorrelated with the rest is a useful signal for Arabic deployment: models are more verbose in Arabic contexts. Watch whether the planned multimodal Arabic evals attract the same attention as the text benchmarks.
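The figures quoted in this note can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear above; the variable and function names are illustrative. One caveat: comparing the endpoints of the two score ranges is a rough consistency check, not a per-model gap, since the note does not give per-model pairs.

```python
# Figures as quoted in this note; nothing is pulled from the leaderboard itself.
ARAGEN_TOP_PREVIOUS = 82.67  # o1-2024-12-17 on the prior AraGen version (%)
ARAGEN_TOP_CURRENT = 70.25   # o1-2024-12-17 on AraGen-03-25 (%)

ARABIC_IFEVAL_RANGE = (60.0, 73.0)   # approximate low/high across models (%)
ENGLISH_IFEVAL_RANGE = (65.0, 88.0)  # approximate low/high across models (%)


def point_drop(previous: float, current: float) -> float:
    """Absolute drop in percentage points."""
    return round(previous - current, 2)


def relative_drop(previous: float, current: float) -> float:
    """Drop expressed as a percentage of the previous score."""
    return round(100.0 * (previous - current) / previous, 2)


def endpoint_gaps(arabic: tuple, english: tuple) -> tuple:
    """English minus Arabic, at the low end and at the high end of the ranges."""
    return (english[0] - arabic[0], english[1] - arabic[1])


aragen_points = point_drop(ARAGEN_TOP_PREVIOUS, ARAGEN_TOP_CURRENT)
aragen_relative = relative_drop(ARAGEN_TOP_PREVIOUS, ARAGEN_TOP_CURRENT)
low_gap, high_gap = endpoint_gaps(ARABIC_IFEVAL_RANGE, ENGLISH_IFEVAL_RANGE)

if __name__ == "__main__":
    print(f"AraGen top-score drop: {aragen_points} points "
          f"({aragen_relative}% relative)")
    print(f"IFEval endpoint gaps: {low_gap} to {high_gap} points")
```

The endpoint gaps line up with the 5–15 spread cited in the summary, which suggests that figure is best read as percentage points rather than a relative change.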
