2026-01-27 · HuggingFace

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

models · research

Source: HuggingFace · Date: 2026-01-27 · URL: https://huggingface.co/blog/tiiuae/emirati-benchmarks

Summary

Research benchmark release: Alyah is a 1,173-sample benchmark for evaluating how well Arabic LLMs understand the Emirati dialect. It spans 7 categories: greetings, religious/social, figurative meaning, etiquette, poetry, heritage, and language/dialect. 54 models were evaluated (23 base, 31 instruction-tuned). Top results: Falcon-H1-Arabic-7B-Instruct leads at 82.18%, followed by ALLaM-7B-Instruct at 77.24% and Qwen2.5-72B-Instruct at 74.6%. Key finding: instruction-tuned models outperform base models by 5-10%; “Language & Dialect” and “Greetings” are the hardest categories, which is attributed to the scarcity of written Emirati dialect in training data.

Implications

Open-weights ecosystem health. Falcon-H1-Arabic-7B-Instruct scoring 82.18% on a culturally specific Emirati benchmark is a strong result for a 7B model. It suggests that Arabic-native models with targeted training can outperform much larger multilingual models (Qwen2.5-72B-Instruct at 74.6%) on regional dialects. This pattern will likely repeat across other under-represented language variants as benchmarks become available.

Model release cadence (regional models). TII UAE’s release of Alyah alongside Falcon-H1-Arabic signals an accelerating pattern: Gulf state AI labs are building both the models and the benchmarks for their regional languages at the same time. The benchmark paves the way for broader adoption by validating the model’s real-world capability in a way that evaluation on Modern Standard Arabic (MSA) alone cannot.
