2026-04-21 · HuggingFace

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

models · research

Source: HuggingFace · Date: 2026-04-21 · URL: https://huggingface.co/blog/tiiuae/qimma-arabic-leaderboard

Summary

Benchmark release: QIMMA is TII’s quality-first Arabic LLM leaderboard, covering 52k+ samples across 109 subsets drawn from 14 benchmarks. Validation runs in two stages: automated scoring by Qwen3-235B and DeepSeek-V3 against a 10-point rubric, then human review by native Arabic speakers. The validation discarded 3.1% of ArabicMMLU (436 of 14,163 samples) and refined 88% of HumanEval+ and 81% of MBPP+ code prompts. 99% of the content is native Arabic. It is the first Arabic leaderboard with code evaluation. Top model: Qwen3.5-397B-A17B-FP8 at 68.06. Arabic-specialized models lead on cultural tasks; multilingual models lead on coding.
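
To make the numbers above concrete, here is a minimal Python sketch that recomputes the ArabicMMLU discard rate and illustrates one way a two-judge triage step could feed human review. The sample counts come from the summary; everything else is an assumption for illustration: the `JudgeScores` type, the 0-10 score range, the `PASS_THRESHOLD` value, and the rule that both judges must pass are hypothetical, since the summary does not say how QIMMA combines the two judges' scores.

```python
from dataclasses import dataclass

# Figures reported in the summary.
ARABICMMLU_TOTAL = 14_163
ARABICMMLU_DISCARDED = 436


def discard_rate(discarded: int, total: int) -> float:
    """Return the share of removed samples as a percentage."""
    return 100.0 * discarded / total


@dataclass
class JudgeScores:
    """Hypothetical container for one sample's rubric scores."""
    sample_id: str
    judge_a: int  # rubric score from the first judge model (assumed 0-10)
    judge_b: int  # rubric score from the second judge model (assumed 0-10)


# ASSUMPTION: illustrative cutoff, not a value taken from QIMMA.
PASS_THRESHOLD = 7


def triage(scores: JudgeScores, threshold: int = PASS_THRESHOLD) -> str:
    """Stage 1 of the sketch: keep a sample only if both judges pass it.

    Anything else is routed to stage 2, human review. This routing rule
    is an assumption, not the documented QIMMA procedure.
    """
    if scores.judge_a >= threshold and scores.judge_b >= threshold:
        return "keep"
    return "human_review"


if __name__ == "__main__":
    rate = discard_rate(ARABICMMLU_DISCARDED, ARABICMMLU_TOTAL)
    print(f"ArabicMMLU discard rate: {rate:.1f}%")  # prints 3.1%
    print(triage(JudgeScores("example-1", judge_a=9, judge_b=8)))
    print(triage(JudgeScores("example-2", judge_a=9, judge_b=3)))
```

Routing every disagreement to human review is the conservative choice in this sketch: it trades reviewer time for a lower chance that a flawed sample passes on the strength of a single judge.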

Implications

Model release cadence (regional models). QIMMA’s finding that 3.1% of ArabicMMLU samples had to be discarded as flawed is a calibration warning for anyone reporting Arabic benchmark scores: models tuned to beat these benchmarks are partly optimizing against incorrect gold answers. The weak correlation between model size and Arabic performance corroborates what BenCzechMark found for Czech, namely that multilingual capability doesn’t scale smoothly with English benchmark rankings.

Open-weights ecosystem health. Arabic-specialized models outperform multilingual giants on cultural and linguistic tasks, while multilingual models lead on coding. That task-specific capability split should inform model selection for Arabic NLP applications. By releasing the validation code, per-sample model outputs, and paper together, TII makes this a reproducible research contribution, not just a leaderboard.
