2025-02-10 · HuggingFace

The Open Arabic LLM Leaderboard 2

models · research

Source: HuggingFace · Date: 2025-02-10 · URL: https://huggingface.co/blog/leaderboard-arabic-v2

Summary

Leaderboard update announcing OALL v2, the second generation of the Open Arabic LLM Leaderboard. The revision removes saturated and machine-translated benchmarks, adds native Arabic evaluations (Arabic MMLU, AraTrust, and ALRAGE, the last scored with an LLM as judge), and fixes an AlGhafa evaluation bug that was causing 20-point score swings. Top performers: Qwen2.5-72B among pretrained models and Llama-3.3-70B-Instruct among chat models.

Implications

Thread: open-weights ecosystem health / HF as open-source ML hub. With 700+ model submissions from 180+ organizations, the leaderboard signals that Arabic NLP has become a real competitive arena rather than a checkbox. The shift to native Arabic evaluation and LLM-as-judge scoring (Qwen2.5-72B) over MCQ-only scoring reflects a broader trend: leaderboards moving from saturated tasks toward measuring generative quality. The ALRAGE RAG evaluation is particularly relevant: Arabic-language RAG is under-researched, and this gives the community a common basis for comparison.
