2025-08-01 · HuggingFace

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

Source: HuggingFace · Date: 2025-08-01 · URL: https://huggingface.co/blog/tiiuae/3lm-benchmark

Summary

Research summary and benchmark release: 3LM, from TII (Technology Innovation Institute, UAE), is presented as the first Arabic benchmark for STEM and code. It has three components: 865 native STEM MCQs drawn from grades 8–12 Arabic textbooks (physics, chemistry, biology, math, geography), 1,744 synthetic STEM MCQs generated with YourBench, and Arabic translations of HumanEval+ and MBPP+. Top STEM performer: Qwen2.5-72B-Instruct at 71.8% native / 67.0% synthetic. Top code result: GPT-4o at 83.5% on HumanEval-ar. Correlation between Arabic and English code pass@1: ~0.97.
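For readers unfamiliar with the metric, the sketch below shows one common way pass@1 is computed: the unbiased pass@k estimator popularized alongside HumanEval, averaged over problems. This is an assumption about convention, since the summary does not say how 3LM computes its scores. The function names and the per-problem counts are hypothetical and purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number of samples that pass the tests.
    """
    if n - c < k:
        # Fewer than k failing samples, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes (samples generated, samples passing).
# These are NOT 3LM results.
example = [(10, 10), (10, 5), (10, 0), (10, 2)]
score = benchmark_pass_at_k(example, k=1)
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, so the benchmark score is the mean of those fractions.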

Implications

Thread: open-weights ecosystem health. The ~0.97 Arabic/English code correlation is the most actionable finding: Arabic code capability tracks English capability almost perfectly, which suggests that gaps in code generation quality are driven by base model multilingual coverage rather than by task-specific capability. The STEM gap is harder to close: Arabic educational material requires OCR and LaTeX math parsing just to create the dataset. The native STEM MCQ approach (real grades 8–12 curriculum questions) is more ecologically valid than translated benchmarks. Together, these results give Arabic-language model developers a clear signal on where their models are weakest relative to frontier models.
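One caveat when reading the correlation figure: a high correlation means model scores move together across the two languages, not that the scores are equal. The minimal sketch below computes a Pearson correlation over per-model pass@1 scores. Whether 3LM reports Pearson or another coefficient is not stated in the summary, so that choice is an assumption, and the scores are placeholder values for hypothetical models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder pass@1 scores for four hypothetical models. NOT 3LM results.
english = [0.30, 0.50, 0.70, 0.90]
arabic = [0.20, 0.40, 0.60, 0.80]  # each model is 10 points lower in Arabic

r = pearson(english, arabic)
mean_gap = sum(e - a for e, a in zip(english, arabic)) / len(english)
```

In this toy case the correlation is essentially 1 even though every model loses 10 points in Arabic, so the absolute per-model gap is worth checking alongside the coefficient.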
