2025-01-09 · HuggingFace

CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard

Source: HuggingFace · Date: 2025-01-09 · URL: https://huggingface.co/blog/leaderboard-emissions-analysis

Summary

Research summary: an analysis of the CO₂ emitted while evaluating more than 2,742 language models on the Open LLM Leaderboard since June 2024. Key findings: emissions grow with model size, but leaderboard scores improve with diminishing returns; mixture-of-experts (MoE) models show poor score-to-emission ratios because of long inference times; and community fine-tuned models are more CO₂-efficient than the official base models they derive from. Among official models, Qwen-2.5-14B and Phi-3-Medium have the best efficiency. Community models under 10B parameters can reach a leaderboard score of 35 while emitting under 5 kg of CO₂ per evaluation run.
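To make the score-to-emission ratio concrete, here is a minimal sketch that assumes the ratio is simply the leaderboard score divided by the kilograms of CO₂ emitted in one evaluation run; the source post may define it differently. The names `EvalRun`, `score_per_kg`, and `rank_by_efficiency` are hypothetical. The first data point is the threshold quoted above (score 35 at 5 kg); the second is made up purely for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One leaderboard evaluation run (field names are illustrative)."""
    name: str
    score: float   # average leaderboard score
    co2_kg: float  # estimated CO2 emitted by the evaluation run, in kg


def score_per_kg(run: EvalRun) -> float:
    """Leaderboard points per kg of CO2 (assumed definition); higher is better."""
    if run.co2_kg <= 0:
        raise ValueError("co2_kg must be positive")
    return run.score / run.co2_kg


def rank_by_efficiency(runs: list[EvalRun]) -> list[EvalRun]:
    """Sort runs from most to least efficient under the assumed ratio."""
    return sorted(runs, key=score_per_kg, reverse=True)


# Threshold case from the summary: a score of 35 at 5 kg of CO2.
threshold = EvalRun("sub-10B community model (summary threshold)", 35.0, 5.0)
# Hypothetical contrast: the same score, but a run that emits four times as much.
slow = EvalRun("hypothetical slow-inference model", 35.0, 20.0)

for run in rank_by_efficiency([slow, threshold]):
    print(f"{run.name}: {score_per_kg(run):.2f} points per kg CO2")
```

Under this definition, a run that scores 35 while emitting under 5 kg comes out above 7 points per kg, and a model with the same score but slower inference ranks lower solely because its evaluation emits more.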

Implications

Open-weights ecosystem health. The finding that community fine-tunes are more CO₂-efficient than the official models they’re based on is striking: it suggests that the fine-tuning and merging community is improving efficiency as a side effect, without targeting it directly. This is one data point for the argument that the open-weights ecosystem produces better cost-per-quality solutions than vendor-shipped models at the same scale.

Model release cadence. MoE models have poor score-to-emission ratios despite competitive benchmark scores, which is a practical argument against MoE dominance in inference workloads. If emissions accounting becomes more common in enterprise procurement, this could slow MoE adoption in production relative to dense models of comparable quality.
