2025-10-16 · HuggingFace

Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face

models · infrastructure

Source: HuggingFace · Date: 2025-10-16 · URL: https://huggingface.co/blog/gpt-oss-on-intel-xeon

Summary

Benchmark and performance analysis: Intel, Google Cloud, and Hugging Face benchmark GPT OSS (120B MoE) on Google Cloud C4 VMs (Intel Xeon 6, Granite Rapids) against C3 VMs (4th Gen Xeon). Key optimization: an expert-execution change merged into Transformers (PR #40304) that runs each expert only on the tokens routed to it, eliminating redundant computation. Results: 1.4-1.7x throughput per vCPU across batch sizes 1-64, and a 1.7x TCO improvement on C4 over C3. Tested in BF16 with 1024 input and 1024 output tokens. The headline “70% TCO improvement” is the same 1.7x figure expressed as a percentage; it reflects the C4-over-C3 hardware generation gain, not a software optimization.
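The relationship between the "1.7x" and "70%" figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the TCO figure is a throughput-per-dollar ratio; the function names are hypothetical and not taken from the source.

```python
def percent_improvement(ratio: float) -> float:
    """Express an 'N times better' ratio as a percent gain (1.7x -> about 70%)."""
    return (ratio - 1.0) * 100.0


def cost_per_token_reduction(ratio: float) -> float:
    """Percent drop in cost per token, assuming `ratio` is throughput per dollar."""
    return (1.0 - 1.0 / ratio) * 100.0


tco_ratio = 1.7  # C4 vs C3 figure quoted in the summary

print(f"{percent_improvement(tco_ratio):.0f}% more output per dollar")
print(f"{cost_per_token_reduction(tco_ratio):.0f}% lower cost per token")
```

Under that assumption, a 1.7x ratio means about 70% more tokens for the same spend, which works out to roughly 41% lower cost per token rather than 70% lower cost.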

Implications

Open-weights ecosystem health. The expert routing optimization (Transformers PR #40304) brings the implementation in line with how MoE is supposed to execute while also improving efficiency: each expert processes only the tokens routed to it, rather than every expert processing every token. That this had to be merged in suggests Transformers’ MoE execution path was not yet fully optimized for the GPT OSS model class.
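To make the difference concrete, here is a toy, pure-Python sketch of the two execution strategies. It is not the code from PR #40304: the function names, the single-linear-layer "expert", and the per-token loop are illustrative assumptions, and a real implementation works on batched tensors grouped per expert.

```python
import math
import random


def top_k_routing(logits, top_k):
    """Return {expert_index: gate} for one token: softmax over its top_k logits."""
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def apply_expert(weight_matrix, token):
    """Toy expert: one linear layer (an assumption; real experts are gated MLPs)."""
    return [sum(w * t for w, t in zip(row, token)) for row in weight_matrix]


def moe_all_experts(tokens, router_logits, experts, top_k):
    """Redundant path: every expert runs on every token; unselected ones get gate 0."""
    outputs, calls = [], 0
    for token, logits in zip(tokens, router_logits):
        routing = top_k_routing(logits, top_k)
        out = [0.0] * len(token)
        for e, expert in enumerate(experts):
            y = apply_expert(expert, token)  # computed even when the gate is zero
            calls += 1
            gate = routing.get(e, 0.0)
            out = [o + gate * v for o, v in zip(out, y)]
        outputs.append(out)
    return outputs, calls


def moe_routed(tokens, router_logits, experts, top_k):
    """Routed path: only the experts selected for a token run on that token."""
    outputs, calls = [], 0
    for token, logits in zip(tokens, router_logits):
        out = [0.0] * len(token)
        for e, gate in top_k_routing(logits, top_k).items():
            y = apply_expert(experts[e], token)
            calls += 1
            out = [o + gate * v for o, v in zip(out, y)]
        outputs.append(out)
    return outputs, calls


random.seed(0)
n_tokens, dim, n_experts, top_k = 8, 4, 16, 2  # toy sizes, not GPT OSS's
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
router_logits = [[random.gauss(0, 1) for _ in range(n_experts)] for _ in range(n_tokens)]
experts = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
           for _ in range(n_experts)]

dense_out, dense_calls = moe_all_experts(tokens, router_logits, experts, top_k)
routed_out, routed_calls = moe_routed(tokens, router_logits, experts, top_k)
print("expert calls:", dense_calls, "vs", routed_calls)
```

Both paths produce the same outputs up to floating-point rounding, but the routed path makes `n_tokens * top_k` expert calls instead of `n_tokens * n_experts`. The gap grows with the number of experts.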

HF as open-source ML hub. When Intel and Google Cloud publish TCO benchmarks for an open-weights model through HF’s blog, the hardware ecosystem is validating open-weights inference economics at scale. A 1.7x TCO improvement from a hardware generation change alone, with no model changes, is a compelling argument for teams running GPT OSS on CPU in production to evaluate C4 instances.
