2025-07-09 · HuggingFace

Creating custom kernels for the AMD MI300

models · infrastructure

Source: HuggingFace · Date: 2025-07-09 · URL: https://huggingface.co/blog/mi300kernels

Summary

Technical optimization post plus library release. HF and AMD co-developed three custom ROCm kernels for Llama 3.1 405B inference in FP8 on MI300X: a fused RMS Norm kernel (11.2x faster than the PyTorch baseline, 25-31% faster than vLLM), a SwiGLU kernel (on average 14x faster than PyTorch, up to 100% faster than vLLM at batch size 1), and a Skinny GEMM kernel for low-batch decoding (up to 141% faster than vLLM). Combined, the kernels cut end-to-end latency for vLLM serving on 8×MI300X by about 40%. The kernels are open-sourced as hf-rocm-kernels, with benchmarking scripts.
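For readers unfamiliar with the operations behind the kernel names, here is a minimal pure-Python sketch of the reference math for RMS Norm, SwiGLU, and a "skinny" matrix multiply (few input rows, as in low-batch decoding). It is illustrative only and is not the hf-rocm-kernels API. The function names, the `eps` default, and the list-based layout are assumptions, and the sketch leaves out FP8 conversion and whatever extra operations the real kernels fuse.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMS norm of one row: x_i * w_i / sqrt(mean(x^2) + eps).
    eps=1e-6 is an assumed default, not taken from the post."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish): v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0.0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)

def swiglu(gate, up):
    """Reference SwiGLU: silu(gate_i) * up_i, elementwise."""
    return [silu(g) * u for g, u in zip(gate, up)]

def skinny_gemm(a, b):
    """Naive matmul a (m x k) @ b (k x n); 'skinny' means m is small
    (roughly one row per sequence being decoded)."""
    k, n = len(b), len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(k)) for j in range(n)]
            for row in a]

# Usage: one decoded token (m = 1) through norm, projection, and activation.
hidden = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
proj = skinny_gemm([hidden], [[1.0, 0.0, 2.0, 0.0],
                              [0.0, 1.0, 0.0, 2.0]])[0]
activated = swiglu(proj[:2], proj[2:])  # first half gates the second half
```

A GPU kernel computes the same values. The speedups come from how memory is read and written and from fusing adjacent steps into one launch, which a scalar sketch like this cannot show.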

Implications

Open-weights ecosystem health. A roughly 40% latency reduction for 405B inference on MI300X is a significant result: it strengthens the case for AMD hardware as a viable alternative to H100 for frontier-scale open-weights serving, not just a cheaper-but-slower option. If the vLLM community adopts this kernel library, the AMD cost advantage shifts from theoretical to practical.

HF as open-source ML hub. By publishing deep AMD kernel optimization work, HF reinforces its role as an infrastructure partner for the open-weights ecosystem, not just a model repository. The educational depth of the post (coalesced loads, warp specialization, sparse tensor core tricks) makes it a reference document for the growing community of custom kernel developers on AMD hardware.
