2025-03-28 · HuggingFace

🚀 Accelerating LLM Inference with TGI on Intel Gaudi

models · infrastructure

Source: HuggingFace · Date: 2025-03-28 · URL: https://huggingface.co/blog/intel-gaudi-backend-for-tgi

Summary

Library update: Intel Gaudi support has been merged into TGI (Text Generation Inference) mainline (PR #3091), replacing the separate tgi-gaudi fork. Supported hardware: Gaudi1 (AWS DL1), Gaudi2 (Intel Tiber, Denvr), and Gaudi3 (Intel Tiber, IBM Cloud, Dell, HP, Supermicro). Features: dynamic batching, streaming, multi-card sharding, and FP8 quantization. Optimized models include Llama, Mistral, Qwen2, and others.
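
Because Gaudi is now a backend of mainline TGI rather than a fork, client code should not need to change with the hardware. As a minimal sketch, the snippet below builds and sends a request to TGI's standard `/generate` endpoint using only the Python standard library. The server address `http://localhost:8080` and the helper names `build_generate_request` and `generate` are assumptions for illustration, not taken from the source post; it also assumes a TGI server is already running at that address.

```python
import json
import urllib.request

# Hypothetical address: assumes a TGI server is already listening here.
TGI_URL = "http://localhost:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{TGI_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Send the request and return the text generated by the server."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    # /generate responds with a JSON object containing "generated_text".
    return body["generated_text"]
```

With a server running, `generate("What is deep learning?")` would return the completion as a string. Token-by-token streaming uses the separate `/generate_stream` endpoint, which this sketch does not cover.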

Implications

Thread: transformers library trajectory / open-weights ecosystem health. Gaudi landing in TGI mainline rather than in a fork matters because Intel Gaudi becomes a first-class TGI deployment target that ships with each TGI release instead of trailing behind it. The multi-backend TGI architecture that makes this possible also reduces the maintenance burden of supporting non-NVIDIA hardware in general. With Gaudi3 now available on IBM Cloud, Dell, and HP, and not only on Intel's own cloud, the distribution footprint is meaningful. For specific workloads, cost-competitive Gaudi inference is a credible alternative for enterprises seeking independence from NVIDIA.
