Accelerating LLM Inference with TGI on Intel Gaudi
Source: Hugging Face
Date: 2025-03-28
URL: https://huggingface.co/blog/intel-gaudi-backend-for-tgi
Summary
Library update: Intel Gaudi support has been merged into TGI mainline (PR #3091), replacing the separate tgi-gaudi fork. Supported hardware: Gaudi1 (AWS DL1), Gaudi2 (Intel Tiber AI Cloud, Denvr Dataworks), and Gaudi3 (Intel Tiber AI Cloud, IBM Cloud, and OEMs such as Dell, HP, and Supermicro). Features: dynamic batching, streaming, multi-card sharding, FP8 quantization. Optimized for Llama, Mistral, Qwen2, and others.
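Because the Gaudi backend sits behind TGI's regular HTTP API, client code should not need to change between hardware targets. As a minimal sketch, assuming a TGI server is already listening at the hypothetical address `http://127.0.0.1:8080`, a request to TGI's `/generate` endpoint could be built like this. The helper names `build_generate_request` and `generate` are illustrative and not part of TGI.

```python
import json
import urllib.request

# Hypothetical address; adjust to wherever the TGI container is listening.
TGI_URL = "http://127.0.0.1:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 32,
                           base_url: str = TGI_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["generated_text"]
```

Once a server is up, `print(generate("What is Deep Learning?"))` would return the completion. Keeping request construction separate from sending makes the payload easy to inspect without any hardware attached.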
Implications
Thread: transformers library trajectory / open-weights ecosystem health. Gaudi landing in TGI mainline rather than in a fork is an important step: Intel Gaudi becomes a first-class TGI deployment target that tracks TGI releases instead of lagging behind them in a separate codebase. The multi-backend TGI architecture that makes this possible also lowers the maintenance burden of supporting non-NVIDIA hardware in general. With Gaudi3 now available on IBM Cloud and from Dell and HP, not just on Intel's own cloud, the hardware is reachable through several independent vendors. For specific workloads, cost-competitive Gaudi inference is a credible alternative for enterprises that want to reduce their dependence on NVIDIA.