Serverless Inference with Hugging Face and NVIDIA NIM
read at source ↗ huggingface.co
Serverless Inference with Hugging Face and NVIDIA NIM
Source: HuggingFace Date: 2024-07-29 URL: https://huggingface.co/blog/inference-dgx-cloud
Summary
Integration announcement (now deprecated): HuggingFace launched serverless inference via NVIDIA NIM API on DGX Cloud H100s for Enterprise Hub organizations. Pay-as-you-go at $8.25/hour per H100; OpenAI-compatible API. Benchmark pricing: Llama 3 8B at $0.0023/request (1 H100, 1 sec), Llama 3 70B at $0.0184/request (4 H100s, 2 sec), Llama 3.1 405B-FP8 at $0.0917/request (8 H100s, 5 sec). Service was deprecated April 10, 2025.
Implications
HF as open-source ML hub. Even as a deprecated service, this announcement is a data point: HF tested the enterprise inference market directly via NVIDIA partnership before pulling back. The deprecation likely reflects either insufficient Enterprise Hub demand at those price points or a strategic pivot to the broader Inference Providers model (Groq, Together, etc.) rather than operating infrastructure directly.
Open-weights ecosystem health. The pricing benchmarks (Llama 405B at $0.09/request) established a reference point for the cost of frontier-tier open-weights inference on H100 hardware. That baseline is useful for comparing against subsequent inference provider pricing as the market has evolved.