2025-05-13 · HuggingFace

Blazingly fast whisper transcriptions with Inference Endpoints

protocols · infrastructure

Source: HuggingFace · Date: 2025-05-13 · URL: https://huggingface.co/blog/fast-whisper-endpoints

Summary

Service release: optimized Whisper transcription endpoints on HF Inference Endpoints, built on vLLM and NVIDIA Ada Lovelace GPUs (L4, L40S) with PyTorch compilation, CUDA graphs, and float8 KV cache quantization. The reported gain is roughly 8x in real-time factor over the Transformers baseline for Whisper Large V3, Large V3-Turbo, and Distil-Whisper Large V3.5 on L4. WER stays at parity across 8 standard ASR benchmarks (LibriSpeech, Earnings22, etc.). The endpoint exposes an OpenAI-compatible API.
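To make the real-time-factor claim concrete, here is a minimal sketch of the arithmetic. It assumes the inverse real-time factor (often written RTFx: seconds of audio processed per second of compute, higher is better). The timings and the GPU price are hypothetical placeholders chosen only to produce an 8x ratio; they are not measurements from the source.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def cost_per_audio_hour(gpu_price_per_hour: float, rtfx_value: float) -> float:
    """GPU cost to transcribe one hour of audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return gpu_price_per_hour / rtfx_value


# Hypothetical timings for one hour (3600 s) of audio.
baseline = rtfx(3600.0, 120.0)    # baseline takes 120 s
optimized = rtfx(3600.0, 15.0)    # optimized stack takes 15 s
speedup = optimized / baseline

# Hypothetical GPU price of 1.0 currency unit per hour.
baseline_cost = cost_per_audio_hour(1.0, baseline)
optimized_cost = cost_per_audio_hour(1.0, optimized)
```

On the same GPU at the same hourly price, an 8x higher RTFx divides the cost per audio hour by 8, which is the link between the speedup and the cost argument.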

Implications

HF as open-source ML hub. An 8x gain in Whisper throughput via vLLM on HF Inference Endpoints lowers the cost per audio hour, which makes production-scale transcription more competitive with dedicated ASR APIs. Because the endpoint is OpenAI-compatible, applications built on the OpenAI Whisper API should be able to switch by pointing their client at the HF endpoint URL and token, with little or no change to application code.
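A sketch of what switching could look like with the official `openai` Python package, which accepts a custom `base_url`. The `/api/v1` route prefix, the model name, and the endpoint URL are assumptions made for illustration; check the deployed endpoint's own documentation for the actual values.

```python
from pathlib import Path


def client_kwargs(endpoint_url: str, hf_token: str, api_prefix: str = "/api/v1") -> dict:
    """Build constructor arguments for an OpenAI-style client.

    Assumption: the endpoint serves its OpenAI-compatible routes under
    `api_prefix`; the default here is a guess, not a documented value.
    """
    return {
        "base_url": endpoint_url.rstrip("/") + api_prefix,
        "api_key": hf_token,
    }


def transcribe(audio_path: str, endpoint_url: str, hf_token: str) -> str:
    """Send one audio file to the endpoint and return the transcript text."""
    # Imported lazily so client_kwargs() works without the `openai` package.
    from openai import OpenAI

    client = OpenAI(**client_kwargs(endpoint_url, hf_token))
    with Path(audio_path).open("rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",  # hypothetical; use the deployed model's name
            file=audio_file,
        )
    return result.text
```

Only the client construction differs from a stock OpenAI setup; the `audio.transcriptions.create` call is the same one an existing application would already be making.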

Open-weights ecosystem health. vLLM serving as the inference backend for ASR, not just LLM generation, reflects its growing role as a general-purpose high-performance inference engine beyond its original LLM focus. float8 KV cache quantization for Whisper is a novel optimization path: the encoder-decoder architecture benefits from the same quantization techniques as decoder-only models when the attention cache is the memory bottleneck.
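The memory effect of a float8 KV cache can be sketched with the standard cache-size formula: two tensors (keys and values) per layer, each of shape heads x head_dim per cached token. The dimensions below (32 layers, 20 heads, head size 64, 448 cached tokens) are assumed values in the range of a large Whisper decoder, used only for illustration; they are not taken from the source.

```python
def kv_cache_bytes(
    layers: int,
    heads: int,
    head_dim: int,
    tokens: int,
    bytes_per_element: int,
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    elements = 2 * layers * heads * head_dim * tokens  # factor 2: K and V
    return elements * bytes_per_element


# Assumed decoder dimensions, for illustration only.
dims = dict(layers=32, heads=20, head_dim=64, tokens=448)

fp16_bytes = kv_cache_bytes(**dims, bytes_per_element=2)
fp8_bytes = kv_cache_bytes(**dims, bytes_per_element=1)
savings_ratio = fp16_bytes / fp8_bytes
```

Halving the bytes per element halves the cache per sequence, so roughly twice as many concurrent sequences fit in the same cache budget. That only helps when cache memory, not compute, is the limiting factor.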
