Make your ZeroGPU Spaces go brrr with ahead-of-time compilation
Source: Hugging Face | Date: 2025-09-02 | URL: https://huggingface.co/blog/zerogpu-aoti
Summary
Integration tutorial on PyTorch ahead-of-time (AoT) compilation for Hugging Face ZeroGPU Spaces. It addresses the problem that JIT compilation (torch.compile) is wasted on ZeroGPU: the process is killed between tasks, so the compiled result does not survive to the next one. Pipeline: capture example inputs → export the model → compile once with spaces.aoti_compile → reload instantly in each new process. Reported speedup on FLUX.1-dev: 1.75x with AoTI (AoT compilation via PyTorch's AOTInductor) plus FlashAttention-3, 1.7x with AoTI alone. Regional compilation (compiling a repeated block once and reusing it) cuts compile time from ~6 min for the full model to ~30 sec, with identical speedup. Compiled graphs can be serialized to the Hub.
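The four pipeline steps can be sketched roughly as follows. This is a minimal sketch, not the tutorial's exact code: only spaces.aoti_compile is named in this summary, while spaces.aoti_capture and spaces.aoti_apply are assumed helper names from the same package, and pipe is assumed to be a diffusers-style pipeline exposing a .transformer module. The two function names are hypothetical.

```python
def compile_transformer(pipe, example_prompt="arbitrary example prompt"):
    """Capture inputs, export, and AoT-compile pipe.transformer (sketch).

    On a real Space this would typically run inside a function decorated
    with @spaces.GPU, because compilation needs the actual GPU.
    """
    # Imports are deferred so this file can be imported on a machine
    # that does not have the GPU stack installed.
    import spaces  # assumption: provides aoti_capture / aoti_compile
    import torch

    # 1. Capture example inputs by running the pipeline once.
    #    (aoti_capture is an assumed helper name, see lead-in.)
    with spaces.aoti_capture(pipe.transformer) as call:
        pipe(example_prompt)

    # 2. Export the module to a graph using the captured arguments.
    exported = torch.export.export(
        pipe.transformer,
        args=call.args,
        kwargs=call.kwargs,
    )

    # 3. Compile the exported program ahead of time, once.
    return spaces.aoti_compile(exported)


def apply_compiled(pipe, compiled_transformer):
    """Swap the compiled artifact into the pipeline (sketch)."""
    import spaces  # assumption: provides aoti_apply

    # 4. Patch the transformer so later calls use the compiled graph.
    #    (aoti_apply is an assumed helper name, see lead-in.)
    spaces.aoti_apply(compiled_transformer, pipe.transformer)
```

The point of the split is that step 3 is paid once, while step 4 is cheap enough to repeat in every fresh process that ZeroGPU spawns.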
Implications
HF as open-source ML hub. ZeroGPU’s ephemeral GPU allocation model is the right architecture for shared compute, but it punishes JIT compilation: each task starts in a fresh process, so the compile cost is paid again every time. AoT compilation removes that penalty and makes ZeroGPU Spaces competitive with persistent inference endpoints for latency-sensitive demos. Storing compiled graphs on the Hub and reloading them is the workflow that makes ZeroGPU viable for production-quality demos at scale.
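The cost argument can be made concrete with a toy model. This is only a sketch under stated assumptions: the 1.7x speedup and the ~30 sec and ~6 min compile times come from the summary above, while the 10-second eager latency, the zero reload cost, and the assumption that JIT recompiles from scratch in every process (no cache reuse) are hypothetical simplifications.

```python
import math


def eager_total(n_tasks, base_s):
    """Total seconds for n_tasks with no compilation at all."""
    return n_tasks * base_s


def jit_total(n_tasks, base_s, speedup, compile_s):
    """JIT in ephemeral processes: every task pays the compile cost again.

    Pessimistic assumption: nothing is cached between processes.
    """
    return n_tasks * (compile_s + base_s / speedup)


def aot_total(n_tasks, base_s, speedup, compile_s, load_s=0.0):
    """AoT: compile once, then each task only reloads the artifact."""
    return compile_s + n_tasks * (load_s + base_s / speedup)


def break_even_tasks(base_s, speedup, compile_s):
    """Smallest task count at which AoT total time is at most eager's.

    Assumes a zero reload cost per task.
    """
    saved_per_task = base_s - base_s / speedup
    return math.ceil(compile_s / saved_per_task)


if __name__ == "__main__":
    base = 10.0  # hypothetical eager latency per task, in seconds
    for label, compile_s in [("regional", 30.0), ("full", 360.0)]:
        n = break_even_tasks(base, 1.7, compile_s)
        print(f"{label} compile pays for itself after {n} tasks")
```

Under these assumptions the one-off AoT compile is recovered after a handful of tasks with regional compilation, whereas per-process JIT never recovers it because the cost recurs with every task.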
Open-weights ecosystem health. A 1.75x speedup on FLUX.1-dev with no model changes, only a different compilation strategy, is a meaningful operator-level optimization that requires no ML expertise. Together with the FlashAttention-3 integration path and FP8 quantization support, this amounts to a near-complete inference optimization stack for image generation at ZeroGPU compute costs.