Memory-efficient Diffusion Transformers with Quanto and Diffusers
Source: HuggingFace
Date: 2024-07-30
URL: https://huggingface.co/blog/quanto-diffusers
Summary
Integration tutorial for Quanto + Diffusers quantization on transformer-based diffusion models (PixArt-Sigma, Stable Diffusion 3). Key benchmarks on PixArt-Sigma: quantizing the transformer + text encoder to FP8 reduces memory from 12.09GB to 5.36GB (-56%) at minimal latency cost; INT4 reaches 3.06GB (-75%) but at a 7.6x latency increase. SD3: quantizing the first + third text encoders to FP8 cuts memory from 16.4GB to 8.2GB (-50%). Practical caveats: don’t quantize the VAE decoder (numerical instability); exclude the final projection layer from INT4 quantization (quality degradation).
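The workflow summarized above can be sketched roughly as follows. This is a minimal illustration, not the post's exact code: the helper name `load_quantized_pixart` is hypothetical, and the checkpoint ID, the float16 dtype, the `cuda` device, and `proj_out` as the name of the final projection layer are all assumptions. It relies on `quantize`, `freeze`, `qfloat8`, and `qint4` from `optimum.quanto`.

```python
def load_quantized_pixart(weights="fp8", quantize_text_encoder=True):
    """Load PixArt-Sigma and quantize its transformer (and optionally text encoder)."""
    if weights not in ("fp8", "int4"):
        raise ValueError(f"unsupported weights setting: {weights!r}")

    # Heavy dependencies are imported lazily so this file imports without them.
    import torch
    from diffusers import PixArtSigmaPipeline
    from optimum.quanto import freeze, qfloat8, qint4, quantize

    # Assumed checkpoint ID, dtype, and device.
    pipeline = PixArtSigmaPipeline.from_pretrained(
        "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
    ).to("cuda")

    if weights == "fp8":
        quantize(pipeline.transformer, weights=qfloat8)
    else:
        # INT4: skip the final projection layer (assumed to be named "proj_out")
        # to avoid the quality degradation noted in the summary.
        quantize(pipeline.transformer, weights=qint4, exclude="proj_out")
    # freeze() replaces the float weights with their quantized versions.
    freeze(pipeline.transformer)

    if quantize_text_encoder:
        quantize(pipeline.text_encoder, weights=qfloat8)
        freeze(pipeline.text_encoder)

    # pipeline.vae is deliberately left unquantized (numerical instability).
    return pipeline
```

Calling `load_quantized_pixart("int4")` and then invoking the returned pipeline with a prompt would exercise the INT4 path; the memory and latency figures above would need to be re-measured on the target hardware rather than assumed.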
Implications
Thread: transformers library trajectory / open-weights ecosystem health. This is the diffusion-model equivalent of the LLM quantization guides: practical benchmarks that make the memory/quality/latency tradeoffs legible for practitioners. The 50-56% memory reduction at FP8 (75% at INT4) is particularly relevant: it brings PixArt-Sigma within reach of 8GB consumer GPUs and SD3 within reach of 12GB ones. The SD3 “don’t quantize text encoder 2” finding is operational knowledge that isn’t obvious from the library documentation. Now partially superseded by the later Diffusers native quantization backends post, but still valuable for Quanto-specific workflows.
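As a quick sanity check on the percentages and the GPU-fit claim, the arithmetic can be reproduced from the figures quoted in the summary. This is only a sketch: the helper names `reduction_pct` and `fits` are hypothetical, and comparing peak memory directly against card capacity is a simplifying assumption that ignores driver and framework overhead.

```python
# Memory figures (GB) quoted in the summary: (baseline, quantized).
BENCHMARKS = {
    "PixArt-Sigma FP8 (transformer + text encoder)": (12.09, 5.36),
    "PixArt-Sigma INT4": (12.09, 3.06),
    "SD3 FP8 (text encoders 1 and 3)": (16.4, 8.2),
}


def reduction_pct(baseline_gb, quantized_gb):
    """Percentage of memory saved relative to the unquantized baseline."""
    return 100.0 * (baseline_gb - quantized_gb) / baseline_gb


def fits(quantized_gb, budget_gb):
    """Naive fit check: quoted memory against raw card capacity (no overhead margin)."""
    return quantized_gb <= budget_gb


for name, (base, quant) in BENCHMARKS.items():
    print(
        f"{name}: -{reduction_pct(base, quant):.0f}%, "
        f"fits 8GB: {fits(quant, 8.0)}, fits 12GB: {fits(quant, 12.0)}"
    )
```

Under this naive check the PixArt-Sigma configurations fit an 8GB budget, while the SD3 configuration at 8.2GB needs the 12GB tier.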