(LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware
Source: HuggingFace Date: 2025-06-19 URL: https://huggingface.co/blog/flux-qlora
Summary
Integration tutorial: QLoRA fine-tuning of FLUX.1-dev (an 11.9B-parameter image generation model) on consumer GPUs using Hugging Face Diffusers. Key memory numbers: QLoRA peaks at 9GB of VRAM on an RTX 4090, versus 26GB for standard BF16 LoRA and an estimated 120GB for full fine-tuning. Training times for 700 steps: ~41 minutes on an RTX 4090, ~4 hours on a T4, and ~20 minutes on an H100 with FP8 training, using 36.57GB of VRAM. Only 4.67M LoRA parameters are trained, out of 11.9B total.
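The figures above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate: it assumes the 11.9B count covers the weights that get quantized, uses decimal gigabytes, and ignores activations, optimizer state, and quantization metadata. The helper name weight_gb is introduced here for illustration.

```python
# Back-of-envelope checks on the figures quoted in the summary.
TOTAL_PARAMS = 11.9e9   # total parameters (from the summary)
LORA_PARAMS = 4.67e6    # trainable LoRA parameters (from the summary)


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Share of parameters that actually receive gradients.
trainable_pct = 100 * LORA_PARAMS / TOTAL_PARAMS

# Weight storage at 16 bits (BF16) versus 4 bits per parameter.
bf16_weights_gb = weight_gb(TOTAL_PARAMS, 16)
four_bit_weights_gb = weight_gb(TOTAL_PARAMS, 4)

# Approximate wall-clock time for 700 steps, in minutes (from the summary).
minutes = {"RTX 4090 (QLoRA)": 41, "T4 (QLoRA)": 240, "H100 (FP8)": 20}
t4_slowdown = minutes["T4 (QLoRA)"] / minutes["RTX 4090 (QLoRA)"]

print(f"trainable share of parameters: {trainable_pct:.3f}%")
print(f"BF16 weights:  {bf16_weights_gb:.2f} GB")
print(f"4-bit weights: {four_bit_weights_gb:.2f} GB")
print(f"T4 vs RTX 4090 slowdown: {t4_slowdown:.1f}x")
```

Under these assumptions the weights alone come to roughly 23.8GB in BF16 and roughly 5.95GB at 4 bits, which is at least consistent with the reported 26GB and 9GB peaks once everything else held in memory is added on top.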
Implications
Open-weights ecosystem health. Bringing FLUX.1-dev fine-tuning to a single consumer GPU does for diffusion models what QLoRA did for LLMs in 2023. The technique stack (4-bit quantization + 8-bit AdamW + gradient checkpointing + cached latents) is now standardized enough to fit in a tutorial, and this level of accessibility will drive a wave of style-specific fine-tuned image models.
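A minimal sketch of how the stack named above might be wired together with Diffusers, PEFT, and bitsandbytes. This is not the tutorial's own code: the LoRA rank, alpha, target modules, learning rate, and the NF4/BF16 quantization settings are illustrative assumptions, and the function name build_qlora_transformer is hypothetical. Calling it needs a CUDA GPU and access to the model weights. Latent caching is a data-pipeline step and is only noted in a comment.

```python
# Hedged sketch of the QLoRA stack; nothing heavy runs until
# build_qlora_transformer() is called on a machine with a CUDA GPU.

# 4-bit quantization settings (NF4 is an assumed, commonly used choice).
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}

# Hypothetical LoRA hyperparameters, chosen only for illustration.
LORA_KWARGS = {
    "r": 4,
    "lora_alpha": 4,
    "target_modules": ["to_k", "to_q", "to_v", "to_out.0"],
}


def build_qlora_transformer(model_id: str = "black-forest-labs/FLUX.1-dev",
                            learning_rate: float = 1e-4):
    """Return (transformer, optimizer) prepared for QLoRA training."""
    # Heavy imports stay inside the function so this file imports anywhere.
    import torch
    import bitsandbytes as bnb
    from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
    from peft import LoraConfig

    # 1) Load the denoising transformer with 4-bit weights.
    transformer = FluxTransformer2DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS
        ),
        torch_dtype=torch.bfloat16,
    )

    # 2) Freeze the base model, then attach trainable LoRA adapters.
    transformer.requires_grad_(False)
    transformer.add_adapter(LoraConfig(**LORA_KWARGS))

    # 3) Recompute activations in the backward pass to save memory.
    transformer.enable_gradient_checkpointing()

    # 4) 8-bit AdamW over the LoRA parameters only.
    trainable = [p for p in transformer.parameters() if p.requires_grad]
    optimizer = bnb.optim.AdamW8bit(trainable, lr=learning_rate)

    # Cached latents (not shown): encode the training images with the VAE
    # once, store the results, and train from those instead of raw pixels.
    return transformer, optimizer
```

Keeping the quantization and LoRA settings in plain dictionaries makes it easy to swap in the values a particular training script actually uses.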
Transformers library trajectory. The Diffusers library now supports the whole stack needed to fine-tune major image generation models on consumer hardware. FP8 training via the TorchAO integration is the leading edge: watch for it to become the default recommendation as H100 access expands and the 20-minute training time becomes a practical workflow target.