2026-02-03 · HuggingFace

Training Design for Text-to-Image Models: Lessons from Ablations

Source: HuggingFace
Date: 2026-02-03
URL: https://huggingface.co/blog/Photoroom/prx-part2

Summary

Photoroom’s ablation log for its PRX text-to-image model documents which recent training innovations actually move metrics against a clean Flow Matching baseline, and by how much. The headline findings: latent space quality (tokenizer choice) matters as much as denoiser architecture; storing weights in BF16 causes severe, silent degradation (FID 18.2 → 21.87, where lower is better); and long descriptive captions matter more than most architectural interventions (FID 18.2 with long captions vs. 36.84 with short ones). REPA alignment is a strong early-training accelerant but should be disabled after ~200K steps so that it does not constrain fine-detail learning.
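
To illustrate how BF16 weight storage can hurt without raising any error, here is a minimal sketch. It is a toy illustration under stated assumptions, not Photoroom's training code, and this summary does not state the mechanism behind the FID regression. The sketch shows one well-known effect: bfloat16 keeps only 8 significant bits, so an update much smaller than the weight is rounded away when the result is stored back in BF16. The helper names (`to_fp32`, `to_bf16`, `simulate_updates`) and the toy numbers are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even), returned as a float.

    bfloat16 is the top 16 bits of a float32. NaN and inf are not handled;
    this is an illustration only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add just under half of the dropped range, plus 1 if the kept LSB is odd,
    # then clear the low 16 bits: this implements round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def simulate_updates(steps: int, update: float) -> tuple[float, float]:
    """Apply the same small update `steps` times to a weight that starts at 1.0.

    Returns (weight stored in BF16, weight stored in FP32).
    """
    w_bf16 = 1.0
    w_fp32 = 1.0
    for _ in range(steps):
        w_bf16 = to_bf16(w_bf16 + update)  # result is rounded back to BF16
        w_fp32 = to_fp32(w_fp32 + update)  # result is rounded back to FP32
    return w_bf16, w_fp32


if __name__ == "__main__":
    print(simulate_updates(steps=1000, update=1e-3))
```

With these toy numbers the BF16-stored weight never moves from 1.0, because each 1e-3 step is smaller than half the BF16 spacing near 1.0 (2^-7 ≈ 0.0078), while the FP32 copy ends close to 2.0. Real training involves optimizer state and varying gradient magnitudes, so this only indicates why the failure can be silent.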

Implications

  • Feeds the model training practice thread: caption richness rivals architecture choices in impact, a broadly transferable lesson that applies wherever instruction-following quality is being optimized.
  • The BF16 silent-degradation finding is directly actionable for any team running mixed-precision training without careful benchmarking; it illustrates that “standard practice” defaults can carry hidden costs that only ablations surface.
  • The phased training recipe (synthetic data → real, alignment early then off, targeted SFT last) is a concrete template for teams building specialized generative models, reducing the search space for training design decisions.
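
The "alignment early, then off" step of the recipe can be sketched as a step-gated loss weight. This is a sketch under assumptions: the hard cutoff, the `base_weight` value of 0.5, and the names `repa_weight` and `total_loss` are hypothetical. The summary only indicates that REPA should be disabled after roughly 200K steps.

```python
# "~200K steps" comes from the summary; treating it as an exact hard cutoff
# is an assumption made for this sketch.
REPA_CUTOFF_STEP = 200_000


def repa_weight(step: int,
                cutoff_step: int = REPA_CUTOFF_STEP,
                base_weight: float = 0.5) -> float:
    """Weight of the REPA alignment term at a given training step.

    base_weight=0.5 is a placeholder, not a value reported by the source.
    """
    return base_weight if step < cutoff_step else 0.0


def total_loss(flow_loss: float, repa_loss: float, step: int) -> float:
    """Flow Matching loss plus the step-gated alignment term."""
    return flow_loss + repa_weight(step) * repa_loss
```

In a real training loop the two losses would be tensors rather than floats, and the alignment term would typically not be computed at all once its weight reaches zero.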
