2026-03-03 · HuggingFace

PRX Part 3 — Training a Text-to-Image Model in 24h!

pricing · research

Source: HuggingFace · Date: 2026-03-03 · URL: https://huggingface.co/blog/Photoroom/prx-part3

Summary

Research and code release from Photoroom (PRX series, Part 3): a practical recipe for training a competitive text-to-image model in 24 hours on 32 H200 GPUs at a total cost of ~$1,500. The recipe stacks five techniques: pixel-space training (no VAE), TREAD token routing (50% of tokens skipped), REPA alignment with DINOv3, perceptual losses (LPIPS + DINOv2), and the Muon optimizer. The model was trained on 8.7M images. The full code is released at github.com/Photoroom/PRX.
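To make the "50% token skip" idea concrete, here is a minimal sketch of token routing in plain Python. It is not the PRX or TREAD implementation: the function names (`route_tokens`, `apply_with_routing`), the uniform random selection, and the toy "blocks" are all assumptions made for illustration. The only idea it shows is that a fraction of tokens bypasses a span of blocks and is reinserted unchanged afterwards, so those blocks process fewer tokens.

```python
import random

def route_tokens(num_tokens, skip_ratio=0.5, seed=0):
    """Pick which token positions bypass the middle blocks (hypothetical helper)."""
    rng = random.Random(seed)
    n_skip = int(num_tokens * skip_ratio)
    routed = set(rng.sample(range(num_tokens), n_skip))
    kept = [i for i in range(num_tokens) if i not in routed]
    return kept, sorted(routed)

def apply_with_routing(tokens, middle_blocks, skip_ratio=0.5, seed=0):
    """Run middle_blocks on the kept tokens only, then merge with the bypassed ones."""
    kept_idx, routed_idx = route_tokens(len(tokens), skip_ratio, seed)
    kept = [tokens[i] for i in kept_idx]
    for block in middle_blocks:
        kept = block(kept)  # each block sees only the kept subset
    out = list(tokens)  # routed tokens pass through unchanged
    for i, value in zip(kept_idx, kept):
        out[i] = value
    return out, routed_idx

# Toy usage: "tokens" are floats and each "block" adds 1.0 to every token it sees.
add_one = lambda xs: [x + 1.0 for x in xs]
out, routed = apply_with_routing([0.0] * 8, [add_one, add_one])
print(len(routed), sum(out))
```

In a real diffusion transformer the tokens would be patch embeddings and the blocks attention/MLP layers; the compute saving comes from those layers operating on a shorter sequence during training.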

Implications

Thread: open-weights ecosystem health / model release cadence. The $1,500 / 24-hour threshold is a democratization signal: custom text-to-image training is moving within reach of small teams and well-funded individuals. The pixel-space approach (no VAE) is architecturally interesting because it trades the compression benefits of latent diffusion for a simpler pipeline. The TREAD + REPA combination is worth watching as a training-efficiency pattern that may propagate to other diffusion architectures. The absence of FID/IS benchmarks is a gap in the claims.
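The cost threshold above can be sanity-checked with simple arithmetic. The sketch below uses only the figures quoted in the summary (32 GPUs, 24 hours, ~$1,500) and derives the implied per-GPU-hour rate. The helper names are hypothetical, and the derived rate is an inference from those three numbers, not a price stated by the source.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total GPU-hours consumed by a run."""
    return num_gpus * wall_clock_hours

def implied_hourly_rate(total_cost_usd, num_gpus, wall_clock_hours):
    """Per-GPU-hour price implied by a total cost (derived, not quoted)."""
    return total_cost_usd / gpu_hours(num_gpus, wall_clock_hours)

total = gpu_hours(32, 24)                 # 768 GPU-hours
rate = implied_hourly_rate(1500, 32, 24)  # about $1.95 per GPU-hour
print(f"{total} GPU-hours at about ${rate:.2f} per GPU-hour")
```

Since the source gives the cost as approximate, the implied rate should be read as a rough figure as well.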
