2026-05-18 · HuggingFace

Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

infrastructure

Source: HuggingFace · Date: 2026-05-18 · URL: https://huggingface.co/blog/nvidia/cosmos-fine-tuning-for-robot-video-generation

Summary

NVIDIA’s Cosmos Predict 2.5, a 2B-parameter diffusion world model, can be fine-tuned efficiently for robotic manipulation video generation using LoRA/DoRA adapters that train only ~50M parameters. Training on 92 labeled robot videos for 100 epochs yields substantial improvements in physical plausibility and instruction following, in under 2.5 hours on 8× H100s, without catastrophic forgetting of the base model’s general world knowledge. The adapter files are small and portable, making domain-specific fine-tuning a practical path for teams that cannot train world models from scratch.
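
To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA update and a DoRA-style reparameterisation on a toy weight matrix. It is not the tutorial's training code: the helper names (`matmul`, `lora_delta`, `dora_weight`), the toy shapes, the `alpha` scaling, and the choice to normalise per output row are all illustrative assumptions. Only the ~50M and 2B parameter counts come from the summary above.

```python
import math
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(w, delta):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rw, rd)] for rw, rd in zip(w, delta)]

def row_norms(w):
    """L2 norm of each row (one value per output feature)."""
    return [math.sqrt(sum(v * v for v in row)) for row in w]

def lora_delta(A, B, alpha, rank):
    """Low-rank update (alpha / rank) * B @ A, with B: d_out x r and A: r x d_in."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]

def dora_weight(w0, A, B, magnitude, alpha, rank):
    """DoRA-style weight: normalise each row of (W0 + delta), then rescale
    it by a learned magnitude. Per-row normalisation is an assumed convention."""
    merged = add(w0, lora_delta(A, B, alpha, rank))
    norms = row_norms(merged)
    return [[m * v / n for v in row] for row, m, n in zip(merged, magnitude, norms)]

# Toy sizes, chosen only for illustration.
D_IN, D_OUT, RANK, ALPHA = 6, 4, 2, 2.0

W0 = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(D_IN)] for _ in range(RANK)]  # trainable
B = [[0.0] * RANK for _ in range(D_OUT)]  # trainable, zero-initialised
magnitude = row_norms(W0)  # trainable in DoRA, initialised from the base weight

W_lora = add(W0, lora_delta(A, B, ALPHA, RANK))
W_dora = dora_weight(W0, A, B, magnitude, ALPHA, RANK)

# Trainable parameters for this single toy layer.
lora_trainable = RANK * (D_IN + D_OUT)
dora_trainable = lora_trainable + D_OUT  # plus one magnitude per output row

# Fraction of the model that is trained, using the figures from the summary.
trainable_fraction = 50e6 / 2e9
```

Because `B` starts at zero, both `W_lora` and `W_dora` reproduce the frozen base weight before any training step. That initialisation is the usual reason adapters begin from the base model's behaviour rather than from noise. With the summary's figures, the trained share of the model works out to 2.5%.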

Implications

Synthetic data flywheel for robotics. Cheap fine-tuned world models lower the cost of generating robot training data at scale, accelerating the simulation-to-real transfer loop that physical AI labs depend on. Teams that can iterate quickly on synthetic data pipelines gain a compounding advantage over those waiting on real-world collection.
Open-weight ecosystem. Cosmos Predict 2.5 is available on Hugging Face, and this tutorial ships with a fully reproducible recipe. The combination signals that world-model fine-tuning is moving from research lab to practitioner toolkit — the same trajectory embedding models took in 2022–23.
Compute accessibility. Running 100 epochs on a single H100 takes 17 hours, which is within reach of startups and academic labs. The rank-8 vs rank-32 analysis gives practitioners a concrete cost/quality trade-off dial rather than a single “best” recipe.
Agent grounding thread. As agentic systems expand beyond text into physical environments, the ability to generate plausible action sequences via video prediction becomes a substrate for planning. This is an early signal that video world models are maturing from demo to infrastructure.
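
The rank and compute figures under "Compute accessibility" can be put in rough numbers. The sketch below counts LoRA parameters at rank 8 and rank 32 and compares GPU-hours for the two hardware setups mentioned in this note. The layer shapes are hypothetical placeholders and do not describe the real Cosmos Predict 2.5 architecture. The function names are illustrative too. Only the 17-hour and 2.5-hour figures come from the text above.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return rank * (d_in + d_out)

def adapter_params(layer_shapes, rank):
    """Total LoRA parameters across a list of (d_in, d_out) layer shapes."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)

# Hypothetical block: four square projections plus an MLP up/down pair.
# These shapes are assumptions for illustration, not the real model's.
layer_shapes = [(2048, 2048)] * 4 + [(2048, 8192), (8192, 2048)]

params_rank8 = adapter_params(layer_shapes, 8)
params_rank32 = adapter_params(layer_shapes, 32)
rank_ratio = params_rank32 / params_rank8  # parameter count grows linearly with rank

# GPU-hours for 100 epochs, from the figures quoted in this note.
gpu_hours_single = 1 * 17       # one H100 for 17 hours
gpu_hours_node_max = 8 * 2.5    # eight H100s for "under 2.5 hours", so an upper bound
```

Under these assumptions, moving from rank 8 to rank 32 quadruples the adapter's parameter count for the same set of target layers, so the choice of rank sets adapter size and memory directly. The two hardware setups land in a similar GPU-hour range (17 versus at most 20), which suggests the multi-GPU run mainly buys shorter wall-clock time.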
