2024-08-27 · HuggingFace

Scaling robotics datasets with video encoding

research

Source: HuggingFace · Date: 2024-08-27 · URL: https://huggingface.co/blog/video-encoding

Summary

The Hugging Face LeRobot project switched from storing robotics observation frames as individual PNGs to video-encoded formats (defaulting to AV1). Video codecs exploit temporal redundancy: they store periodic keyframes in full and encode the frames in between mostly as differences from neighboring frames. The practical outcome: datasets shrink to roughly 14% of their original size on average, and some simulated environments reach 0.2%. In one cited example, a 72.5GB dataset compresses to 2.9GB, which is about 4% of its original size. Sequential frame loading is 25–50% faster than with PNG equivalents. Critically, policy models trained on the compressed datasets match the performance of those trained on the originals: the encoding is lossy, but the loss does not show up in downstream policy performance.
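
The size figures above mix two ways of stating compression: a fraction of the original size and a reduction factor. A minimal sketch of the conversion, using only the numbers quoted in this summary (the helper names are hypothetical, chosen for illustration):

```python
def size_fraction(original_gb: float, compressed_gb: float) -> float:
    """Compressed size as a fraction of the original size."""
    return compressed_gb / original_gb


def reduction_factor(fraction: float) -> float:
    """How many times smaller the compressed data is (1 / fraction)."""
    return 1.0 / fraction


# The 72.5GB -> 2.9GB example quoted in the summary.
example = size_fraction(72.5, 2.9)
print(f"72.5GB -> 2.9GB is {example:.1%} of the original size")

# The average (14%) and best-case simulated (0.2%) fractions from the summary.
for label, fraction in [("average", 0.14), ("best simulated case", 0.002)]:
    print(f"{label}: {fraction:.1%} of original, about {reduction_factor(fraction):.0f}x smaller")
```

A dataset at 14% of its original size is about 7x smaller, not 14x, so the fraction and the factor should not be read interchangeably.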

Implications

Feeds the context management divergence thread at the data-infrastructure layer: the same problem that TurboQuant addresses for KV cache (redundant storage of similar state) is solved here for robotics visual data using video codec techniques. Temporal compression as a general principle is appearing across multiple layers of the stack simultaneously.
Relevant to scaling open-weight robotics models: dataset size was a practical barrier to assembling large-scale robotics training sets analogous to text corpora. Shrinking datasets to roughly 14% of their original size (about a 7x reduction) makes that tractable on the same infrastructure already used for LLMs.
The AV1/H.265/H.264 codec comparison work is immediately applicable to any pipeline storing sequential visual data—not just robotics. Screen recording, browser automation traces, and agent action replays face the same redundancy problem.
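
The redundancy argument running through these points can be illustrated with a toy sketch. It is not how AV1, H.265, or H.264 work internally: real codecs add motion compensation and lossy quantization, while this example is lossless and uses only `zlib` from the Python standard library. The synthetic frames, sizes, and function names are all assumptions made for illustration. It compares compressing every frame independently (PNG-style) against compressing one keyframe plus per-frame differences.

```python
import random
import zlib

WIDTH, HEIGHT, NUM_FRAMES = 64, 64, 30


def make_frames(seed: int = 0) -> list[bytes]:
    """Synthetic grayscale frames: a static noisy background with a small moving square."""
    rng = random.Random(seed)
    background = bytearray(rng.randrange(256) for _ in range(WIDTH * HEIGHT))
    frames = []
    for t in range(NUM_FRAMES):
        frame = bytearray(background)
        # 8x8 bright square that shifts one pixel to the right each frame.
        for y in range(10, 18):
            for x in range(t, t + 8):
                frame[y * WIDTH + x] = 255
        frames.append(bytes(frame))
    return frames


def encode_independent(frames: list[bytes]) -> list[bytes]:
    """Compress each frame on its own, ignoring its neighbors."""
    return [zlib.compress(f) for f in frames]


def encode_delta(frames: list[bytes]) -> list[bytes]:
    """Compress the first frame in full, then only byte-wise differences between frames."""
    chunks = [zlib.compress(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        delta = bytes((c - p) % 256 for p, c in zip(prev, cur))
        chunks.append(zlib.compress(delta))
    return chunks


def decode_delta(chunks: list[bytes]) -> list[bytes]:
    """Rebuild all frames from the keyframe and the stored differences."""
    frames = [zlib.decompress(chunks[0])]
    for chunk in chunks[1:]:
        delta = zlib.decompress(chunk)
        frames.append(bytes((p + d) % 256 for p, d in zip(frames[-1], delta)))
    return frames


frames = make_frames()
independent_size = sum(len(c) for c in encode_independent(frames))
delta_chunks = encode_delta(frames)
delta_size = sum(len(c) for c in delta_chunks)

assert decode_delta(delta_chunks) == frames  # this toy scheme is lossless
print(f"independent: {independent_size} bytes, keyframe + deltas: {delta_size} bytes")
```

Because only a few pixels change between consecutive frames, each difference is almost entirely zeros and compresses far better than the full frame. The same property holds for screen recordings and agent replays, where most of the image is static from one step to the next.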