2024-08-27 · HuggingFace

Scaling robotics datasets with video encoding

research

Source: Hugging Face · Date: 2024-08-27 · URL: https://huggingface.co/blog/video-encoding

Summary

The Hugging Face LeRobot project switched from storing robotics observation frames as individual PNGs to video-encoded formats (defaulting to AV1). Video codecs rely on temporal compression: they store occasional full keyframes and encode the frames in between mostly as differences from their neighbors, rather than storing every frame as a full image. The practical outcome: datasets compress to roughly 14% of their original size on average (about 7x smaller), some simulated environments reach 0.2%, one 72.5 GB dataset compresses to 2.9 GB (about 4% of its original size), and sequential frame loading is 25–50% faster than with PNG equivalents. Critically, policy models trained on the compressed datasets match the performance of those trained on the originals: the encoding is lossy, but the loss does not show up in downstream policy performance.
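The size figures above mix two ways of expressing compression: compressed size as a percentage of the original, and how many times smaller the result is. A minimal sketch of the conversion, using only the numbers quoted in this summary (the helper names `size_ratio` and `compression_factor` are illustrative, not part of LeRobot):

```python
def size_ratio(original_gb: float, compressed_gb: float) -> float:
    """Compressed size as a fraction of the original size."""
    return compressed_gb / original_gb


def compression_factor(ratio: float) -> float:
    """How many times smaller the compressed data is (1 / ratio)."""
    return 1.0 / ratio


# The 72.5 GB -> 2.9 GB example quoted in the summary.
example_ratio = size_ratio(72.5, 2.9)
print(f"72.5 GB -> 2.9 GB: {example_ratio:.1%} of original, "
      f"{compression_factor(example_ratio):.0f}x smaller")

# The reported average (14% of original) and best case (0.2% of original).
print(f"14% of original:  about {compression_factor(0.14):.1f}x smaller")
print(f"0.2% of original: about {compression_factor(0.002):.0f}x smaller")
```

Read this way, 14% of original size corresponds to roughly a 7x reduction, and the 72.5 GB example lands near 4% rather than at the 0.2% best case.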

Implications

  • Feeds the context management divergence thread at the data-infrastructure layer: the same problem that TurboQuant addresses for KV cache (redundant storage of similar state) is solved here for robotics visual data using video codec techniques. Temporal compression as a general principle is appearing across multiple layers of the stack simultaneously.
  • Relevant to scaling open-weight robotics models: dataset size was a practical barrier to assembling large-scale robotics training sets analogous to text corpora. Compressing to roughly 14% of the original size (about a 7x reduction) makes that tractable on the same infrastructure already used for LLMs.
  • The AV1/H.265/H.264 codec comparison work is immediately applicable to any pipeline storing sequential visual data, not just robotics. Screen recording, browser automation traces, and agent action replays face the same redundancy problem.
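For a pipeline that already stores numbered PNG frames, a codec comparison along these lines could start from a small helper that builds one ffmpeg command per codec. This is a sketch under stated assumptions, not LeRobot's actual encoding code: the function name, the `frame_%06d.png` naming pattern, and the `fps`, `crf`, and `gop` defaults are illustrative choices, and the encoder names are only available if the local ffmpeg build includes them.

```python
from pathlib import Path

# Encoder names as exposed by common ffmpeg builds (assumption: your build
# was compiled with these libraries enabled).
ENCODERS = {"av1": "libsvtav1", "h265": "libx265", "h264": "libx264"}


def build_ffmpeg_cmd(frames_dir, out_path, codec="av1", fps=30, crf=30, gop=2):
    """Return an ffmpeg argument list that encodes numbered PNG frames into one video.

    All defaults are illustrative assumptions; tune crf (quality) and
    gop (keyframe interval) for your own size/quality/seek trade-off.
    """
    if codec not in ENCODERS:
        raise ValueError(f"unknown codec {codec!r}; expected one of {sorted(ENCODERS)}")
    return [
        "ffmpeg",
        "-framerate", str(fps),                           # input frame rate
        "-i", str(Path(frames_dir) / "frame_%06d.png"),   # hypothetical frame naming
        "-c:v", ENCODERS[codec],                          # video encoder
        "-pix_fmt", "yuv420p",                            # widely supported pixel format
        "-g", str(gop),                                   # keyframe interval in frames
        "-crf", str(crf),                                 # constant-quality setting
        str(out_path),
    ]


# Build (but do not run) one command per codec for a side-by-side comparison.
for name in ENCODERS:
    print(" ".join(build_ffmpeg_cmd("episode_000", f"episode_000_{name}.mp4", codec=name)))
```

Each command can then be executed with `subprocess.run(cmd, check=True)` and the output file sizes and decode times compared. A short keyframe interval trades some compression for faster random access to individual frames, which matters when training samples frames non-sequentially.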
