Build awesome datasets for video generation
Source: HuggingFace Date: 2025-02-12 URL: https://huggingface.co/blog/vid_ds_scripts
Summary
Tooling guide: three-stage pipeline for curating video datasets for fine-tuning video generation models. Stage 1: acquisition via yt-dlp with scene splitting. Stage 2: frame-level filtering (watermark detection, aesthetic scoring, NSFW) and video-level motion scoring via OpenCV optical flow. Stage 3: captioning via Florence-2 and Qwen2.5-VL. Real filtering data: starting from 1,493 videos, applying watermark < 0.1 and aesthetic > 5.5 retains only 47 videos (3.15%). Code open-sourced at huggingface/video-dataset-scripts.
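The threshold filter and the retention figure above can be sketched in a few lines. This is a minimal illustration, not the code from huggingface/video-dataset-scripts: the `VideoScores` record, its field names, and the helper functions are hypothetical, and it assumes the frame-level watermark and aesthetic scores have already been aggregated to one value per video (the summary does not say how that aggregation is done).

```python
from dataclasses import dataclass


@dataclass
class VideoScores:
    """Hypothetical per-video record; field names are assumptions."""
    video_id: str
    watermark: float  # assumed: watermark score, lower means less likely watermarked
    aesthetic: float  # assumed: aesthetic score, higher is better


def filter_videos(videos, max_watermark=0.1, min_aesthetic=5.5):
    """Keep videos with watermark < max_watermark and aesthetic > min_aesthetic.

    Both comparisons are strict, matching the thresholds quoted in the summary.
    """
    return [
        v for v in videos
        if v.watermark < max_watermark and v.aesthetic > min_aesthetic
    ]


def retention_rate(kept, total):
    """Return the share of videos that survive filtering, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * kept / total


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    sample = [
        VideoScores("a", watermark=0.02, aesthetic=6.1),  # passes both
        VideoScores("b", watermark=0.40, aesthetic=6.5),  # fails watermark
        VideoScores("c", watermark=0.05, aesthetic=4.9),  # fails aesthetic
    ]
    print([v.video_id for v in filter_videos(sample)])
    # Figures reported in the summary: 47 of 1,493 videos retained.
    print(f"{retention_rate(47, 1493):.2f}%")
```

With the reported counts, `retention_rate(47, 1493)` reproduces the 3.15% figure.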
Implications
Open-weights ecosystem health. Fine-tuning video generation models requires high-quality curated datasets, and that curation work was previously opaque. Publishing a reproducible filtering pipeline with actual retention data (a survival rate of about 3% is a useful calibration point) lowers the barrier for teams that want to build domain-specific video generation datasets.
Transformers library trajectory. The use of Florence-2 and Qwen2.5-VL as captioning components in a dataset pipeline illustrates an ecosystem pattern: vision-language models serve as a pipeline step, not just an end product. This is a useful signal that multimodal models are maturing into infrastructure roles within ML workflows.
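As a rough illustration of how the retention figure can serve as a calibration point, the sketch below estimates how many raw videos would need to be acquired to end up with a target number after filtering. The function name is hypothetical, and the estimate rests on a strong assumption: that a new source retains videos at the same rate as the reported 47 of 1,493, which may not hold for other domains or thresholds.

```python
def raw_videos_needed(target_kept, kept_ref=47, total_ref=1493):
    """Estimate raw videos to acquire so that about target_kept survive filtering.

    Assumes the reference retention ratio kept_ref / total_ref carries over
    to the new source. Uses integer ceiling division to avoid float rounding.
    """
    if target_kept < 0:
        raise ValueError("target_kept must be non-negative")
    if kept_ref <= 0 or total_ref < kept_ref:
        raise ValueError("reference counts must satisfy 0 < kept_ref <= total_ref")
    # Ceiling of target_kept * total_ref / kept_ref.
    return -(-target_kept * total_ref // kept_ref)


if __name__ == "__main__":
    # Under the stated assumption, 1,000 curated videos would need
    # roughly 31,766 raw videos.
    print(raw_videos_needed(1000))
```

The number is a planning estimate only; loosening either threshold changes the ratio and therefore the result.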