2024-09-23 · HuggingFace

FineVideo: behind the scenes

models · research

Source: HuggingFace · Date: 2024-09-23 · URL: https://huggingface.co/blog/fine-video

Summary

Dataset release: FineVideo, an annotated video dataset of 43K videos (3.4K hours) with scene splits, QA pairs, structured metadata (activities, objects, mood, dynamism scores), and narrative descriptions. The videos were filtered from 1.9M YouTube-Commons videos in four stages: English-language filtering, dynamic content scoring, classification into a 126-category taxonomy (Llama 3.1 70B), and diversity-balanced selection. Annotation used Gemini 1.5 Pro for video understanding and GPT-4o for schema alignment, at a cost of more than $5 per hour of video. Key constraint: videos longer than 10 minutes were dropped because Gemini's annotation quality degraded on them.
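As a rough illustration of how such a staged filter fits together, here is a minimal Python sketch. It is not the FineVideo pipeline: the `Video` fields, the `min_dynamism` threshold, and the per-category cap used to stand in for "diversity-balanced selection" are all hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical record; field names are assumptions, not FineVideo's schema.
    video_id: str
    language: str    # detected language code, e.g. "en"
    dynamism: float  # assumed dynamic-content score in [0, 1]
    category: str    # taxonomy label assigned by an LLM classifier


def select_videos(videos, min_dynamism=0.5, per_category_cap=2):
    """Keep English, sufficiently dynamic videos, capped per category.

    The threshold and the cap are illustrative values only.
    """
    kept_per_category = defaultdict(int)
    selected = []
    # Visit the most dynamic videos first so the cap keeps the best ones.
    for video in sorted(videos, key=lambda v: v.dynamism, reverse=True):
        if video.language != "en":
            continue  # stage 1: language filter
        if video.dynamism < min_dynamism:
            continue  # stage 2: dynamic content score
        if kept_per_category[video.category] >= per_category_cap:
            continue  # stages 3-4: balance across taxonomy categories
        kept_per_category[video.category] += 1
        selected.append(video)
    return selected
```

A per-category cap is only one simple way to balance a selection; the source does not say which balancing method was actually used.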

Implications

Open-weights ecosystem health. FineVideo is the kind of large-scale annotated video dataset that is expensive and labor-intensive to build: at more than $5 per hour of video, annotating 3.4K hours costs at least about $17,000, before any filtering or re-annotation overhead. Publishing it openly substantially lowers the training-data barrier for video understanding models. The multi-LLM annotation pipeline (Gemini for understanding, GPT-4o for structuring) is also a reusable pattern for building structured video datasets.
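The cost estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it treats the reported ">$5/hour" as a floor of $5 per hour of video and covers only the final 3.4K hours, so the real spend is likely higher. The function name is made up for this example.

```python
def annotation_cost_floor(video_hours: float, usd_per_hour: float) -> float:
    """Lower bound on annotation spend for a given per-video-hour rate."""
    return video_hours * usd_per_hour


hours = 3_400     # 3.4K hours of video, from the summary
rate_floor = 5.0  # ">$5/hour" treated as a minimum rate (assumption)

print(f"at least ${annotation_cost_floor(hours, rate_floor):,.0f}")
# prints "at least $17,000"
```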

HF as open-source ML hub. FineVideo landing on HF Datasets with the full curation pipeline documented continues the pattern of high-quality research datasets being published on HF as the canonical distribution point. The announced follow-up (training a multimodal LLM on FineVideo with public weights and recipes) will create a downstream open model release that references this dataset — a self-reinforcing citation loop that increases the dataset’s visibility.
