2024-10-23 · HuggingFace

CinePile 2.0 - making stronger datasets with adversarial refinement

models · research

Source: HuggingFace · Date: 2024-10-23 · URL: https://huggingface.co/blog/cinepile2

Summary

Dataset release and research summary. CinePile 2.0 is a long-form video QA dataset (~300k train, 5k test) derived from YouTube movie clips, released together with a novel Adversarial Refinement pipeline for improving dataset quality. The method uses a “Deaf-Blind LLM” (LLaMA 3.1 70B), which is given no video context, to detect questions that are answerable without visual context; GPT-4 then rewrites each flagged question until the blind model performs at chance. The pipeline successfully refined 90.24% of the degenerate question-answer pairs in the test set. The best open-source model (LLaVA-OV) scores 49.34%, and human performance exceeds the best commercial models by ~25%.
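The detect-and-rewrite loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `QAPair`, `blind_accuracy`, `refine`, the `margin` threshold, the number of trials, and the round limit are all hypothetical, and the blind model and rewriter are passed in as plain callables standing in for LLaMA 3.1 70B and GPT-4. Shuffling the answer choices on each trial is likewise an assumption, used here so that a fixed position preference in the blind model is not mistaken for a text shortcut.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class QAPair:
    question: str
    choices: List[str]
    answer_idx: int  # index of the correct choice in `choices`


# Hypothetical interfaces (assumptions, not a real API):
# a blind model sees only text and returns the index of its chosen answer;
# a rewriter takes a flagged pair and returns a revised pair.
BlindModel = Callable[[str, List[str]], int]
Rewriter = Callable[[QAPair], QAPair]


def blind_accuracy(qa: QAPair, blind_model: BlindModel,
                   trials: int = 5,
                   rng: Optional[random.Random] = None) -> float:
    """Fraction of trials in which the text-only model picks the right answer.

    Choices are shuffled on every trial; the picked position is mapped back
    to the original index before comparing with `answer_idx`.
    """
    rng = rng or random.Random(0)
    correct = 0
    for _ in range(trials):
        order = list(range(len(qa.choices)))
        rng.shuffle(order)
        shuffled = [qa.choices[i] for i in order]
        picked = blind_model(qa.question, shuffled)
        if order[picked] == qa.answer_idx:
            correct += 1
    return correct / trials


def refine(qa: QAPair, blind_model: BlindModel, rewriter: Rewriter,
           max_rounds: int = 5, trials: int = 5,
           margin: float = 0.2, seed: int = 0) -> Tuple[QAPair, bool]:
    """Rewrite `qa` until the blind model is within `margin` of chance.

    Returns the final pair and whether it passed the blind check.
    `margin` and `max_rounds` are illustrative defaults, not published values.
    """
    rng = random.Random(seed)
    for round_no in range(max_rounds + 1):
        chance = 1.0 / len(qa.choices)
        if blind_accuracy(qa, blind_model, trials, rng) <= chance + margin:
            return qa, True
        if round_no < max_rounds:
            qa = rewriter(qa)
    return qa, False
```

Under these assumptions, a pair that still fails after `max_rounds` rewrites comes back with `False`, so the caller can decide whether to drop it or keep it out of a "hard" subset.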

Implications

Open-weights ecosystem health. The Adversarial Refinement methodology is generalizable: any multimodal dataset susceptible to text-only shortcuts could apply the same pipeline. The human-model gap (25% over Gemini 1.5 Pro on the hard split, 65% over open-source models) suggests that long-form video understanding remains a wide-open research front, one where better training data may matter more than model scale alone.

Model release cadence (multimodal). As a benchmark, CinePile 2.0 suggests that most multimodal models rely heavily on text-based pattern matching in video QA. The 15-20% accuracy drop on the adversarially filtered hard split is a signal that reported video-understanding benchmark scores are inflated by dataset artifacts; calibrate multimodal model claims accordingly.
