• Author(s): Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein

The paper introduces CinePile, a new dataset and benchmark designed to address the limitations of existing datasets for long-form video understanding. Many existing benchmarks fail to pose genuine long-form comprehension challenges: their questions can often be answered by inspecting just a few random frames of a video. CinePile counters this by providing a question-answer dataset whose questions require understanding the video content as a whole.

CinePile comprises 305,000 multiple-choice questions (MCQs) covering a broad range of visual and multimodal aspects, including temporal comprehension, human-object interactions, and reasoning about events and actions within a scene. The questions were generated by large language models (LLMs) operating on human-authored raw data, with human-in-the-loop processes to ensure their quality and relevance.
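To make the MCQ format concrete, here is a minimal sketch of what a single question item might look like. The field names and rendering format are illustrative assumptions, not CinePile's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """Hypothetical multiple-choice item; fields are illustrative,
    not CinePile's real schema."""
    question: str
    choices: list[str]   # candidate answers
    answer_idx: int      # index of the correct choice

    def as_prompt(self) -> str:
        """Render the item as a letter-labelled prompt for a model."""
        lines = [self.question]
        for i, choice in enumerate(self.choices):
            lines.append(f"{chr(ord('A') + i)}. {choice}")
        return "\n".join(lines)

item = MCQItem(
    question="Why does the character leave the room?",
    choices=["To answer a call", "To chase someone", "To hide", "To eat"],
    answer_idx=1,
)
print(item.as_prompt())
```

A model under evaluation would receive such a prompt (alongside the video frames) and be asked to output one of the letter labels.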

The paper also evaluates recent video-centric LLMs, both open-source and proprietary, on CinePile's test split. The findings show that even state-of-the-art video-centric LLMs lag well behind human performance on these tasks, highlighting the inherent difficulty of long-form video understanding and underscoring the need for more capable models and techniques to close the gap.
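The standard metric for such an MCQ benchmark is accuracy over the test split. The following is a minimal scoring sketch under the assumption that model predictions and reference answers are both choice indices; it is not the paper's actual evaluation harness:

```python
def mcq_accuracy(predictions: list[int], answers: list[int]) -> float:
    """Fraction of items where the predicted choice index matches
    the reference answer. Empty input yields 0.0."""
    if len(predictions) != len(answers):
        raise ValueError("prediction/answer length mismatch")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Example: 3 of 4 predictions match the reference answers.
print(mcq_accuracy([0, 2, 1, 3], [0, 2, 1, 1]))  # → 0.75
```

In practice a harness would also have to parse each model's free-form text output into a choice index before scoring, which is itself a source of evaluation noise.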

CinePile is a significant step forward for video understanding, providing a robust benchmark for assessing the capabilities of video-centric LLMs. By focusing on genuine long-form comprehension, it aims to drive progress toward models that can understand complex video content, and it is well positioned to become a standard tool for researchers and developers working on video understanding and related applications.