• Author(s): Tianchen Zhao, Tongcheng Fang, Enshu Liu, Wan Rui, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

The paper introduces the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a benchmark designed to rigorously evaluate Video Large Multi-modal Models (Video-LMMs) across diverse real-world video contexts. Recent advances have enabled these models to power applications ranging from robotics and AI assistants to medical imaging and autonomous vehicles, making their integration into daily life increasingly consequential. That integration, in turn, demands that the models exhibit human-like reasoning and robust interaction capabilities in complex, real-world environments.

Existing benchmarks for Video-LMMs mainly assess general video comprehension and largely overlook complex video reasoning and robustness to user-generated text prompts. CVRR-ES addresses this gap by testing how well Video-LMMs handle intricate video content across 11 distinct real-world video dimensions. The evaluation covers nine recent Video-LMMs, both open-source and closed-source, and reveals that open-source models in particular frequently falter in robustness and in reasoning over complex videos.

In response to these findings, the authors propose a novel, training-free Dual-Step Contextual Prompting (DSCP) technique that improves the performance of existing Video-LMMs without any fine-tuning. This technique, together with the insights from the comprehensive evaluation, paves the way for next-generation human-centric AI systems with improved robustness and advanced reasoning abilities. The paper thus offers both diagnostic tools and a practical remedy for researchers and developers refining AI interactions in complex video environments.
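Since DSCP is training-free, it can be understood as a prompting wrapper around an existing Video-LMM. The minimal sketch below illustrates the two-step idea under stated assumptions: `ask` is a hypothetical interface that sends one text prompt (with the video handled by the underlying model call) and returns the model's reply, and the prompt wording is illustrative, not the paper's exact prompts.

```python
from typing import Callable


def dscp_answer(ask: Callable[[str], str], question: str) -> str:
    """Sketch of training-free dual-step contextual prompting.

    Assumptions: `ask` wraps a single Video-LMM query on a fixed video;
    the prompt texts below are placeholders, not the paper's wording.
    """
    # Step 1: elicit a grounded description of the video context,
    # nudging the model to flag uncertainty rather than hallucinate.
    context = ask(
        "Describe the key objects, actions, and their temporal order in "
        "this video. If something is ambiguous or not visible, say so."
    )
    # Step 2: answer the user's question conditioned on the elicited
    # context, asking the model to reason before committing to an answer.
    return ask(
        f"Video context: {context}\n"
        f"Question: {question}\n"
        "Reason step by step using only the context above, then answer. "
        "If the question rests on a false premise about the video, say so."
    )
```

No model weights or training are touched; only two inference calls are made per question, which is why the technique applies uniformly to both open-source and closed-source Video-LMMs.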