Author(s) :
Juhong Min,
Shyamal Buch,
Arsha Nagrani,
Minsu Cho,
Cordelia Schmid

MoReVQA is a framework for video question answering (videoQA) that aims to improve both interpretability and performance. Unlike traditional single-stage planning methods, MoReVQA uses a multi-stage, modular reasoning approach. It consists of three stages: an event parser, a grounding stage, and a final reasoning stage, all integrated with an external memory.

Each stage operates in a training-free manner, relying on few-shot prompting of large models. This design lets MoReVQA produce interpretable intermediate outputs at every step, which makes the reasoning process inspectable. By decomposing the underlying planning and task complexity, MoReVQA improves over previous methods on standard videoQA benchmarks, including NExT-QA, iVQA, EgoSchema, and ActivityNet-QA, achieving state-of-the-art results.
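To make the three-stage structure concrete, here is a minimal Python sketch of how an event parser, a grounding stage, and a reasoning stage might share one external memory. It is an illustration only and not the authors' implementation: every name in it (`Memory`, `few_shot_prompt`, `answer_question`, `toy_model`) is hypothetical, the "large model" is replaced by a rule-based stub, and per-frame text captions stand in for actual visual content.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any callable that maps a prompt string to a text reply.
Model = Callable[[str], str]


@dataclass
class Memory:
    """Shared external memory that every stage reads from and writes to."""
    question: str
    events: List[str] = field(default_factory=list)
    grounded_frames: List[int] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)  # intermediate outputs, kept for inspection


def few_shot_prompt(examples: List[str], query: str) -> str:
    """Build a prompt from in-context examples followed by the query (no training involved)."""
    return "\n\n".join(examples + [query])


def event_parsing_stage(memory: Memory, model: Model, examples: List[str]) -> None:
    """Stage 1: extract the events mentioned in the question."""
    reply = model(few_shot_prompt(examples, f"Question: {memory.question}\nEvents:"))
    memory.events = [e.strip() for e in reply.split(";") if e.strip()]
    memory.trace.append(f"events={memory.events}")


def grounding_stage(memory: Memory, captions: Dict[int, str], model: Model,
                    examples: List[str]) -> None:
    """Stage 2: keep the frames whose (hypothetical) caption matches a parsed event."""
    for idx in sorted(captions):
        for event in memory.events:
            query = f"Caption: {captions[idx]}\nEvent: {event}\nMatch (yes/no):"
            if model(few_shot_prompt(examples, query)).strip().lower().startswith("yes"):
                memory.grounded_frames.append(idx)
                break  # one matching event is enough to keep the frame
    memory.trace.append(f"grounded_frames={memory.grounded_frames}")


def reasoning_stage(memory: Memory, captions: Dict[int, str], model: Model,
                    examples: List[str]) -> str:
    """Stage 3: answer the question using only the grounded frames."""
    context = " ".join(captions[i] for i in memory.grounded_frames)
    query = f"Context: {context}\nQuestion: {memory.question}\nAnswer:"
    answer = model(few_shot_prompt(examples, query))
    memory.trace.append(f"answer={answer}")
    return answer


def answer_question(question: str, captions: Dict[int, str],
                    model: Model) -> Tuple[str, Memory]:
    """Run the three stages in order over a single shared memory."""
    memory = Memory(question=question)
    event_parsing_stage(memory, model, examples=[])
    grounding_stage(memory, captions, model, examples=[])
    answer = reasoning_stage(memory, captions, model, examples=[])
    return answer, memory


def toy_model(prompt: str) -> str:
    """Rule-based stand-in for a large model, used only so the sketch runs offline."""
    query = prompt.split("\n\n")[-1]
    fields = dict(line.split(": ", 1) for line in query.splitlines() if ": " in line)
    if query.endswith("Events:"):
        return "opens door; picks up cup"
    if query.endswith("Match (yes/no):"):
        return "yes" if fields["Event"] in fields["Caption"] else "no"
    return "a cup" if "picks up cup" in fields.get("Context", "") else "unknown"


demo_captions = {
    0: "a person walks in a hallway",
    1: "a person opens door",
    2: "a person picks up cup",
}
demo_answer, demo_memory = answer_question(
    "What does the person pick up?", demo_captions, toy_model
)
```

After the call, `demo_memory.trace` holds one entry per stage, which is the kind of intermediate output the summary describes as interpretable. Swapping `toy_model` for a real prompted model would leave the stage structure unchanged.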

A key limitation of existing modular methods is their reliance on a single planning stage that is often ungrounded in visual content. This can lead to brittle behavior, especially in complex videoQA settings. MoReVQA’s multi-stage approach is designed to address this weakness, improving the robustness and performance of videoQA systems.

The framework also extends to related tasks, such as grounded videoQA and paragraph captioning, where it likewise performs well. By combining external memory with few-shot prompting, MoReVQA offers a reference point for interpretable and adaptable videoQA systems.

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering