• Author(s): Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

Generating high-quality audio descriptions (AD) for movies is a complex task that demands detailed visual comprehension and an awareness of characters and their identities. Current visual language models designed for AD generation are limited by a scarcity of suitable training data and a lack of evaluation measures tailored to AD.

This paper addresses these challenges with three contributions. First, the authors propose two methods for constructing AD datasets with aligned video data, which will be made publicly available to support further work in the field. Second, they introduce a Q-former-based architecture that ingests raw video and generates AD using frozen pre-trained visual encoders and large language models. Third, they provide new evaluation metrics, designed specifically to benchmark AD quality, that align well with human performance.

By combining these contributions, the authors achieve state-of-the-art performance in audio description generation for movies. The proposed datasets, models, and evaluation metrics support more accurate and engaging audio descriptions, improving the accessibility and enjoyment of visual media for people with visual impairments.