• Author(s) : Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

Introducing MA-LMM, a Memory-Augmented Large Multimodal Model designed for long-term video understanding. Unlike existing LLM-based multimodal models, which can process only a small number of frames from short videos, MA-LMM tackles the challenge of understanding extended video content. It does so by processing videos in an online manner, analyzing frames sequentially as the video plays and storing the salient information in a memory bank.
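To make the online loop concrete, here is a minimal sketch of the idea: frames are encoded one at a time and pushed into a fixed-capacity bank that, on overflow, merges its most similar adjacent entries instead of dropping them. The class and function names (`MemoryBank`, `process_stream`, `frame_encoder`) and the mean-pooled cosine-similarity merge rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Fixed-size store of per-frame features, compressed on overflow.

    Sketch only: when the bank exceeds `capacity`, the two most similar
    adjacent entries are averaged into one, so older content is
    summarized rather than discarded.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.features = []  # list of [num_tokens, dim] tensors, one per frame

    def add(self, frame_feat: torch.Tensor) -> None:
        self.features.append(frame_feat)
        if len(self.features) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Cosine similarity between temporally adjacent frame features.
        pooled = torch.stack([f.mean(dim=0) for f in self.features])   # [T, dim]
        sims = F.cosine_similarity(pooled[:-1], pooled[1:], dim=-1)    # [T-1]
        i = int(sims.argmax())                                         # most redundant pair
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i:i + 2] = [merged]

    def as_tensor(self) -> torch.Tensor:
        return torch.stack(self.features)  # [<=capacity, num_tokens, dim]


def process_stream(frames, frame_encoder, bank: MemoryBank) -> torch.Tensor:
    """Online processing: encode each incoming frame and push it into the bank.
    `frame_encoder` stands in for a frozen visual backbone (e.g. a ViT)."""
    for frame in frames:
        with torch.no_grad():
            frame_feat = frame_encoder(frame.unsqueeze(0)).squeeze(0)
        bank.add(frame_feat)
    return bank.as_tensor()
```

Because compression keeps the bank at a constant length, the cost of storing a one-hour video is the same as that of storing a one-minute clip; only the granularity of the retained history changes.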

This memory bank lets MA-LMM reference historical video content without being constrained by the context-length limits of the underlying LLM or by GPU memory capacity. It is designed to plug into existing multimodal LLMs, enhancing their capabilities with minimal changes.
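One way to picture that integration point is a Q-Former-style connector whose learned queries cross-attend to the compact memory bank rather than to every raw frame token, so the sequence handed to the LLM stays short and fixed. The sketch below assumes hypothetical names and dimensions (`QueryOverMemory`, `proj_to_llm`, 32 queries, 768-d features); it is not the authors' code.

```python
import torch
import torch.nn as nn

class QueryOverMemory(nn.Module):
    """Learned queries cross-attend to the memory bank instead of to all
    raw frame tokens, producing a fixed-length prefix for the LLM.
    Illustrative stand-in for a Q-Former-style connector step."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, dim)  # placeholder projection into LLM space

    def forward(self, memory_bank: torch.Tensor) -> torch.Tensor:
        # memory_bank: [capacity, num_tokens, dim] -> one key/value sequence
        kv = memory_bank.reshape(1, -1, memory_bank.size(-1))
        q = self.queries.unsqueeze(0)                 # [1, num_queries, dim]
        fused, _ = self.cross_attn(q, kv, kv)         # [1, num_queries, dim]
        return self.proj_to_llm(fused)                # fixed-length visual prefix
```

Since the bank has a fixed capacity, the number of key/value tokens the queries attend to stays constant regardless of video length, which is what keeps the LLM context and GPU memory usage bounded.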

Through extensive experiments on a variety of video understanding tasks, including long-video understanding, video question answering, and video captioning, MA-LMM achieves state-of-the-art results across multiple datasets, demonstrating its effectiveness and versatility in long-term video analysis. With MA-LMM, we take a significant step toward unlocking the potential of multimodal models for comprehensive video understanding, enabling a deeper and more nuanced interpretation of video content.