• Author(s) : Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu

The visual encoder plays a crucial role in determining the performance of multimodal large language models (MLLMs) in understanding diverse image content. While large-scale pretrained vision encoders, such as those in CLIP and DINOv2, have shown promising results, no single vision encoder consistently excels across various image content types. For example, the CLIP vision encoder demonstrates outstanding performance in general image understanding but struggles with document or chart content.

To address the limitations of the CLIP vision encoder, the authors propose MoVA, an innovative and powerful MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, MoVA employs a context-aware expert routing strategy to dynamically select the most suitable vision experts based on the user instruction, the input image, and each expert's area of expertise. This strategy leverages the strong function-understanding ability of the large language model (LLM), which is equipped with an expert-routing low-rank adaptation (LoRA) module so it can judge which experts fit the current multimodal context.
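The routing idea can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the expert pool, descriptions, and the keyword-overlap score below are all hypothetical stand-ins for the LoRA-tuned LLM's relevance judgment over the instruction, image, and expert descriptions.

```python
# Hypothetical expert pool; names and descriptions are illustrative only.
EXPERT_POOL = {
    "clip": "general natural image understanding",
    "dinov2": "fine-grained visual grounding and dense features",
    "pix2struct": "document screenshot and chart parsing",
    "sam": "segmentation masks and object boundaries",
}

def route_experts(instruction: str, top_k: int = 2) -> list[str]:
    """Coarse-grained routing sketch: pick the top_k experts whose
    descriptions best match the instruction. A real system would obtain
    these relevance scores from the LoRA-equipped LLM instead of the
    toy keyword overlap used here."""
    words = set(instruction.lower().split())
    scores = {
        name: len(words & set(desc.split()))
        for name, desc in EXPERT_POOL.items()
    }
    # Rank by overlap (descending), break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_k]
```

For example, an instruction mentioning a chart inside a document would route to the document/chart expert first, while a generic caption request would fall back to the general-purpose encoder.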

In the fine-grained stage, MoVA utilizes a mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively harnesses representations from experts based on multimodal context and model expertise, further enhancing the generalization ability of the MLLM.
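The fusion step can be illustrated as a soft, context-conditioned mixture over the selected experts' features. This is a minimal sketch, not the paper's MoV-Adapter architecture: the pooled-context dot-product gating below is an assumed simplification of the adapter's learned gating network.

```python
import numpy as np

def mov_adapter_fuse(expert_feats: np.ndarray, context: np.ndarray):
    """Toy fine-grained fusion sketch.

    expert_feats: (num_experts, num_tokens, dim) token features from each
                  selected vision expert.
    context:      (dim,) pooled multimodal context vector (assumed here;
                  the real adapter conditions on richer context).
    Returns the fused (num_tokens, dim) features and the gating weights.
    """
    pooled = expert_feats.mean(axis=1)            # (num_experts, dim)
    logits = pooled @ context                     # relevance per expert
    weights = np.exp(logits - logits.max())       # stable softmax
    weights /= weights.sum()
    # Weighted sum of expert token features -> fused representation.
    fused = np.einsum("e,end->nd", weights, expert_feats)
    return fused, weights
```

The design choice mirrored here is that fusion is soft rather than a hard switch: every routed expert contributes, with the context deciding how much.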

The effectiveness of MoVA is validated through extensive experiments on a wide range of challenging multimodal benchmarks. Without any additional techniques, MoVA achieves significant performance improvements over current state-of-the-art methods, demonstrating its superior ability to understand and process diverse image content.

In conclusion, MoVA marks a significant advancement in multimodal large language models: by routing and fusing multiple vision experts through its coarse-to-fine approach, it offers a powerful and adaptable way to improve image content understanding.