• Author(s) : Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

Vision-language models (VLMs) have made significant strides in recent years, combining vision encoders like CLIP with language models to tackle various downstream tasks. However, these models still face challenges due to the limitations of their vision encoders, such as inability to detect certain image features and tendency to hallucinate visual elements.

To overcome these hurdles, a group of researchers conducted a comprehensive study on expanding the visual encoding capabilities of VLMs. They benchmarked several vision encoders with different inductive biases to assess their performance on VLM tasks. Interestingly, they found that no single encoding configuration consistently outperformed others across all tasks, and encoders with different biases often achieved surprisingly similar results.

Inspired by these findings, the researchers developed BRAVE, a method that combines features from multiple frozen encoders to create a more versatile representation. This representation can be directly fed into a frozen language model, enabling BRAVE to achieve state-of-the-art performance on a wide range of captioning and visual question answering benchmarks.

One of the key advantages of BRAVE is its ability to significantly reduce the issues associated with traditional VLMs, such as blindness to certain image features and visual hallucination. Moreover, BRAVE requires fewer trainable parameters compared to existing methods and generates a more compressed representation.

The research highlights the potential benefits of incorporating different visual biases to achieve a broader and more contextualized visual understanding in VLMs. By leveraging multiple encoders, BRAVE demonstrates that it is possible to enhance the performance of these models while simultaneously addressing their limitations.

As VLMs continue to evolve and find applications in various domains, approaches like BRAVE will play a crucial role in improving their accuracy, efficiency, and robustness. This research paves the way for further advancements in the field, ultimately leading to more powerful and reliable vision-language models.