• Author(s) : Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna

The paper explores a novel approach to enhancing the semantic grounding abilities of Vision-Language Models (VLMs) without relying on domain-specific training data, fine-tuning, or modifications to the network architecture. The authors propose a feedback mechanism based on a binary signal: when prompted appropriately, VLMs can exploit this feedback both in a single step and iteratively. This showcases the potential of feedback as an alternative technique for improving grounding in internet-scale VLMs.
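The single-step case can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_vlm`, the prompt wording, and the labels are all hypothetical stand-ins for a real VLM API call.

```python
def query_vlm(prompt: str) -> str:
    """Stand-in VLM: returns canned answers so the sketch is runnable."""
    if "your previous answer was incorrect" in prompt:
        return "dog"   # revised prediction after receiving feedback
    return "cat"       # initial (wrong) prediction

def ground_with_feedback(region: str, correct_label: str) -> str:
    """One round of binary feedback: re-prompt once if the answer is wrong."""
    prompt = f"What object is in region {region}?"
    answer = query_vlm(prompt)
    if answer != correct_label:  # binary signal: the answer was incorrect
        prompt += (
            f"\nYou answered: {answer}."
            "\nFeedback: your previous answer was incorrect. Please try again."
        )
        answer = query_vlm(prompt)
    return answer

print(ground_with_feedback("r1", "dog"))  # -> dog
```

The key design point is that the feedback carries only one bit (correct/incorrect) appended to the prompt; no gradients or weight updates are involved.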

The study also addresses the issue that VLMs struggle to self-correct errors out-of-the-box, much like Large Language Models (LLMs). To mitigate this problem, the authors introduce a binary verification mechanism that enables VLMs to identify and correct their own mistakes.
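One way such a verification step could look is sketched below: the model is re-prompted with a yes/no question about its own prediction. Again, `query_vlm` and the prompt wording are illustrative assumptions, not the paper's exact interface.

```python
def query_vlm(prompt: str) -> str:
    """Stand-in VLM: rejects the label 'cat' for region r1, accepts others."""
    return "no" if "a cat" in prompt else "yes"

def verify(region: str, predicted_label: str) -> bool:
    """Binary verification: ask the VLM to confirm its own prediction."""
    reply = query_vlm(
        f"Is the object in region {region} a {predicted_label}? "
        "Answer yes or no."
    )
    return reply.strip().lower().startswith("yes")

print(verify("r1", "cat"))  # stand-in verifier rejects 'cat' -> False
print(verify("r1", "dog"))  # -> True
```

Because the verifier only emits a binary accept/reject, it can serve directly as the feedback signal described above.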

Furthermore, the paper explores the potential and limitations of combining these findings and applying them iteratively to automatically enhance VLMs' grounding performance. The results show that grounding accuracy consistently improves with automated feedback across all models and settings investigated. The proposed iterative framework improves semantic grounding in VLMs by more than 15 accuracy points under noise-free feedback and by up to 5 accuracy points under a simple automated binary verification mechanism.
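Putting the two pieces together, the iterative framework can be sketched as a loop that alternates prediction and binary verification until the verifier accepts or a round budget is exhausted. All names and the stand-in candidate sequence are hypothetical; a real system would call an actual VLM in both roles.

```python
CANDIDATES = ["cat", "fox", "dog"]  # stand-in prediction sequence per round

def query_vlm(prompt: str, attempt: int) -> str:
    """Stand-in VLM: each feedback round yields the next candidate label."""
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]

def verify(label: str) -> bool:
    """Stand-in binary verifier: only 'dog' is accepted."""
    return label == "dog"

def iterative_grounding(region: str, max_rounds: int = 5) -> str:
    """Alternate prediction and verification until accepted or out of rounds."""
    prompt = f"What object is in region {region}?"
    answer = ""
    for attempt in range(max_rounds):
        answer = query_vlm(prompt, attempt)
        if verify(answer):  # verifier accepts -> stop early
            return answer
        prompt += (
            f"\nYou answered: {answer}. That was incorrect. Please try again."
        )
    return answer  # best effort after exhausting the round budget

print(iterative_grounding("r1"))  # -> dog (accepted on the third round)
```

Note that the loop's quality ceiling is set by the verifier: with a noise-free verifier the loop converges to the correct label whenever the VLM can eventually produce it, which is consistent with the larger gains the paper reports under noise-free feedback.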

Overall, this work presents a promising direction for enhancing the semantic grounding abilities of VLMs without additional training data or architectural modifications. The proposed feedback mechanism and binary verification system offer a novel, effective way to improve VLM performance across a range of settings.