• Author(s): Yuhang Huang, Zihan Wu, Chongyang Gao, Jiawei Peng, Xu Yang

This paper investigates the ability of Large Vision-Language Models (LVLMs) to generate detailed and accurate descriptions of visual content. While LVLMs have become increasingly capable of processing and integrating visual and textual data, their capacity to produce fine-grained descriptions remains underexplored. This work addresses that gap by examining how effectively LVLMs can distinguish between similar objects and capture visual details with high fidelity. The study focuses on two aspects of description quality: distinctiveness, a model's ability to tell similar objects apart and highlight their unique characteristics, and fidelity, the accuracy and completeness with which a description captures the visual information in the image.
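To make the two notions concrete, they can be operationalized as embedding-similarity scores. The sketch below is one possible illustration under assumed choices (the sentence-transformers encoder, the margin formulation, and the function names are not from the paper): fidelity is approximated by a generated description's similarity to a ground-truth caption, and distinctiveness by the margin between its similarity to the correct object's reference text and the closest distractor.

```python
# Illustrative proxies for fidelity and distinctiveness; the encoder choice
# and the margin formulation are assumptions for this sketch, not the
# paper's actual metrics.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed text encoder

def fidelity(generated: str, ground_truth: str) -> float:
    """Cosine similarity to a ground-truth caption: higher = more faithful."""
    emb = embedder.encode([generated, ground_truth], normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1]))

def distinctiveness(generated: str, own_ref: str, distractor_refs: list[str]) -> float:
    """Margin between similarity to the correct object's reference text and
    the closest distractor: higher = more distinctive."""
    emb = embedder.encode([generated, own_ref, *distractor_refs],
                          normalize_embeddings=True)
    sims = util.cos_sim(emb[0], emb[1:])[0]
    return float(sims[0] - sims[1:].max())
```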

To evaluate these aspects, the researchers developed Textual Retrieval-Augmented Classification (TRAC), a framework that leverages the generative capabilities of LVLMs to give a more nuanced picture of their fine-grained description ability. TRAC prompts each LVLM to generate descriptions for a set of images, then compares the outputs against a retrieval database of existing text descriptions. By measuring how well the generated descriptions match their references, TRAC assesses both the distinctiveness and the fidelity of an LVLM's output.
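A minimal sketch of how such a retrieval-based evaluation could be wired up is shown below. The one-reference-per-class database, the embedding model, and the top-1 retrieval-accuracy metric are illustrative assumptions; the paper's actual pipeline may differ.

```python
# Sketch of a TRAC-style evaluation loop: classify each generated description
# by its nearest reference text and report accuracy as a distinctiveness proxy.
# The encoder, the database layout, and the metric are assumptions made for
# this example only.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

# Hypothetical retrieval database: one reference description per object class.
reference_texts = {
    "golden_retriever": "A large dog with a dense, wavy golden coat and a feathered tail.",
    "labrador": "A sturdy dog with a short, straight yellow coat and an otter-like tail.",
}
class_names = list(reference_texts)
db = embedder.encode([reference_texts[c] for c in class_names],
                     normalize_embeddings=True)

def trac_accuracy(generated: list[str], true_classes: list[str]) -> float:
    """Fraction of descriptions whose nearest reference text
    belongs to the correct class."""
    gen = embedder.encode(generated, normalize_embeddings=True)
    predicted = [class_names[i] for i in (gen @ db.T).argmax(axis=1)]
    return float(np.mean([p == t for p, t in zip(predicted, true_classes)]))

# Descriptions an LVLM might have produced for two dog images.
outputs = [
    "The photo shows a large dog with a thick, wavy golden coat.",
    "A short-haired yellow dog with a broad head stands on the grass.",
]
print(trac_accuracy(outputs, ["golden_retriever", "labrador"]))
```

In this framing, a model whose descriptions reliably retrieve the correct reference is producing distinctive output; descriptions that retrieve well and also score high similarity against ground-truth captions are faithful as well.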

The findings offer insight into the quality of descriptions generated by LVLMs and contribute to a broader understanding of multimodal language models. Notably, MiniGPT-4 emerged as the strongest of the three models evaluated at generating fine-grained descriptions, suggesting it may be better suited to tasks that demand precise visual description, such as image captioning or automatic image tagging. By highlighting the strengths and weaknesses of different LVLMs in fine-grained description generation, this research paves the way for further development and refinement of these models.