Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
- Published on June 25, 2024 6:07 am
- Editor: Yuvraj Singh
- Author(s): Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang
“Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models” addresses the evolving landscape of Referring Expression Comprehension (REC) in light of advancements in large multimodal models (LMMs). REC is a task that involves identifying and localizing objects in images based on natural language descriptions. Traditional REC methods have been limited by their reliance on single-target expressions and their inability to handle more complex scenarios involving multiple or no targets.
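As a concrete reference point, here is a minimal sketch of the conventional single-target REC scoring protocol: a predicted box counts as correct when it overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The box format and function names are illustrative assumptions, not code from the paper.

```python
# Sketch of conventional single-target REC scoring, assuming boxes are
# given as (x1, y1, x2, y2) tuples: a prediction is correct when its
# IoU with the single ground-truth box is at least 0.5.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of expressions whose predicted box matches the single
    ground-truth box at the given IoU threshold."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```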
The authors introduce a new benchmark, Generalized Referring Expression Comprehension (GREC), which expands the scope of REC to include multi-target and no-target expressions. This benchmark is designed to better reflect real-world applications where expressions can refer to multiple objects or none at all. GREC allows for a more versatile and comprehensive evaluation of REC models, pushing the boundaries of what these models can achieve.
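To make concrete why multi-target and no-target expressions change the evaluation, the sketch below scores a single sample: a no-target expression is correct only if the model predicts no boxes at all, and a multi-target expression is correct only if every ground-truth box is matched one-to-one to a prediction at IoU ≥ 0.5 with no predictions left over. The greedy matching strategy and function name are illustrative assumptions rather than the benchmark's official protocol; the snippet reuses the `iou` helper from the previous sketch.

```python
def grec_sample_correct(pred_boxes, gt_boxes, threshold=0.5):
    """Illustrative per-sample scoring for multi-/no-target expressions.

    A no-target expression is correct only if the model outputs no boxes.
    Otherwise the sample is correct only when every ground-truth box is
    matched one-to-one to a prediction at IoU >= threshold and no
    prediction is left unmatched. Greedy matching is a simplification
    assumed here for illustration.
    """
    if not gt_boxes:                       # no-target expression
        return len(pred_boxes) == 0
    if len(pred_boxes) != len(gt_boxes):   # one-to-one matching is impossible
        return False
    unmatched = list(pred_boxes)
    for gt in gt_boxes:
        best = max(unmatched, key=lambda p: iou(p, gt))
        if iou(best, gt) < threshold:
            return False
        unmatched.remove(best)
    return True
```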
To support the development and evaluation of REC models under this new benchmark, the authors have created a dataset named gRefCOCO. This dataset extends the existing RefCOCO dataset with samples that refer to multiple targets and samples that refer to no target at all, with the aim of challenging current models and driving innovation in the field. Experimental results show that current state-of-the-art REC techniques struggle on the GREC benchmark, primarily because they are built on the assumption that each expression refers to exactly one target object. This limitation underscores the need for new methods that can handle the complexity and variability of real-world referring expressions.
“Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models” presents a significant advancement in the evaluation of REC models. By introducing the GREC benchmark and the gRefCOCO dataset, the authors provide a more comprehensive framework for assessing the capabilities of REC models, expose the limitations of current methods, and set the stage for future work aimed at improving the accuracy and robustness of REC models in practical applications.