• Author(s): Jonathan Roberts, Kai Han, Neil Houlsby, Samuel Albanie

The paper introduces SciFIBench, a benchmark for evaluating large multimodal models (LMMs) on the interpretation of scientific figures. LMMs have demonstrated flexibility and generalizability across a wide range of tasks and fields, yet their potential for aiding scientific research remains underexplored. Interpreting figures is a crucial part of that research, as figures provide a rich, compressed source of complex information.

SciFIBench comprises a 1000-question gold set of multiple-choice questions, divided into two tasks (matching figures to captions and captions to figures) across 12 categories. The questions are curated from figures and captions in computer science papers on arXiv. Curation involves adversarial filtering to select challenging negatives, followed by human verification for quality control, yielding a benchmark that is both comprehensive and challenging.
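The adversarial-filtering idea can be illustrated with a short sketch. The helper below is hypothetical (the paper does not specify this exact function): given an embedding of a figure and embeddings of candidate captions from, say, a CLIP-style encoder, it selects the captions most similar to the figure, excluding the true caption, as hard negatives for a multiple-choice question.

```python
import numpy as np

def hard_negatives(query_emb, caption_embs, correct_idx, k=4):
    """Pick the k candidate captions most similar to the query embedding,
    excluding the correct caption, to serve as challenging negatives.

    query_emb: (d,) embedding of the figure.
    caption_embs: (n, d) embeddings of candidate captions.
    correct_idx: index of the true caption in caption_embs.
    """
    # Cosine similarity between the query and every candidate caption.
    sims = caption_embs @ query_emb / (
        np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(query_emb)
    )
    sims[correct_idx] = -np.inf  # never select the true caption
    # Indices of the k most similar (hardest) distractors, best first.
    return np.argsort(sims)[::-1][:k].tolist()
```

With toy embeddings, `hard_negatives(q, embs, correct_idx=0, k=2)` returns the two distractor captions closest to the figure in embedding space, making the resulting question harder than one built from random negatives.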

The paper evaluates 26 LMMs using SciFIBench, revealing that the benchmark poses significant challenges for these models. The evaluation highlights the current limitations of LMMs in understanding and reasoning about scientific figures. Additionally, the paper investigates the alignment and reasoning faithfulness of the LMMs on augmented question sets derived from the benchmark, providing further insights into the models’ performance.
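Scoring LMMs on such multiple-choice questions typically reduces to extracting the chosen option letter from each free-form model response and comparing it with the gold answer. The sketch below is a minimal, assumed implementation of that step, not the paper's actual evaluation code.

```python
import re

def parse_choice(response):
    """Extract the first standalone option letter (A-E) from a model response,
    e.g. 'The answer is B.' -> 'B'. Returns None if no option is found."""
    m = re.search(r"\b([A-E])\b", response.strip())
    return m.group(1) if m else None

def accuracy(responses, answers):
    """Fraction of responses whose parsed option letter matches the gold answer."""
    correct = sum(parse_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)
```

A response that names no option letter parses to `None` and is simply scored as incorrect, which penalizes models that fail to follow the answer format.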

SciFIBench is released to the research community to encourage progress in developing LMMs that can effectively interpret scientific figures. By providing a standardized, challenging benchmark, it aims to drive advances in this area and, ultimately, to enhance the utility of LMMs in scientific research.