• Author(s) : Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

The paper titled “Idefics2: An Efficient Foundational Vision-Language Model” discusses the growing interest in vision-language models (VLMs), driven by advances in large language models and vision transformers. Despite the wealth of research in this area, the paper observes that critical decisions in VLM design are often made without justification, which hinders progress in the field by obscuring which choices actually improve model performance.

To tackle this issue, the paper presents extensive experiments on pre-trained models, architecture choices, data, and training methods. The culmination of these findings is Idefics2, an efficient foundational VLM with 8 billion parameters. Idefics2 delivers state-of-the-art performance within its size category across various multimodal benchmarks, often matching models four times its size.

The paper concludes with the release of the Idefics2 model (base, instructed, and chat versions) and the datasets used for its training. The work is a significant contribution to the field of vision-language models, offering a comprehensive and critical examination of VLM design and performance.
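Since the released checkpoints are meant to be used directly, a minimal sketch of querying the instructed model is shown below. It assumes the weights are hosted on the Hugging Face Hub under the `HuggingFaceM4/idefics2-8b` identifier, that a recent `transformers` version supports the Idefics2 architecture through `AutoProcessor` and `AutoModelForVision2Seq`, and that the image URL is a placeholder; treat it as an illustration rather than the authors' reference code.

```python
# Minimal sketch: querying a released Idefics2 checkpoint with Hugging Face transformers.
# Assumptions: weights are published as "HuggingFaceM4/idefics2-8b", a recent transformers
# release supports Idefics2, and the image URL below is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"  # assumed Hub identifier for the instructed version
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load an example image (placeholder URL).
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)

# Build a chat-style prompt that interleaves the image with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate and decode the model's answer.
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The same pattern should apply to the base and chat variants by swapping the model identifier, with the caveat that the base model expects plain interleaved text and images rather than the chat template.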