• Authors: Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao

The paper “Analogist: A Novel Inference-Based Visual In-Context Learning Approach” explores Visual In-Context Learning (ICL), in which a model uses analogical reasoning to perform a variety of tasks from only a limited number of example pairs. Traditional training-based visual ICL methods struggle to generalize to unseen tasks and require collecting diverse task datasets. Existing inference-based visual ICL methods, conversely, rely solely on textual prompts, which often fail to capture fine-grained contextual information, and the conversion of images into text prompts is itself inefficient.

To address these limitations, the authors introduce Analogist, an inference-based visual ICL approach that integrates visual and textual prompting on top of a text-to-image diffusion model pretrained for image inpainting: the example pair A/A′ and the query image B are arranged in a 2×2 grid whose fourth cell is masked, so that producing the output B′ becomes an inpainting task. For visual prompting, the paper proposes a self-attention cloning (SAC) method that guides fine-grained, structural-level analogies between the image examples. For textual prompting, the approach leverages GPT-4V’s visual reasoning capabilities to automatically generate informative text prompts and introduces a cross-attention masking (CAM) operation that improves the accuracy of the semantic-level analogies guided by those prompts. Both attention operations are sketched in code below.
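To make the two operations concrete, here is a minimal PyTorch sketch of the block-copy and masking ideas, not the authors’ implementation: it assumes the 2×2 grid latent is flattened row-major into a token sequence and that the attention probability tensors inside the diffusion UNet can be intervened on directly; all function and variable names are illustrative.

```python
import torch

def quadrant_indices(h: int, w: int):
    """Token indices of the four quadrants of an h x w latent grid,
    flattened row-major: A (top-left), A' (top-right),
    B (bottom-left), B' (bottom-right)."""
    ids = torch.arange(h * w).reshape(h, w)
    a  = ids[: h // 2, : w // 2].reshape(-1)
    ap = ids[: h // 2, w // 2 :].reshape(-1)
    b  = ids[h // 2 :, : w // 2].reshape(-1)
    bp = ids[h // 2 :, w // 2 :].reshape(-1)
    return a, ap, b, bp

def self_attention_cloning(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """SAC idea: copy the self-attention block relating A to A' onto the
    block relating B to B', so the structural relation of the example pair
    guides the output. attn: [batch*heads, h*w, h*w] attention probabilities."""
    a, ap, b, bp = quadrant_indices(h, w)
    attn = attn.clone()
    attn[:, b[:, None], bp[None, :]] = attn[:, a[:, None], ap[None, :]]
    return attn

def cross_attention_masking(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """CAM idea: confine the text prompt's influence to the B' quadrant by
    zeroing cross-attention at every other query position.
    attn: [batch*heads, h*w, n_text_tokens] attention probabilities."""
    _, _, _, bp = quadrant_indices(h, w)
    keep = torch.zeros(h * w, dtype=attn.dtype, device=attn.device)
    keep[bp] = 1.0
    return attn * keep[None, :, None]
```

In the method itself, such edits would be applied inside the UNet’s self- and cross-attention layers at each denoising step; the sketch only captures the index bookkeeping over the 2×2 grid.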

Analogist is designed as an out-of-the-box solution that requires no fine-tuning or optimization, making it a generic, flexible tool capable of performing a wide range of visual tasks in an in-context manner. Extensive experiments show that Analogist outperforms existing approaches both qualitatively and quantitatively.
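Because the method builds on a standard pretrained inpainting model, the training-free workflow can be sketched with the Hugging Face diffusers library. The snippet below shows only the grid-and-inpaint skeleton under assumed inputs; the image paths, the checkpoint choice, and the hand-written prompt (a stand-in for GPT-4V’s generated description) are placeholders, and the SAC/CAM attention hooks described above are omitted.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

CELL = 256  # each quadrant of the 2x2 grid is CELL x CELL pixels

def make_grid(a, a_prime, b):
    """Paste A (top-left), A' (top-right), B (bottom-left); B' stays blank."""
    grid = Image.new("RGB", (2 * CELL, 2 * CELL))
    grid.paste(a.resize((CELL, CELL)), (0, 0))
    grid.paste(a_prime.resize((CELL, CELL)), (CELL, 0))
    grid.paste(b.resize((CELL, CELL)), (0, CELL))
    return grid

def make_mask():
    """White marks the bottom-right quadrant as the region to inpaint."""
    mask = Image.new("L", (2 * CELL, 2 * CELL), 0)
    mask.paste(Image.new("L", (CELL, CELL), 255), (CELL, CELL))
    return mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # any SD inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

grid = make_grid(Image.open("A.png"), Image.open("A_prime.png"), Image.open("B.png"))
prompt = "a watercolor painting of a dog"  # placeholder for the GPT-4V prompt
result = pipe(prompt=prompt, image=grid, mask_image=make_mask(),
              height=2 * CELL, width=2 * CELL).images[0]
b_prime = result.crop((CELL, CELL, 2 * CELL, 2 * CELL))  # extract B'
b_prime.save("B_prime.png")
```

With no fine-tuning involved, swapping the example pair is all it takes to switch tasks, which is exactly the in-context behavior described above.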
This research advances visual ICL by combining the strengths of visual and textual prompting in a single inference-time method. It not only improves the efficiency and accuracy of visual task performance but also broadens the applicability of visual ICL techniques, making them more practical for real applications.