• Authors: Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Le Lu, Jiebo Luo, Ling Zhang

Medical Vision-Language Pretraining (Med-VLP) aims to bridge the gap between visual content from medical images and their corresponding textual descriptions. While existing Med-VLP methods have primarily focused on 2D images depicting single body parts, such as chest X-rays, this paper extends the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios by utilizing a multimodal dataset of CT scans and associated reports.

CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

Compared to its 2D counterpart, 3D VLP faces the challenge of capturing essential semantics from the significantly sparser representation inherent in 3D imaging: a finding described in a report may occupy only a small part of a full-body volume. To address this, the researchers introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), an approach that constructs organ-level image-text pairs for multimodal contrastive learning. Instead of matching a whole scan to a whole report, the method aligns grounded (organ-level) visual features with the precise diagnostic text that describes them.
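To make the idea of organ-level contrastive alignment concrete, here is a minimal sketch of a symmetric CLIP-style (InfoNCE) loss over matched organ-feature and text-feature pairs. It is not the authors' implementation: the function name `organ_contrastive_loss`, the temperature of 0.07, and the toy 2D vectors are assumptions chosen for illustration, and real features would come from a 3D image encoder and a text encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(logits, index):
    """Numerically stable log-softmax, evaluated at one position."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_sum

def organ_contrastive_loss(organ_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over N matched (organ feature, text feature) pairs.

    Pair i is the positive; every other pairing in the batch is a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    img = [normalize(v) for v in organ_feats]
    txt = [normalize(v) for v in text_feats]
    n = len(img)
    sim = [[dot(img[i], txt[j]) / temperature for j in range(n)] for i in range(n)]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image: each column should put its mass on the diagonal entry.
    t2i = -sum(log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy check: correctly matched pairs give a much lower loss than swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = organ_contrastive_loss(aligned, aligned)
loss_swapped = organ_contrastive_loss(aligned, swapped)
```

The loss is small when each organ feature is closest to its own description and large when the pairing is wrong, which is the signal that pushes the two modalities into a shared space.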

Additionally, the authors developed an abnormality dictionary that augments the contrastive learning process with diverse negative samples. According to the authors, this further improves the model’s ability to identify organs and abnormalities from natural language descriptions.
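One way to picture dictionary-augmented negatives is sketched below. The source does not describe how the dictionary is built or sampled, so everything here is an assumption: the dictionary entries are illustrative placeholders, `sample_negatives` draws from descriptions of the same organ, and `loss_with_negatives` is a generic one-positive-versus-many-negatives contrastive loss rather than the paper's exact objective.

```python
import math
import random

# Hypothetical dictionary mapping an organ to abnormality descriptions.
# The entries are illustrative only and are not taken from the paper.
ABNORMALITY_DICT = {
    "liver": ["liver cyst", "fatty liver", "liver calcification"],
    "kidney": ["kidney stone", "kidney cyst"],
}

def sample_negatives(organ, positive_text, k, rng):
    """Pick up to k descriptions for this organ that differ from the positive."""
    candidates = [t for t in ABNORMALITY_DICT[organ] if t != positive_text]
    return rng.sample(candidates, min(k, len(candidates)))

def loss_with_negatives(sim_pos, sim_negs, temperature=0.07):
    """Negative log-probability of the positive among positive + negatives.

    sim_pos and sim_negs are cosine similarities between one organ feature
    and the embedded positive / negative descriptions.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

rng = random.Random(0)
negatives = sample_negatives("liver", "fatty liver", k=2, rng=rng)

# Harder negatives (similarities close to the positive) produce a larger loss.
easy = loss_with_negatives(0.9, [0.1, 0.2])
hard = loss_with_negatives(0.9, [0.8, 0.85])
```

Drawing negatives from a dictionary rather than only from the current batch means the model sees plausible but wrong descriptions for the same organ, which is a harder discrimination task than telling a liver finding from a kidney finding.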

CT-GLIP was trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs. The model was validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The reported results show CT-GLIP outperforming the standard CLIP framework in both zero-shot and fine-tuning scenarios, with both CNN and ViT architectures.
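For readers unfamiliar with zero-shot evaluation in a vision-language model, the following sketch shows the general mechanism: an organ feature is compared against embedded text prompts and the closest prompt is taken as the prediction. This is a generic illustration, not the authors' evaluation protocol; the prompt strings, the function `zero_shot_label`, and the hand-written 3D embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(organ_feat, prompt_embeddings):
    """Return the prompt whose embedding is most similar to the organ feature."""
    return max(
        prompt_embeddings,
        key=lambda prompt: cosine(organ_feat, prompt_embeddings[prompt]),
    )

# Hypothetical prompt embeddings; a real system would obtain these from the
# pretrained text encoder instead of writing them by hand.
prompts = {
    "normal liver": [1.0, 0.0, 0.0],
    "liver cyst": [0.0, 1.0, 0.0],
}

# Hypothetical organ-level visual feature from the image encoder.
organ_feature = [0.1, 0.9, 0.2]
prediction = zero_shot_label(organ_feature, prompts)
```

No classifier is trained for the target abnormalities in this setting; the prediction relies entirely on how well pretraining aligned the visual and textual embeddings.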

This research extends medical vision-language pretraining from 2D images to 3D imaging modalities, and it points toward more accurate and efficient diagnosis and analysis of full-body medical scans.