“Sigma: Siamese Mamba Network for Multi-modal Semantic Segmentation” presents a novel approach to multi-modal semantic segmentation built on a Siamese Mamba network. The authors propose pairing traditional RGB input with an additional modality (X-modality), such as thermal or depth, to enhance AI agents’ perception and scene understanding, particularly in challenging conditions like low-light or overexposed environments.

The Sigma model employs Mamba, a Selective Structured State Space Model, to achieve a global receptive field with linear complexity, addressing the limited local receptive fields of Convolutional Neural Networks (CNNs) and the quadratic complexity of Vision Transformers (ViTs). A Siamese encoder extracts features from each modality, and a Mamba-based fusion mechanism selects and integrates the essential information across modalities. A specialized decoder is also developed to enhance the model’s channel-wise modeling ability.
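To make the two-branch design concrete, here is a minimal PyTorch sketch of a Siamese encoder with a simple fusion step and a lightweight decoder. The convolutional encoder and the 1×1-convolution fusion below are illustrative stand-ins only; the actual Sigma model uses Mamba (selective state space) blocks in both the encoder and its fusion mechanism.

```python
import torch
import torch.nn as nn

class SiameseMultiModalSegmenter(nn.Module):
    """Minimal sketch of a Siamese two-branch segmenter.

    The shared-weight encoder and 1x1-conv fusion are hypothetical
    stand-ins for Sigma's Mamba blocks and Mamba fusion mechanism.
    """

    def __init__(self, num_classes: int, dim: int = 64):
        super().__init__()
        # Shared-weight encoder applied to both modalities (Siamese).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Illustrative fusion: concatenate branch features and project
        # back to `dim` channels (Sigma instead fuses with Mamba blocks).
        self.fuse = nn.Conv2d(2 * dim, dim, 1)
        # Decoder upsamples fused features to per-pixel class logits.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, num_classes, 1),
        )

    def forward(self, rgb: torch.Tensor, x_mod: torch.Tensor) -> torch.Tensor:
        f_rgb = self.encoder(rgb)    # RGB branch
        f_x = self.encoder(x_mod)    # X-modality branch (thermal/depth)
        fused = self.fuse(torch.cat([f_rgb, f_x], dim=1))
        return self.decoder(fused)


if __name__ == "__main__":
    model = SiameseMultiModalSegmenter(num_classes=9)
    rgb = torch.randn(1, 3, 64, 64)
    thermal = torch.randn(1, 3, 64, 64)  # e.g. thermal replicated to 3 channels
    print(model(rgb, thermal).shape)     # torch.Size([1, 9, 64, 64])
```

The key structural ideas this sketch preserves are weight sharing between the two branches and fusing the branch features before decoding; in Sigma, both the per-branch feature extraction and the fusion are performed with state space blocks rather than convolutions.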

Sigma is thoroughly evaluated on both RGB-Thermal and RGB-Depth segmentation benchmarks, showcasing its superior performance and marking the first successful application of State Space Models (SSMs) to multi-modal perception tasks. The authors detail the model architecture, training process, and evaluation results, demonstrating Sigma’s potential for a range of applications in computer vision and robotics.