• Author(s): Weihao Yu, Xinchao Wang

Mamba, an architecture featuring an RNN-like token mixer based on the state space model (SSM), was introduced to address the quadratic complexity of the attention mechanism and has since been applied to vision tasks. However, Mamba's performance in vision often falls short of that of convolutional and attention-based models.

This paper examines the fundamental nature of Mamba, concluding that it is best suited to tasks with long-sequence and autoregressive characteristics. Image classification aligns with neither characteristic, suggesting that Mamba is unnecessary for this task. Detection and segmentation, while not autoregressive, do exhibit the long-sequence characteristic, warranting further exploration of Mamba's potential there.

To test these hypotheses, a series of models named MambaOut is constructed by stacking Mamba blocks while removing their core token mixer, the SSM. Experimental results confirm our hypotheses: MambaOut models outperform all visual Mamba models on ImageNet image classification, indicating that the SSM is unnecessary for this task. For detection and segmentation, MambaOut models do not match state-of-the-art visual Mamba models, highlighting Mamba's potential for long-sequence visual tasks.
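The construction above can be sketched as follows. This is a minimal, hypothetical simplification in NumPy, not the paper's exact implementation: it assumes a Mamba-style gated block whose expansion layer splits into a gate path and a depthwise-convolution path, with the SSM that would normally process the convolution path simply omitted (the defining change in MambaOut). All layer sizes, the activation choice, and the helper names are illustrative assumptions.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Per-channel 1D convolution with 'same' zero padding.

    x: (seq_len, channels); kernel: (k, channels), one filter per channel.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        out[i] = (xp[i:i + k] * kernel).sum(axis=0)
    return out

def mambaout_block(x, w_in, w_out, conv_kernel):
    """Hypothetical MambaOut-style block: a gated conv block with no SSM.

    x: (seq_len, d); w_in: (d, 2h) expands and splits into gate + conv paths;
    w_out: (h, d) projects back; conv_kernel: (k, h) depthwise filters.
    """
    h = w_in.shape[1] // 2
    u = x @ w_in
    gate, v = u[:, :h], u[:, h:]
    v = depthwise_conv1d(v, conv_kernel)   # token mixing via depthwise conv only
    # In a full Mamba block, an SSM would process v at this point;
    # MambaOut deliberately omits it.
    y = (gate * np.tanh(v)) @ w_out        # gating, then project back to d
    return x + y                           # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))               # 16 tokens, width 8 (illustrative)
out = mambaout_block(
    x,
    rng.normal(size=(8, 32)),              # expand to 2h = 32 (h = 16)
    rng.normal(size=(16, 8)),
    rng.normal(size=(3, 16)),              # kernel size 3, depthwise over h
)
```

The block keeps the surrounding structure of a Mamba block (expansion, gating, depthwise convolution, residual) so that any performance gap versus visual Mamba models can be attributed to the presence or absence of the SSM token mixer itself.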