• Author(s) : Yiwen Tang, Jiaming Liu, Dong Wang, Zhigang Wang, Shanghang Zhang, Bin Zhao, Xuelong Li

Large foundation models have recently gained significant attention due to their superior performance across a wide range of scenarios. However, the scarcity of 3D data has led researchers to adapt pre-trained transformers from vision to 3D domains. While these 2D-to-3D approaches have shown promise, they are limited by the potential loss of spatial geometries and high computational costs. Moreover, their frameworks are primarily designed for 2D models, lacking a general any-to-3D paradigm.

To address these challenges, a novel method called Any2Point has been introduced. Any2Point is a parameter-efficient approach that empowers large pre-trained models from any modality (vision, language, audio) for 3D understanding. The method uses a 3D-to-any (1D or 2D) virtual projection strategy that correlates each input 3D point with positions in the source modality's original 1D or 2D coordinate space. This allows each 3D token to be assigned a positional encoding from the pre-trained model, avoiding the loss of 3D geometry that an actual projection would cause, while still guiding the transformer toward 3D learning with its 1D/2D positional priors.
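To make the idea concrete, here is a minimal sketch of such a 3D-to-2D virtual projection. All names, shapes, and the choice of random view planes are illustrative assumptions, not the paper's actual implementation: each 3D point is mapped onto a few virtual view planes, the frozen model's 2D positional embedding is looked up at the projected coordinate, and the results are averaged.

```python
import numpy as np

def virtual_2d_projection(points, pos_embed_2d, grid_size, num_views=3):
    """Hypothetical sketch of a 3D-to-2D virtual projection.

    points:        (N, 3) array of 3D coordinates
    pos_embed_2d:  (grid_size, grid_size, D) frozen 2D positional embeddings
    Returns:       (N, D) positional encodings, averaged over virtual views
    """
    rng = np.random.default_rng(0)
    embeds = np.zeros((points.shape[0], pos_embed_2d.shape[-1]))
    for _ in range(num_views):
        # Build a random orthonormal 2D basis for this virtual view plane.
        basis, _ = np.linalg.qr(rng.standard_normal((3, 2)))
        uv = points @ basis                       # (N, 2) planar coordinates
        uv = uv - uv.min(axis=0)
        uv = uv / (uv.max(axis=0) + 1e-8)         # normalize to [0, 1]
        idx = np.clip((uv * (grid_size - 1)).round().astype(int),
                      0, grid_size - 1)
        # Only a coordinate lookup: points are never rasterized, so no
        # 3D geometry is discarded.
        embeds += pos_embed_2d[idx[:, 0], idx[:, 1]]
    return embeds / num_views
```

Note that the points themselves are never rendered into an image; only their coordinates are mapped, which is why the strategy is called "virtual" and why the 3D structure is preserved.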

Within each transformer block, Any2Point inserts an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, facilitating the semantic adaptation of any-modality transformers to the 3D domain.
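A bottleneck adapter with spatially guided aggregation could be sketched as follows. The function names, shapes, k-NN neighborhood choice, and residual form are assumptions for illustration; only the down-project/aggregate/up-project residual pattern reflects the general adapter idea described above:

```python
import numpy as np

def guided_adapter(tokens, coords, down_w, up_w, k=4):
    """Illustrative any-to-3D guided adapter (assumed shapes and design).

    tokens: (N, D) token features from the frozen transformer block
    coords: (N, 3) 3D coordinates of the tokens
    down_w: (D, r) learnable down-projection, r << D
    up_w:   (r, D) learnable up-projection
    """
    # k-nearest neighbors in 3D: the spatial prior guiding aggregation.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]        # (N, k) neighbor indices
    hidden = np.maximum(tokens @ down_w, 0)   # bottleneck + ReLU
    local = hidden[nn].mean(axis=1)           # aggregate each local group
    return tokens + local @ up_w              # residual adapter output
```

Because only `down_w` and `up_w` are trained while the transformer weights stay frozen, the number of tunable parameters is a small fraction of the full model, which is the source of the method's parameter efficiency.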

Extensive experiments demonstrate the effectiveness and efficiency of Any2Point: large models from various modalities are successfully adapted for 3D understanding while fine-tuning only a small fraction of their parameters. By leveraging pre-trained transformers and adapting them to 3D domains, Any2Point opens up new possibilities for 3D understanding tasks.

The introduction of Any2Point marks a significant step forward in the field of 3D understanding, providing a general any-to-3D paradigm that can be applied to large models from any modality. This innovative approach has the potential to unlock new applications and improve the performance of 3D understanding tasks across various domains.