Integrating Large Language Models (LLMs) with three-dimensional environments presents significant challenges. Traditional methods extract point clouds either from accurate ground-truth geometry or from 3D scenes reconstructed by auxiliary models, lift text-image-aligned 2D features from models such as CLIP onto these point clouds, and feed the resulting per-point features to LLMs. However, this approach often fails to capture point-to-point 3D geometric structure, losing spatial information that is critical for scene understanding. Moreover, because the geometric and semantic representations of a scene are built separately rather than jointly, 3D scene comprehension remains suboptimal.
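
For concreteness, the sketch below illustrates the feature-lifting step of this conventional pipeline. It is a minimal sketch under assumed shapes and a nearest-pixel sampling rule, not the exact procedure of any particular method; the function name and toy data are illustrative.

```python
import torch

def lift_2d_features_to_points(feat_map: torch.Tensor,
                               pixel_uv: torch.Tensor) -> torch.Tensor:
    """feat_map: (C, H, W) per-pixel features from a 2D model such as CLIP.
    pixel_uv: (N, 2) integer (u, v) pixel coordinates of N visible 3D points.
    Returns (N, C) per-point features via nearest-pixel lookup."""
    C, H, W = feat_map.shape
    u = pixel_uv[:, 0].clamp(0, W - 1)
    v = pixel_uv[:, 1].clamp(0, H - 1)
    # Advanced indexing gathers a (C, N) block; transpose to (N, C).
    return feat_map[:, v, u].t()

# Toy usage: 512-dim features on a 24x32 grid, 1000 projected points.
feats = torch.randn(512, 24, 32)
uv = torch.stack([torch.randint(0, 32, (1000,)),   # u in [0, W)
                  torch.randint(0, 24, (1000,))],  # v in [0, H)
                 dim=1)
point_feats = lift_2d_features_to_points(feats, uv)  # (1000, 512)
```

Each point inherits only the 2D feature of the pixel it projects to, which is why the relationships between points themselves are left unmodeled.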

To address these issues, this paper introduces Uni3DR², a framework that unifies the scene representation and reconstruction needed for effective LLM interaction in 3D environments. Uni3DR² uses frozen pre-trained 2D foundation models, such as CLIP and SAM, to extract geometrically and semantically aware features, and processes them with a multi-scale aggregate 3D decoder to produce unified 3D representations. These representations drive the reconstruction process and, at the same time, provide LLMs with a geometrically structured, semantically rich view of the scene.
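
As a rough structural sketch of this design, not the paper's implementation, the code below shows how frozen 2D encoders, a multi-scale 3D decoder, a reconstruction head, and LLM-ready tokens could fit together. All module definitions, layer sizes, the voxel grid, and the substitute for back-projection are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrozenEncoderStub(nn.Module):
    """Stand-in for a frozen 2D foundation model (e.g. CLIP or SAM)."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(3, out_dim, kernel_size=4, stride=4)
        for p in self.parameters():
            p.requires_grad = False  # encoders stay frozen, as in the paper

    def forward(self, x):            # (B, 3, H, W) -> (B, out_dim, H/4, W/4)
        return self.conv(x)

class MultiScale3DDecoder(nn.Module):
    """Toy decoder that refines voxel features over several scales."""
    def __init__(self, in_dim: int, dims=(64, 32, 16)):
        super().__init__()
        blocks, prev = [], in_dim
        for d in dims:
            blocks.append(nn.Sequential(nn.Conv3d(prev, d, 3, padding=1),
                                        nn.ReLU(inplace=True)))
            prev = d
        self.blocks = nn.ModuleList(blocks)
        self.occupancy_head = nn.Conv3d(prev, 1, 1)  # geometry for reconstruction

    def forward(self, voxel_feats):  # (B, C, D, H, W)
        x = voxel_feats
        for block in self.blocks:
            x = block(x)
        return x, torch.sigmoid(self.occupancy_head(x))

# Toy forward pass with random data.
clip_stub, sam_stub = FrozenEncoderStub(32), FrozenEncoderStub(32)
frames = torch.randn(4, 3, 64, 64)                                   # 4 RGB views
feats_2d = torch.cat([clip_stub(frames), sam_stub(frames)], dim=1)   # (4, 64, 16, 16)

# Back-projecting feats_2d into a voxel grid needs camera poses and depth;
# a random 8^3 grid stands in for that step to keep the sketch self-contained.
voxel_feats = torch.randn(1, 64, 8, 8, 8)
decoder = MultiScale3DDecoder(in_dim=64)
scene_repr, occupancy = decoder(voxel_feats)        # (1, 16, 8, 8, 8), (1, 1, 8, 8, 8)
llm_tokens = scene_repr.flatten(2).transpose(1, 2)  # (1, 512, 16) tokens for the LLM
```

The key design point this sketch mirrors is that a single volumetric representation feeds both the reconstruction head and the LLM, rather than maintaining separate geometric and semantic pipelines.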

Empirical evaluations demonstrate that Uni3DR² achieves notable improvements over existing methods. On the 3D reconstruction dataset ScanNet, Uni3DR² improves the F-Score by 1.8%. Moreover, when paired with an LLM, Uni3DR²-LLM delivers superior performance on the 3D vision-language understanding dataset ScanQA, improving BLEU-1 by 4.0% on the validation set and 4.2% on the test set. It also surpasses the current leading method, which relies on additional ground-truth point clouds, on both the ScanQA and 3DMV-VQA datasets.

In conclusion, Uni3DR² advances the state of the art in 3D scene understanding and interaction for LLMs, paving the way for more capable and practical applications in fields such as robotics and autonomous navigation.