Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
- Published on May 10th, 2024, 4:58 am
- Editor: Yuvraj Singh
- Author(s): Rui Li, Tobias Fischer, Mattia Segu, Marc Pollefeys, Luc Van Gool, Federico Tombari
Recovering 3D scene geometry from a single view is a long-standing challenge in computer vision. Traditional depth estimation methods infer only a 2.5D scene representation confined to the image plane, while more recent techniques based on radiance fields can reconstruct a full 3D representation. However, these methods often struggle with occluded regions, since inferring geometry without visual observation requires both semantic knowledge of the surroundings and reasoning about spatial context.
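To make the 2.5D-versus-3D distinction concrete, here is a minimal NumPy sketch, not code from the paper: unprojecting a depth map yields exactly one point per pixel (the visible surface), whereas a density field can be queried at any 3D point, including occluded ones. The `toy_density_field` and all numbers are placeholders.

```python
import numpy as np

# A depth map assigns one depth per pixel, so unprojecting it recovers only
# the visible surface: a 2.5D representation tied to the image plane.
def unproject_depth(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3): one point per pixel

# A density field instead maps *any* 3D point to a density value, so it can
# also represent geometry that is occluded in the input view.
def toy_density_field(points):
    # Placeholder field: a solid unit sphere at the origin.
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(np.float32)

depth = np.full((4, 4), 2.0)                     # a flat wall 2 m away
surface = unproject_depth(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
hidden = np.array([[0.0, 0.0, 0.5]])             # a point the camera cannot see
print(surface.shape)                             # (4, 4, 3) -- surface only
print(toy_density_field(hidden))                 # [1.] -- field answers anywhere
```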
KYN (Know Your Neighbors), a new method for single-view scene reconstruction, addresses this issue by reasoning about semantic and spatial context when predicting the density of each point. It introduces a vision-language modulation module that enriches point features with fine-grained semantic information, and it aggregates point representations across the scene through a language-guided spatial attention mechanism, yielding per-point density predictions that are aware of the 3D semantic context.
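The paper's exact architecture is not reproduced here, but a minimal PyTorch sketch can illustrate the two ingredients just described: modulating per-point features with a language embedding (a FiLM-style scale-and-shift is assumed for illustration) and attending over scene points before predicting density. All module names, dimensions, and design choices below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VLModulation(nn.Module):
    """Illustrative vision-language modulation: condition per-point visual
    features on a language embedding via an assumed FiLM-style scale/shift."""
    def __init__(self, feat_dim, lang_dim):
        super().__init__()
        self.to_scale = nn.Linear(lang_dim, feat_dim)
        self.to_shift = nn.Linear(lang_dim, feat_dim)

    def forward(self, point_feats, lang_emb):
        # point_feats: (..., feat_dim); lang_emb: (..., lang_dim)
        return point_feats * self.to_scale(lang_emb) + self.to_shift(lang_emb)

class LanguageGuidedAttention(nn.Module):
    """Illustrative aggregation: each point attends to other scene points
    using the language-modulated features, then a head predicts density."""
    def __init__(self, feat_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.density_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Softplus())

    def forward(self, feats):
        # feats: (B, N, feat_dim); attention mixes context across the N points
        ctx, _ = self.attn(feats, feats, feats)
        return self.density_head(ctx)  # (B, N, 1), non-negative densities

# Toy usage with made-up dimensions.
feats = torch.randn(2, 128, 64)   # 2 scenes, 128 sampled points, 64-d features
lang = torch.randn(2, 128, 32)    # assumed per-point language embeddings
density = LanguageGuidedAttention(64)(VLModulation(64, 32)(feats, lang))
print(density.shape)              # torch.Size([2, 128, 1])
```

The Softplus head simply keeps the predicted densities non-negative, mirroring common radiance-field practice; the key point of the sketch is that each point's density depends on its language-enriched neighbors rather than on its own feature alone.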
Experiments demonstrate that KYN recovers 3D shape better than predicting the density of each 3D point in isolation. It achieves state-of-the-art scene and object reconstruction on the KITTI-360 dataset and generalizes better in zero-shot settings than prior work.