• Author(s): Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, Shubham Tulsiani

MVD-Fusion is introduced as a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods for 3D inference advocate learning novel-view generative models, the resulting generations are not 3D-consistent and require a distillation process to produce a 3D output.

MVD-Fusion, on the other hand, frames 3D inference as directly generating multiple mutually consistent views. It builds on the insight that inferring depth provides a mechanism for enforcing this consistency. Specifically, a denoising diffusion model is trained to generate multi-view RGB-D images given a single RGB input image, and the intermediate noisy depth estimates are leveraged to obtain reprojection-based conditioning that maintains multi-view consistency.
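To make the reprojection-based conditioning idea concrete, below is a minimal sketch of the underlying geometry: unprojecting a (possibly noisy) source-view depth estimate into 3D and projecting the points into a target view, which yields per-pixel correspondences that could be used to gather aligned evidence for the target view's denoising step. This is an illustrative assumption of how such conditioning can be computed, not the paper's exact implementation; the function name, the shared pinhole intrinsics `K`, and the source-to-target transform `T_src2tgt` are hypothetical.

```python
import numpy as np

def reproject_depth(depth_src, K, T_src2tgt):
    """Illustrative reprojection step (not MVD-Fusion's actual code).

    depth_src: (H, W) depth estimate for the source view (e.g., an
               intermediate noisy prediction from the diffusion model).
    K:         (3, 3) pinhole intrinsics, assumed shared by both views.
    T_src2tgt: (4, 4) rigid transform from source to target camera frame.
    Returns:   (H, W, 2) pixel coordinates of each source pixel in the
               target view, usable for feature gathering / warping.
    """
    H, W = depth_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T                           # camera rays (z = 1)
    pts_src = rays * depth_src[..., None]                     # 3D points in source frame
    pts_h = np.concatenate([pts_src, np.ones((H, W, 1))], -1)  # homogeneous coords
    pts_tgt = pts_h @ T_src2tgt.T                             # points in target frame
    proj = pts_tgt[..., :3] @ K.T                             # project with intrinsics
    return proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)  # perspective divide
```

In practice, coordinates like these could drive a grid-sampling operation over the source view's features so that each target view is denoised with geometrically aligned conditioning, which is the mechanism the depth prediction enables.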

The model is trained on the large-scale synthetic dataset Objaverse, as well as the real-world CO3D dataset, which contains captures from generic camera viewpoints. This approach is shown to yield more accurate synthesis than recent state-of-the-art methods, including distillation-based 3D inference and prior multi-view generation methods.

The geometry induced by MVD-Fusion's multi-view depth predictions is also evaluated and found to be more accurate than that produced by other direct 3D inference approaches, suggesting that MVD-Fusion offers a meaningful advance for single-view 3D inference.