Neural fields have emerged as a transformative tool in computer vision and robotics. Their ability to capture the complexities of the 3D visual world has been pivotal: from 2D images alone they can infer semantics, geometry, and dynamics to construct detailed 3D scene representations. This proficiency raises a compelling question: can neural fields benefit from self-supervised pre-training, specifically with masked autoencoders, to yield more effective 3D representations from posed RGB images?

Following the successful adoption of transformers across many data types, researchers have begun combining standard 3D Vision Transformers with Neural Radiance Fields (NeRFs). NeRFs stand out among 3D representations: unlike point clouds, which suffer from irregular sampling and inconsistent information density, a NeRF can be queried to produce a volumetric grid that serves as a uniform, dense input to the transformer.
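To make this concrete, a trained NeRF can be evaluated on a regular lattice of points to obtain such a grid. The sketch below is a minimal illustration, assuming a `nerf_model` callable that returns RGB and density for a batch of 3D points; the function name, signature, and resolution are assumptions, not the paper's actual interface.

```python
import torch

def nerf_to_voxel_grid(nerf_model, resolution=160, bounds=(-1.0, 1.0)):
    """Query a trained NeRF on a regular 3D lattice to obtain a dense
    4-channel volume (RGB + density) that can feed a 3D transformer.
    `nerf_model` is assumed to map (N, 3) points to (N, 3) RGB and (N, 1) density."""
    lo, hi = bounds
    axis = torch.linspace(lo, hi, resolution)
    # Build an (R, R, R, 3) grid of query coordinates, then flatten to (R^3, 3).
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    points = grid.reshape(-1, 3)

    rgb_chunks, sigma_chunks = [], []
    with torch.no_grad():
        for chunk in points.split(65536):       # query in chunks to bound memory
            rgb, sigma = nerf_model(chunk)      # assumed callable signature
            rgb_chunks.append(rgb)
            sigma_chunks.append(sigma)

    rgb = torch.cat(rgb_chunks).reshape(resolution, resolution, resolution, 3)
    sigma = torch.cat(sigma_chunks).reshape(resolution, resolution, resolution, 1)
    # Stack channels last, then move them first: a (4, R, R, R) uniform, dense grid.
    return torch.cat([rgb, sigma], dim=-1).permute(3, 0, 1, 2)
```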

Applying masked autoencoders to the implicit representation of NeRFs presents a notable challenge. To address it, an explicit representation is distilled from the NeRF by using the camera trajectory to sample the scene, which normalizes scenes across different domains. Random patches of NeRF's radiance and density grid are then masked and reconstructed with a standard 3D Swin Transformer, enabling the model to learn the semantic and spatial structure of complete scenes.
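A rough sketch of this masking-and-reconstruction step is shown below, assuming the explicit grid has already been extracted. The `encoder_decoder` module, patch size, and mask ratio are placeholders standing in for the 3D Swin Transformer and the paper's actual masking details.

```python
import torch
import torch.nn.functional as F

def mae_pretraining_step(volume, encoder_decoder, patch=4, mask_ratio=0.75):
    """One masked-autoencoder step on an explicit NeRF grid.
    `volume` is (B, 4, D, H, W) holding RGB + density; `encoder_decoder` stands in
    for a 3D Swin Transformer encoder plus a lightweight reconstruction decoder."""
    B, C, D, H, W = volume.shape
    d, h, w = D // patch, H // patch, W // patch            # size of the patch grid

    # Randomly decide which 3D patches to hide (1 = keep, 0 = masked).
    keep = (torch.rand(B, 1, d, h, w, device=volume.device) > mask_ratio).float()
    voxel_keep = F.interpolate(keep, scale_factor=patch, mode="nearest")

    # Zero out the masked patches and reconstruct the full grid.
    recon = encoder_decoder(volume * voxel_keep)

    # Compute the reconstruction loss only on the hidden voxels, as in standard MAE.
    hidden = 1.0 - voxel_keep
    loss = ((recon - volume) ** 2 * hidden).sum() / hidden.sum().clamp(min=1.0)
    return loss
```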

This explicit representation is pretrained on an extensive dataset of posed RGB images, amounting to over 1.6 million images. After pretraining, the encoder is used for 3D transfer learning. The resulting approach, NeRF-MAE, the first self-supervised pre-training method for NeRFs, scales remarkably well and substantially improves performance on a range of challenging 3D tasks. Notably, for 3D object detection on datasets such as Front3D and ScanNet, NeRF-MAE outperforms existing self-supervised 3D pre-training methods and NeRF scene-understanding baselines, delivering improvements of over 20% AP50 and 8% AP25.
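For the transfer step, one would typically load the pretrained encoder weights into a downstream backbone and fine-tune it together with a task-specific head. The sketch below is illustrative only: the checkpoint file, its key layout, and the stand-in encoder and head are assumptions, not the released NeRF-MAE interface.

```python
import torch
import torch.nn as nn

class Swin3DEncoder(nn.Module):
    """Stand-in for the pretrained 3D Swin encoder (the real architecture is larger)."""
    def __init__(self, in_channels=4, dim=96):
        super().__init__()
        self.stem = nn.Conv3d(in_channels, dim, kernel_size=4, stride=4)

    def forward(self, x):
        return self.stem(x)

encoder = Swin3DEncoder()
state = torch.load("nerf_mae_pretrained.pth", map_location="cpu")   # assumed checkpoint name
encoder.load_state_dict(state["encoder"], strict=False)             # reuse pretrained weights

# Attach a task head (here a toy per-voxel box-regression head) and fine-tune
# end-to-end on the labeled downstream dataset, e.g. Front3D or ScanNet.
head = nn.Conv3d(96, 7, kernel_size=1)
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)
```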