This paper introduces an approach for reconstructing the 3D world and multiple dynamic humans from a single monocular video. The authors leverage the recently developed 3D Gaussian Splatting (3D-GS) representation, which allows the environment and the human subjects to be composed and rendered efficiently.
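As a rough illustration of how such a composition can work, the sketch below concatenates a background Gaussian set with per-human Gaussian sets into one renderable set. The GaussianSet container and its field names are illustrative assumptions, not the paper's actual data structures.

```python
import torch
from dataclasses import dataclass
from typing import List

@dataclass
class GaussianSet:
    # Minimal 3D-GS parameter block; field names are illustrative, not the paper's API.
    means: torch.Tensor      # (N, 3) Gaussian centers
    scales: torch.Tensor     # (N, 3) per-axis extents
    rotations: torch.Tensor  # (N, 4) quaternions
    opacities: torch.Tensor  # (N, 1)
    colors: torch.Tensor     # (N, 3) RGB (or SH coefficients in a full system)

def compose(scene: GaussianSet, humans: List[GaussianSet]) -> GaussianSet:
    """Concatenate background and posed-human Gaussians into one set for rendering."""
    parts = [scene] + humans
    return GaussianSet(
        means=torch.cat([p.means for p in parts]),
        scales=torch.cat([p.scales for p in parts]),
        rotations=torch.cat([p.rotations for p in parts]),
        opacities=torch.cat([p.opacities for p in parts]),
        colors=torch.cat([p.colors for p in parts]),
    )
```

Because the composite is simply a larger Gaussian set, a standard 3D-GS rasterizer can render it in a single pass.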
A key challenge addressed in this work is the limited and sparse 3D observation available for each subject, a common issue in real-world captures. To tackle it, the authors propose an optimization technique that fuses the sparse cues in a canonical space, using a pre-trained 2D diffusion model to synthesize unseen views while remaining consistent with the observed 2D appearances.
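A minimal sketch of what such an objective could look like is given below: a photometric loss on pixels actually observed in the video, combined with a score-distillation-style term that uses a pre-trained 2D diffusion model to guide renders of unseen canonical-space views. The function names, the noise_level weighting, and the epsilon-prediction interface of diffusion_score_fn are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def reconstruction_loss(rendered_rgb, observed_rgb, visibility_mask):
    """L1 photometric loss restricted to pixels that are visible in the input video."""
    diff = visibility_mask * (rendered_rgb - observed_rgb).abs()
    return diff.sum() / visibility_mask.sum().clamp(min=1)

def sds_guidance(canonical_render, diffusion_score_fn, noise_level=0.3):
    """Score-distillation-style term: pull unseen canonical-space renders toward the
    image distribution modeled by a pre-trained 2D diffusion model (epsilon-prediction)."""
    noise = torch.randn_like(canonical_render)
    noisy = canonical_render + noise_level * noise
    predicted_noise = diffusion_score_fn(noisy)
    # Standard SDS trick: detach the residual so it acts as a fixed gradient
    # direction on the render when this term is backpropagated.
    return ((predicted_noise - noise).detach() * canonical_render).sum()

def total_loss(rendered_rgb, observed_rgb, visibility_mask,
               canonical_render, diffusion_score_fn, sds_weight=0.1):
    """Fuse observed-view supervision with diffusion guidance in canonical space."""
    return (reconstruction_loss(rendered_rgb, observed_rgb, visibility_mask)
            + sds_weight * sds_guidance(canonical_render, diffusion_score_fn))
```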

The proposed method reconstructs high-quality, animatable 3D humans in challenging situations such as occlusions, cropped images, few-shot settings, and extremely sparse observations. Once the reconstruction is complete, the scene can be rendered from any novel viewpoint and time instance, and edited by removing individual humans or applying a different motion to each human subject.
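Removing a person or retargeting motion then reduces to editing the per-human Gaussian sets before composition. The snippet below reuses the compose sketch above; deform_to_pose is a hypothetical stand-in for the paper's canonical-to-posed deformation, shown only in a comment.

```python
# Drop the second human (index 1) from the scene, then re-render the composite.
edited = compose(scene, [h for i, h in enumerate(humans) if i != 1])

# Retarget motion: re-pose a canonical human with a different motion sequence.
# deform_to_pose is a hypothetical helper, not part of the paper's released code.
# humans[0] = deform_to_pose(canonical_humans[0], new_motion[frame_t])
```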

Through extensive experiments, the authors show that their approach outperforms existing alternatives in both quality and efficiency. The technique enables more accurate and versatile 3D reconstruction of dynamic humans and environments from monocular video, with applications in virtual and augmented reality, motion capture, and visual effects.