• Author(s): Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki

The paper, titled “DreamScene4D”, introduces a novel approach to generating three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos. It does so by leveraging existing Vision-Language Models (VLMs), which can track 2D objects across video frames, together with current generative models, whose powerful visual priors for novel-view synthesis help constrain the highly under-constrained 2D-to-3D object lifting.

The key insight of this paper is the design of a “decompose-then-recompose” scheme to factorize both the whole video scene and each object’s 3D motion. The video scene is first decomposed using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is then mapped to a set of 3D Gaussians that deform and move in space and time.
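To make the decomposition step more concrete, below is a minimal sketch of the kind of per-object representation it implies: an amodally completed mask track paired with a set of 3D Gaussians and per-frame deformations. The class and field names (`ObjectTrack`, `GaussianSet`, etc.) are hypothetical illustrations, not identifiers from the paper's code.

```python
# Hypothetical per-object representation for the "decompose" step.
from dataclasses import dataclass
import torch


@dataclass
class GaussianSet:
    """A set of 3D Gaussians in the object-centric (canonical) frame."""
    means: torch.Tensor       # (N, 3) Gaussian centers
    scales: torch.Tensor      # (N, 3) per-axis extents
    rotations: torch.Tensor   # (N, 4) unit quaternions
    opacities: torch.Tensor   # (N, 1)
    colors: torch.Tensor      # (N, 3)


@dataclass
class ObjectTrack:
    """One segmented, amodally completed object across T video frames."""
    amodal_masks: torch.Tensor   # (T, H, W) completed segmentation masks
    rgb_crops: torch.Tensor      # (T, H, W, 3) masked object appearance
    gaussians: GaussianSet       # canonical 3D Gaussians for this object
    deformations: torch.Tensor   # (T, N, 3) per-frame offsets on the means


def deformed_means(track: ObjectTrack, t: int) -> torch.Tensor:
    """Gaussian centers at frame t, still in the object-centric frame."""
    return track.gaussians.means + track.deformations[t]
```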

To handle fast motion, the observed motion is factorized into several components. Camera motion is inferred by re-rendering the background to match the video frames. Object motion is split further: each object's non-rigid deformation is modeled in an object-centric frame using rendering losses and multi-view generative priors, and the object-centric-to-world-frame transformations are then optimized by comparing the rendered outputs against the observed pixels and optical flow.
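The sketch below illustrates this factorization under simplifying assumptions: world-frame Gaussian positions are obtained by composing the object-centric deformation with a per-frame rigid object-to-world transform, which is fitted against photometric and optical-flow evidence. `render_fn`, the flat `params` list, and the plain Adam loop are placeholders for illustration, not the paper's actual renderer or optimization schedule.

```python
# Hedged sketch of the motion factorization and world-frame fitting.
import torch
import torch.nn.functional as F


def compose_motion(canonical_means, deformation_t, R_t, t_vec):
    """World-frame Gaussian centers at one frame.

    canonical_means: (N, 3) object-centric Gaussian centers
    deformation_t:   (N, 3) non-rigid offsets for this frame
    R_t:             (3, 3) object-to-world rotation
    t_vec:           (3,)   object-to-world translation
    """
    object_centric = canonical_means + deformation_t
    return object_centric @ R_t.T + t_vec


def fit_world_transforms(frames, flows, render_fn, params, steps=500, lr=1e-2):
    """Fit per-frame object-to-world transforms with pixel and flow losses.

    render_fn(params, t) -> (rendered_rgb, rendered_flow) is assumed to be a
    differentiable Gaussian rasterizer; `params` is a list of learnable
    tensors (in practice rotations would be parameterized, e.g. as quaternions).
    """
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        loss = 0.0
        for t in range(len(frames) - 1):
            rgb_hat, flow_hat = render_fn(params, t)
            loss = loss + F.l1_loss(rgb_hat, frames[t])    # photometric term
            loss = loss + F.l1_loss(flow_hat, flows[t])    # optical-flow term
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return params
```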

Finally, the background and objects are recomposed, with the relative object scales optimized using monocular depth prediction as guidance (a minimal sketch of this step is given below). The paper presents extensive results on the challenging DAVIS and Kubric benchmarks as well as self-captured videos, discusses limitations, and outlines future directions. Interestingly, the results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, even though it was never explicitly trained for this task. This paper is a significant contribution to the field of 4D scene generation.
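As an illustration of the depth-guided recomposition mentioned above, here is a hypothetical sketch in which a scalar scale per object is fitted so that each object's rendered depth agrees with a monocular depth prediction inside its visible mask. The function name, the per-object depth inputs, and the simple L1 alignment are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical depth-guided fitting of relative object scales.
import torch
import torch.nn.functional as F


def depth_guided_scales(object_depths, object_masks, mono_depth,
                        steps=300, lr=1e-2):
    """Optimize one scale factor per object against predicted scene depth.

    object_depths: list of (H, W) rendered depth maps, one per object
    object_masks:  list of (H, W) boolean visibility masks
    mono_depth:    (H, W) monocular depth prediction for the full frame
    """
    # Optimize in log space so scales stay positive.
    log_scales = torch.zeros(len(object_depths), requires_grad=True)
    optimizer = torch.optim.Adam([log_scales], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for i, (depth, mask) in enumerate(zip(object_depths, object_masks)):
            # Rescaling an object about the camera origin scales its depth,
            # so a per-object multiplier is a crude stand-in for that effect.
            scaled = depth * log_scales[i].exp()
            loss = loss + F.l1_loss(scaled[mask], mono_depth[mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return log_scales.exp().detach()
```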