• Author(s): Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

“4D Panoptic Scene Graph Generation” introduces the 4D Panoptic Scene Graph (PSG-4D), a representation intended to bridge the gap between raw visual data perceived in a dynamic 4D world and high-level visual understanding. PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations between those entities.

To support research in this area, the authors built a richly annotated PSG-4D dataset of 3,000 RGB-D videos totaling 1 million frames. Each frame is labeled with 4D panoptic segmentation masks and fine-grained, dynamic scene graphs.
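
The node-and-edge abstraction can be made concrete with a small data-structure sketch. This is only an illustration: the class names (`Node`, `Edge`, `SceneGraph4D`), their fields, and the `relations_at` helper are hypothetical and do not reflect the paper's actual annotation format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity tracked through the video (hypothetical schema)."""
    node_id: int
    category: str
    # Maps a frame index to an opaque reference to that frame's mask.
    masks: dict[int, str] = field(default_factory=dict)


@dataclass
class Edge:
    """A relation that holds over an inclusive frame interval."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int


@dataclass
class SceneGraph4D:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def relations_at(self, frame: int) -> list[tuple[str, str, str]]:
        """Return (subject, predicate, object) triplets active at a frame."""
        return [
            (
                self.nodes[e.subject_id].category,
                e.predicate,
                self.nodes[e.object_id].category,
            )
            for e in self.edges
            if e.start_frame <= frame <= e.end_frame
        ]


# Toy example: a person holds a cup from frame 10 through frame 45.
graph = SceneGraph4D()
graph.nodes[0] = Node(0, "person")
graph.nodes[1] = Node(1, "cup")
graph.edges.append(Edge(0, "holding", 1, start_frame=10, end_frame=45))
```

Storing a frame interval on each edge is one simple way to express that relations are temporal; querying `graph.relations_at(20)` returns the triplets that hold at that moment, which is also a convenient form for rendering the graph as text.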

To tackle the PSG-4D task, the authors propose PSG4DFormer, a transformer-based model that predicts panoptic segmentation masks, tracks them along the time axis, and generates the corresponding scene graphs through a relation component. Extensive experiments on the new dataset show that PSG4DFormer can serve as a strong baseline for future research on PSG-4D.
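
To give a feel for what "tracking masks along the time axis" involves, the sketch below links masks in consecutive frames by greedy intersection-over-union (IoU) matching. This is a generic association heuristic swapped in for illustration, not PSG4DFormer's tracking component; the function names, the set-of-coordinates mask format, and the 0.5 threshold are all assumptions.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two masks given as sets of coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def link_masks(prev: dict[int, set], curr: list[set],
               threshold: float = 0.5) -> dict[int, set]:
    """Give each current mask a track id by greedy IoU matching."""
    # Score every (track, mask) pair and visit the best overlaps first.
    pairs = sorted(
        ((iou(prev_mask, mask), track_id, idx)
         for track_id, prev_mask in prev.items()
         for idx, mask in enumerate(curr)),
        reverse=True,
    )
    assigned = {}        # index into curr -> existing track id
    used_tracks = set()
    for score, track_id, idx in pairs:
        if score < threshold:
            break        # remaining pairs overlap too little to match
        if track_id in used_tracks or idx in assigned:
            continue
        assigned[idx] = track_id
        used_tracks.add(track_id)

    # Masks without a match start new tracks.
    next_id = max(prev, default=-1) + 1
    result = {}
    for idx, mask in enumerate(curr):
        if idx in assigned:
            result[assigned[idx]] = mask
        else:
            result[next_id] = mask
            next_id += 1
    return result


# Toy example: two existing tracks, three masks in the next frame.
prev = {0: {(0, 0), (0, 1), (1, 0), (1, 1)}, 1: {(5, 5), (5, 6)}}
curr = [{(5, 5), (5, 6), (5, 7)}, {(0, 1), (1, 0), (1, 1), (2, 1)}, {(9, 9)}]
tracks = link_masks(prev, curr)
```

In this toy run the first two masks inherit track ids 1 and 0 respectively, and the third, which overlaps nothing, opens a new track. A learned tracker replaces the hand-set overlap rule, but the output has the same shape: a consistent identity for each entity across frames, which the relation stage can then reason over.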

The paper also provides a real-world application example, showing how dynamic scene understanding can be achieved by integrating a large language model into the PSG-4D system.

In summary, the paper contributes the PSG-4D representation, a large annotated dataset, and PSG4DFormer as a baseline model. Together, these provide resources and a reference point for future research aimed at improving AI’s ability to understand and interpret dynamic 4D environments.