• Author(s): Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing

Open-world video instance segmentation is a challenging task that aims to simultaneously segment, track, and classify objects in videos, including those from novel categories not seen during training. Traditional video instance segmentation methods operate under a closed-world assumption, only identifying objects from a fixed set of categories. They also often require additional user input or rely on region-based proposals that struggle to detect never-before-seen objects.

To address these limitations, a new approach called Open-World Video Instance Segmentation and Captioning (OW-VISCap) has been proposed. OW-VISCap introduces open-world object queries that enable discovering objects from novel categories without requiring any extra user input. These queries are encouraged to be distinct from one another through an inter-query contrastive loss, which helps avoid generating overlapping object predictions.
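As a rough illustration, the following is a minimal PyTorch sketch of what an inter-query contrastive loss could look like. The tensor shapes, temperature, and InfoNCE-style formulation (each query treated as its own positive, all other queries as negatives) are assumptions for illustration, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def inter_query_contrastive_loss(queries: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Encourage object queries to be mutually distinct.

    queries: (num_queries, dim) tensor of query embeddings.
    Hypothetical sketch: each query should be most similar to itself
    and dissimilar to every other query, which discourages duplicate
    or overlapping object predictions.
    """
    q = F.normalize(queries, dim=-1)          # unit-length query embeddings
    sim = q @ q.t() / temperature             # (N, N) pairwise cosine similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Cross-entropy with the diagonal as the positive class pushes
    # off-diagonal (inter-query) similarities down.
    return F.cross_entropy(sim, targets)
```

In practice such a loss would be added to the usual segmentation and tracking objectives with a weighting coefficient; the exact positives, negatives, and weighting used in OW-VISCap may differ.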

In addition to segmenting and tracking objects, OW-VISCap generates rich, detailed captions describing each detected object by leveraging a masked-attention-augmented language model. This yields more informative, object-centric descriptions than the single-word labels typically assigned by other methods.
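The sketch below illustrates one way masked cross-attention could restrict a caption decoder to the region of a single object: caption tokens attend only to visual features inside that object's predicted mask. The module name, tensor layout, and wiring are assumptions for illustration; the actual captioning head in OW-VISCap may be structured differently.

```python
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Cross-attention where caption tokens attend only to positions
    inside an object's predicted mask (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor,
                image_feats: torch.Tensor,
                object_mask: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, T, D) caption token embeddings from the language model
        # image_feats:  (B, HW, D) flattened per-frame visual features
        # object_mask:  (B, HW) boolean, True inside the object's predicted mask
        #               (each mask is assumed to contain at least one True entry)
        key_padding_mask = ~object_mask  # True = this position is ignored
        out, _ = self.attn(
            query=text_tokens,
            key=image_feats,
            value=image_feats,
            key_padding_mask=key_padding_mask,  # restrict attention to the object
        )
        return out
```

Conditioning the language model on mask-restricted attention in this way is what keeps each generated caption focused on a single object rather than the whole frame.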

Impressively, this generalized OW-VISCap framework matches or outperforms state-of-the-art methods across three different video understanding tasks:

  1. Open-world video instance segmentation on the BURST dataset
  2. Dense video object captioning on the VidSTG dataset
  3. Traditional closed-world video instance segmentation on the OVIS dataset

This demonstrates the effectiveness and versatility of the open-world query-based approach used in OW-VISCap for tackling a range of video understanding problems, from open-world settings to standard closed-world scenarios. It represents an important step towards more flexible and capable video understanding systems that can identify both known and unknown objects in complex real-world videos.