• Authors: Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang

The paper “On the Content Bias in Fréchet Video Distance” examines the Fréchet Video Distance (FVD), a widely used metric for evaluating video generation models. Although FVD is a standard benchmark metric, its scores sometimes conflict with human perception. The paper measures how strongly FVD favors per-frame quality over temporal realism and identifies the source of that bias.
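For readers who have not worked with the metric, the following is a minimal sketch of the Fréchet distance that FVD is built on: fit a Gaussian to the features of real videos and another to the features of generated videos, then compare the two. The function name `frechet_distance` and the NumPy-only eigenvalue shortcut are choices made for this illustration, not the reference FVD implementation, and the video feature extractor itself is left out.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    # atleast_2d keeps the covariance a matrix even when dim == 1.
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    # Clipping removes tiny negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of (numerically) zero, and shifting every feature by a constant vector c while leaving the covariance unchanged gives the squared length of c. Everything the metric "sees" therefore depends on what the features encode, which is the question the paper goes on to examine.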

To quantify FVD’s sensitivity to the temporal axis, the authors decouple frame quality from motion quality and find that FVD increases only slightly even under severe temporal corruption. In other words, FVD weights the quality of individual frames far more heavily than the temporal coherence of the video.
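One way to picture "decoupling" frame quality from motion quality is a corruption that leaves every frame untouched but destroys their order. The sketch below uses frame shuffling on toy clips; this summary does not specify which corruptions the authors applied, so `shuffle_frames`, `mean_frame_change`, and the sliding-square data are all assumptions made for illustration.

```python
import numpy as np


def shuffle_frames(videos: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Independently permute the frame order of each video in a (N, T, H, W, C) batch."""
    corrupted = np.empty_like(videos)
    for i, video in enumerate(videos):
        corrupted[i] = video[rng.permutation(video.shape[0])]
    return corrupted


def mean_frame_change(videos: np.ndarray) -> float:
    """Average absolute difference between consecutive frames, a crude motion proxy."""
    return float(np.abs(np.diff(videos, axis=1)).mean())


rng = np.random.default_rng(0)

# Toy "videos": a bright square sliding smoothly to the right over 16 frames.
videos = np.zeros((8, 16, 32, 32, 1), dtype=np.float32)
for n in range(8):
    for t in range(16):
        videos[n, t, 8:16, t:t + 8, 0] = 1.0

corrupted = shuffle_frames(videos, rng)

# Each clip still contains exactly the same frames, so per-frame content is unchanged...
same_content = np.allclose(videos.mean(axis=1), corrupted.mean(axis=1))
# ...but consecutive frames no longer follow a smooth trajectory.
smooth, jumpy = mean_frame_change(videos), mean_frame_change(corrupted)
```

A metric that is sensitive to motion should react strongly to `corrupted`, since only the temporal structure has changed. The paper's finding is that FVD reacts only slightly to corruptions of this kind.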

The authors also show that carefully selecting samples from a large pool of generated videos that lack motion can substantially lower FVD without any improvement in temporal quality. This is further evidence that FVD rewards per-frame quality rather than temporal realism.

The authors attribute this bias to the features FVD relies on, which are extracted from a supervised video classifier trained on a content-biased dataset. They demonstrate that when the features come from recent large-scale self-supervised video models instead, FVD becomes less biased toward image quality and more sensitive to temporal quality.
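The role of the feature extractor can be illustrated with a toy comparison. In the sketch below, `content_only_features` and `motion_aware_features` are hypothetical stand-ins invented for this example; they are not the supervised classifier or the self-supervised models discussed in the paper. The point is only that the same temporally corrupted clips can look identical or clearly different depending on which features are compared.

```python
from typing import Callable

import numpy as np

# A feature extractor maps a (N, T, H, W) video batch to (N, D) features.
FeatureExtractor = Callable[[np.ndarray], np.ndarray]


def content_only_features(videos: np.ndarray) -> np.ndarray:
    """Toy stand-in for a content-biased extractor: the time-averaged frame."""
    return videos.mean(axis=1).reshape(len(videos), -1)


def motion_aware_features(videos: np.ndarray) -> np.ndarray:
    """Toy stand-in for a motion-sensitive extractor: adds mean frame-to-frame change."""
    change = np.abs(np.diff(videos, axis=1)).mean(axis=1)
    return np.concatenate(
        [content_only_features(videos), change.reshape(len(videos), -1)], axis=1
    )


def mean_feature_gap(
    real: np.ndarray, fake: np.ndarray, extractor: FeatureExtractor
) -> float:
    """Euclidean distance between mean features (the mean term of a Fréchet
    distance, before squaring)."""
    gap = extractor(real).mean(axis=0) - extractor(fake).mean(axis=0)
    return float(np.linalg.norm(gap))


# Toy data: a square sliding right; each clip uses a different row offset.
real = np.zeros((4, 16, 32, 32))
for n in range(4):
    for t in range(16):
        real[n, t, 4 * n:4 * n + 8, t:t + 8] = 1.0

# Temporal corruption: interleave the first and second halves of each clip.
order = np.arange(16).reshape(2, 8).T.ravel()  # 0, 8, 1, 9, ...
fake = real[:, order]

content_gap = mean_feature_gap(real, fake, content_only_features)
motion_gap = mean_feature_gap(real, fake, motion_aware_features)
```

Here `content_gap` is zero because reordering frames does not change the time-averaged frame, while `motion_gap` is clearly positive. Passing the extractor as a parameter mirrors the paper's proposal at a very small scale: the distance computation stays the same and only the features are swapped.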

To validate this explanation, the authors revisit real-world examples and confirm that the content bias appears in practice. The analysis shows the limits of relying on FVD alone and argues for an evaluation that considers per-frame quality and temporal realism together.

These findings matter for how video generation models are developed and compared. Researchers and practitioners should interpret FVD scores with caution and consider complementary metrics that better capture the temporal quality of generated videos.

In conclusion, “On the Content Bias in Fréchet Video Distance” offers a critical analysis of FVD as a metric for video generation. By quantifying the bias toward per-frame quality and tracing it to the feature extractor, the authors clarify why generated video is hard to evaluate and point toward more robust metrics that align better with human perception.