• Author(s): Nathaniel Cohen, Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, Tomer Michaeli

The paper “Slicedit: Text-based Video Editing Using Pretrained Text-to-Image Diffusion Models” addresses the challenge of using pretrained text-to-image (T2I) diffusion models for video editing. T2I diffusion models achieve state-of-the-art results in image synthesis and editing. Applying them to video, however, is difficult because the edited frames must remain temporally consistent, especially in the presence of strong nonrigid motion.

Existing methods often try to enforce temporal consistency through explicit correspondence mechanisms, either in pixel space or between deep features. These approaches struggle with strong nonrigid motion, which leads to inconsistencies in the edited video. The paper instead builds on the observation that spatiotemporal slices of natural videos (2D cross-sections that span time and one spatial axis) exhibit similar characteristics to natural images. As a result, the same T2I diffusion model that is normally used as a prior on video frames can also be applied to spatiotemporal slices to improve temporal consistency.

The proposed method, named Slicedit, utilizes a pretrained T2I diffusion model to process both spatial and spatiotemporal slices of the video. This approach enables the generation of edited videos that retain the original video’s structure and motion while adhering to the target text. Slicedit’s ability to maintain temporal consistency and handle nonrigid motion sets it apart from existing methods.
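
To make the notion of spatial and spatiotemporal slices concrete, here is a minimal NumPy sketch of how each could be cut from a video array. It only illustrates the slicing itself and is not the Slicedit pipeline. The `(T, H, W, C)` array layout, the helper names `spatial_frame`, `xt_slice`, and `yt_slice`, and the toy video are assumptions made for illustration; this summary does not state which slice orientation the paper uses.

```python
import numpy as np

# Assumed layout: video[t, y, x, c] with shape (T, H, W, C).

def spatial_frame(video, t):
    """Ordinary frame at time t, shape (H, W, C)."""
    return video[t]

def xt_slice(video, y):
    """Spatiotemporal slice at fixed row y, shape (T, W, C)."""
    return video[:, y]

def yt_slice(video, x):
    """Spatiotemporal slice at fixed column x, shape (T, H, C)."""
    return video[:, :, x]

# Toy video: a bright vertical bar that moves one pixel right per frame.
T, H, W, C = 5, 4, 6, 3
video = np.zeros((T, H, W, C), dtype=np.uint8)
for t in range(T):
    video[t, :, t, :] = 255

frame = spatial_frame(video, 0)   # shape (4, 6, 3)
xt = xt_slice(video, 0)           # shape (5, 6, 3)
yt = yt_slice(video, 2)           # shape (5, 4, 3)

# In the x-t slice the moving bar traces a diagonal line, so motion
# over time shows up as 2D image-like structure.
print(xt[..., 0])
```

Each slice is a 2D array with color channels, so it can in principle be passed to any model that expects an image, which is the property the method relies on.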

The authors report extensive experiments in which Slicedit edits a wide range of real-world videos and shows clear advantages over competing approaches. By applying a pretrained T2I diffusion model to spatiotemporal slices as well as to frames, Slicedit addresses the temporal consistency problem that has limited text-based video editing with T2I models.