Shuchen Weng†1 Haojie Zheng†2,3 Peixuan Zhang4 Yuchen Hong5,6
Han Jiang3 Si Li4 Boxin Shi‡5,6
1 Beijing Academy of Artificial Intelligence
2 School of Software and Microelectronics, Peking University 3 OpenBayes Inc, Beijing, China
4 School of Artificial Intelligence, Beijing University of Posts and Telecommunications
5 National Engineering Research Center of Visual Technology, School of Computer Science, Peking University
6 National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Abstract
We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings. Project page: https://suimuc.github.io/suimu.github.io/projects/VIRES/
1 Introduction
Video is a crucial medium for people to capture and communicate their experiences and ideas. Traditionally, maintaining temporal consistency when editing videos requires specialized skills and considerable effort. With the advancements in diffusion models [34, 6], recent text-guided video editing methods [42, 27] attract considerable interest and offer the potential to revolutionize filmmaking and entertainment through a controllable approach.
While text can convey general video editing goals, it often leaves considerable room for interpretation of the user's intent (e.g., logo details on clothing). To improve on this, recent works [18, 47, 14] introduce additional guidance (e.g., sketches) for more precise control. However, relying on pre-trained text-to-image models (e.g., Stable Diffusion [31]), these zero-shot methods struggle with temporal consistency, inevitably resulting in flickering. As an alternative, VideoComposer [38] fine-tunes the text-to-image model and incorporates temporal modeling layers. However, its focus on compositionality limits accurate alignment with the provided sketch sequence, resulting in suboptimal fine-grained instance-level edits.
In this paper, we propose VIRES, a Video Instance REpainting method with Sketch and text guidance. As illustrated in Fig. 1, VIRES facilitates four application scenarios to repaint instances within the provided video (i.e., video instance repainting, replacement, generation, and removal). Leveraging the generative priors of text-to-video models (i.e., Open-Sora [52]), our approach maintains temporal consistency and produces visually pleasing results. To provide precise user control over instance properties, we design tailored modules that process the sketch for structure (e.g., pose and movement) and the text for appearance (e.g., color and style). Specifically, we present the Sequential ControlNet to effectively extract structure layouts and the standardized self-scaling to adaptively capture high-contrast structure details. Additionally, to interpret and inject fine-grained sketch semantics into the latent space, we augment the Diffusion Transformer (DiT) backbone with the sketch attention. Finally, the sketch-aware encoder aligns repainted results with multi-level texture features during latent code decoding.
We further contribute the VireSet, a new dataset for training and evaluating video instance editing methods. VireSet provides 85K training videos and 1K evaluation videos, sourced from SA-V [30]. For each video, we generate text annotations with a large language model (LLM) [45] and extract sketch sequences using HED edge detection [44]. Our contributions are summarized as follows:
- We propose the first DiT-based framework for controllable video instance editing, along with a tailored dataset for training and evaluating relevant editing methods.
- We present the Sequential ControlNet and the standardized self-scaling to effectively extract structure layouts and adaptively capture high-contrast sketch details.
- We introduce the sketch attention to interpret and inject fine-grained sketch semantics, and the sketch-aware encoder to align repainted results with multi-level texture features.
2 Related work
2.1 Diffusion-based image generation
Diffusion models [34, 6] have demonstrated remarkable advancements in image generation [31, 33]. These models learn to reverse the forward diffusion process, generating images by iteratively denoising a sample from a standard Gaussian distribution. To control this generation process, researchers use adapter mechanisms to incorporate additional conditions [50, 22]. Leveraging the generative priors encapsulated in pre-trained generation models, users can customize content via fine-tuning [32, 17] and edit arbitrary images via training-free strategies [21, 24]. These advancements expand the application of diffusion models to various image-based tasks, e.g., super resolution [39], image colorization [41], and reflection removal [8]. Given the similar spatial representations between images and videos, researchers are inspired to effectively lift these pre-trained image priors to the video domain.
2.2 Diffusion-based video generation
Early text-to-video generation works attempt to extend the UNet-based denoising network with space-only 3D convolutions, training it from scratch [7]. However, this approach proves computationally expensive, leading researchers [36, 5] to explore fine-tuning pre-trained text-to-image models [31]. This fine-tuning typically introduces additional temporal modules (e.g., temporal convolutions [35] or temporal attention [2]) to improve temporal consistency, while freezing most of the pre-trained denoising network weights to preserve spatial generative priors. To further reduce generation costs, researchers explore zero-shot text-guided video editing that operates solely during the inference phase [14, 51]. Recently, impressive text-to-video generation results from OpenAI [20] have renewed interest in training models from scratch, but this time leveraging the diffusion transformer architecture [52, 53, 48]. Given the significant improvements demonstrated on video generation benchmarks [10], we build upon this foundation and develop modules for controllable video generation.
3 Dataset
Existing datasets [25, 1] face challenges in providing precise instance masks due to temporal inconsistencies (e.g., occlusions and reappearances) and variations in visual appearance (e.g., complex motion, lighting, and scale). Therefore, we present VireSet, a dataset with high-quality instance masks tailored for training and evaluating video instance editing methods.
Data acquisition. Our initial video samples are sourced from SA-V [30], a dataset that includes indoor (54%) and outdoor (46%) scenes recorded across 47 countries by diverse participants. The original videos have an average duration of 14 seconds, with multiple manual and automatic instance mask annotations provided at 6 FPS. SA-V adopts a data engine with a verification step to ensure high-quality instance mask annotations. To further improve temporal consistency for video instance editing, we leverage the pre-trained SAM-2 model [30] to annotate the intermediate frames, increasing the mask annotation rate to 24 FPS.
Instance selection. We filter instances to include only those covering at least 10% of the frame area. Additionally, we only use instances present for at least 51 consecutive frames, from which we randomly sample 51-frame clips. Each selected instance is then cropped from the original video with a small margin around its bounding box and resized to 512×512 resolution.
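To make the selection criteria concrete, a minimal Python sketch of the filtering and cropping procedure is given below. The 10% area threshold and the 51-frame length come from the text; the margin ratio, output size, and helper names are illustrative assumptions.

```python
import numpy as np
import cv2

MIN_AREA_RATIO = 0.10   # instance must cover >= 10% of the frame
CLIP_LEN = 51           # sampled clip length in frames
MARGIN = 0.1            # relative margin around the bounding box (assumed value)

def longest_valid_run(masks):
    """Return (start, end) of the longest run of frames where the instance
    covers at least MIN_AREA_RATIO of the frame, or None if no run is long enough."""
    valid = [m.sum() / m.size >= MIN_AREA_RATIO for m in masks]
    best, cur_start, cur_len = None, None, 0
    for i, v in enumerate(valid + [False]):          # sentinel flushes the last run
        if v:
            cur_start = i if cur_len == 0 else cur_start
            cur_len += 1
        else:
            if cur_len >= CLIP_LEN and (best is None or cur_len > best[1] - best[0]):
                best = (cur_start, cur_start + cur_len)
            cur_len = 0
    return best

def crop_clip(frames, masks, start, out_size=512):
    """Crop a CLIP_LEN-frame clip around the instance bounding box and resize it.
    `frames` and `masks` are lists of per-frame numpy arrays."""
    clip_frames = frames[start:start + CLIP_LEN]
    clip_masks = masks[start:start + CLIP_LEN]
    ys, xs = np.nonzero(np.any(np.stack(clip_masks), axis=0))
    h, w = clip_frames[0].shape[:2]
    dy, dx = int(MARGIN * (ys.max() - ys.min())), int(MARGIN * (xs.max() - xs.min()))
    y0, y1 = max(ys.min() - dy, 0), min(ys.max() + dy, h)
    x0, x1 = max(xs.min() - dx, 0), min(xs.max() + dx, w)
    return [cv2.resize(f[y0:y1, x0:x1], (out_size, out_size)) for f in clip_frames]
```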
Additional annotations. We adopt the pre-trained PLLaVA [45] to generate text descriptions for each cropped video clip, capturing their visual appearance. To ensure quality, we recruit 10 volunteers to review 1% of the samples, achieving a 91% acceptance rate. Following VideoComposer [38], we extract sketch sequences using HED edge detection [44] to provide the structure guidance.
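For the sketch annotation step, the following sketch illustrates how HED edges could be extracted per frame. It assumes the third-party controlnet_aux wrapper and the lllyasviel/Annotators weights, which are convenient stand-ins rather than the exact tooling used by the authors.

```python
from PIL import Image
from controlnet_aux import HEDdetector  # third-party HED wrapper (assumed dependency)

# Load a pre-trained HED edge detector; "lllyasviel/Annotators" hosts the weights.
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")

def extract_sketch_sequence(frame_paths):
    """Run HED on every frame of a clip to obtain the sketch sequence."""
    sketches = []
    for path in frame_paths:
        frame = Image.open(path).convert("RGB")
        sketches.append(hed(frame))   # returns a PIL image of soft edges
    return sketches
```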
Dataset statistics. The final VireSet includes 85K video clips for training and 1K video clips for evaluation. Each clip consists of 51 frames at 24 FPS with a resolution of 512×512, accompanied by a sketch sequence for the structure and a text description for the appearance.
4 Methodology
This section begins with an overview of the framework (Sec. 4.1). We then delve into the proposed modules: the Sequential ControlNet with the standardized self-scaling (Sec. 4.2), the sketch-based DiT backbone with the sketch attention (Sec. 4.3), and the sketch-aware encoder that improves the decoding process (Sec. 4.4).
4.1 Overview
Given an $N$-frame original video clip $x$, a corresponding sketch sequence $s$ presenting the structure, a text description $y$ describing the appearance, and an instance mask sequence $m$ indicating the regions to repaint, VIRES repaints the specified video instance. The overall pipeline is illustrated in Fig. 2.
Input encoding. The original video clip $x$ is encoded into the latent code $z$ using the pre-trained spatial-temporal VAE encoder $\mathcal{E}$ [52] as $z = \mathcal{E}(x)$. The sketch sequence $s$ is processed by our proposed Sequential ControlNet $\mathcal{S}$ to effectively extract structure layouts as $f_s = \mathcal{S}(s)$. The text $y$ is encoded with the pre-trained text encoder $\tau$ [29] to obtain the word embeddings as $e = \tau(y)$.
Condition injection. To adaptively capture the high-contrast structure details of the sketch sequence, we introduce the standardized self-scaling. We further augment the DiT backbone with the sketch attention, which interprets and injects fine-grained sketch semantics into the latent space. Word embeddings are injected into the DiT backbone through the pre-trained cross-attention modules.
Forward process. We adopt the Flow Matching formulation [19] for robust and stable training. For simplicity, we consider a linear path between the sampled latent code $z$ and the Gaussian noise $\epsilon$:
$$z_t = (1 - t)\,z + t\,\epsilon \quad (1)$$
where $t \in [0, 1]$ is the timestep, with $z_0 = z$ and $z_1 = \epsilon$.
Latent masking. After concatenating the mask with the latent code, we further selectively add noise according to the instance mask, allowing the model to repaint specific instances within the video:
$$\tilde{z}_t = m \odot z_t + (1 - m) \odot z \quad (2)$$
where $m$ is the downsampled mask sequence indicating where the repainted instance is placed.
Backward process. The reverse diffusion process can be defined by the Ordinary Differential Equation (ODE):
$$\mathrm{d}z_t = v_\theta(z_t, t)\,\mathrm{d}t \quad (3)$$
where $v_\theta$ is the estimated vector field that guides the generation process. By solving the ODE from $t = 1$ to $t = 0$, we transform the noise into the latent code.
Denoising learning. We adopt the flow matching objective [19] to train the denoising network $v_\theta$. In the case of the linear path, the target velocity simplifies to $\epsilon - z$. We optimize the estimated velocity towards the target velocity by minimizing the loss:
$$\mathcal{L}_{v} = \mathbb{E}_{t, z, \epsilon}\big[\,\| v_\theta(z_t, t) - (\epsilon - z) \|_2^2\,\big] \quad (4)$$
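A minimal PyTorch sketch of one training iteration that combines Eqs. (1), (2), and (4) is shown below. The `denoiser` call signature, the tensor layout, and the unweighted MSE objective are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, z, mask, cond):
    """One flow-matching step with latent masking (Eqs. 1, 2, and 4).

    z:    clean latent code from the VAE encoder, (B, C, T, H, W)
    mask: downsampled instance mask m, 1 inside the repainted region, (B, 1, T, H, W)
    cond: conditioning inputs (sketch features, word embeddings, ...)
    """
    eps = torch.randn_like(z)                      # Gaussian noise
    t = torch.rand(z.shape[0], device=z.device)    # timestep t ~ U[0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)

    z_t = (1.0 - t_) * z + t_ * eps                # Eq. (1): linear path
    z_t = mask * z_t + (1.0 - mask) * z            # Eq. (2): noise only the instance region

    v_target = eps - z                             # target velocity for the linear path
    v_pred = denoiser(torch.cat([z_t, mask], dim=1), t, **cond)
    return F.mse_loss(v_pred, v_target)            # Eq. (4)
```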
Latent decoding. The denoised latent code is then decoded into a video clip using the pre-trained spatial-temporal VAE decoder $\mathcal{D}$ [52]. To further align the repainted results with the sketch, we introduce a sketch-aware encoder $\mathcal{E}_s$, which adopts the same VAE encoder architecture to provide multi-level texture features as the guidance during decoding:
$$\hat{x} = \mathcal{D}\big(z, \mathcal{E}_s(s)\big) \quad (5)$$
4.2 Sequential sketch feature extraction
As presented in Fig. 2 (c), we propose the Sequential ControlNet to extract structure layouts from the sketch sequence. This is followed by the standardized self-scaling, designed to adaptively capture the high-contrast structure details of the extracted features.
Sequential ControlNet. The previous ControlNet [50] is designed for image editing; it lacks temporal consistency and can produce flickering artifacts when applied to video-based tasks. This motivates us to introduce the Sequential ControlNet for video instance repainting (further variations are discussed in the Supp.). Our Sequential ControlNet includes convolutional layers, residual blocks, and downsampling layers. Each convolutional layer consists of a 3D causal convolution [49], followed by Group Normalization [43] and a SiLU activation function [3], to effectively capture spatial-temporal dependencies between frames. Residual blocks consist of two such convolutional layers with a residual connection. Downsampling layers are convolutional layers with a stride of 2, applied to compress the spatial or spatial-temporal dimensions. Since the spatial-temporal VAE encoder [52] downsamples the original video clip by 4× temporally and 8× spatially, we adopt one spatial downsampling layer and two spatial-temporal downsampling layers to match the feature map dimensions of the DiT backbone. To align the embedding channels, we progressively increase the number of channels: we widen the embedding channels in the shallow convolutional layers and double the channels at each downsampling layer. The final three layers maintain a high channel dimension to effectively extract structure layouts from the sketch sequence.
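The building blocks described above can be sketched in PyTorch as follows. The kernel size, group count, and the way strides are passed are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the temporal axis (pads only past frames)."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=(1, 1, 1)):
        super().__init__()
        self.pad_t = kernel - 1
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = nn.functional.pad(x, (0, 0, 0, 0, self.pad_t, 0))  # pad the temporal past only
        return self.conv(x)

def conv_layer(in_ch, out_ch, stride=(1, 1, 1)):
    """Causal conv -> GroupNorm -> SiLU, the basic unit of the Sequential ControlNet.
    Assumes channel counts divisible by the 32 normalization groups."""
    return nn.Sequential(CausalConv3d(in_ch, out_ch, stride=stride),
                         nn.GroupNorm(32, out_ch), nn.SiLU())

class ResidualBlock(nn.Module):
    """Two conv layers with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv_layer(ch, ch), conv_layer(ch, ch))

    def forward(self, x):
        return x + self.body(x)

# Downsampling layers are strided conv layers: spatial stride (1, 2, 2) or
# spatial-temporal stride (2, 2, 2), doubling the channel width at each step.
```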
Standardized self-scaling. Feature modulation has proven effective in conditional image editing (e.g., AdaIN [9], FiLM [26], and SPADE [23]). Observing that the sketch has high-contrast transitions between black lines and the white background, we introduce the standardized self-scaling to adaptively capture sketch details, instead of performing simple addition. Specifically, we use the sketch features $f_s$ extracted by the Sequential ControlNet and standardize them to scale the features themselves, effectively highlighting the high-contrast regions:
$$\hat{f}_s = f_s \odot \frac{f_s - \mu(f_s)}{\sigma(f_s)} \quad (6)$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and standard deviation functions, respectively. We then shift the feature domain from sketch to video by aligning their means:
$$\tilde{f}_s = \hat{f}_s - \mu(\hat{f}_s) + \mu(f_v) \quad (7)$$
where $f_v$ represents the video features. To reduce computational cost, the standardized self-scaling is applied only once, to the first transformer block of the DiT backbone.
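A compact sketch of Eqs. (6) and (7) is given below. Whether the statistics are computed globally or per channel, and how the shifted feature is finally merged into the video feature, are assumptions here rather than details confirmed by the text.

```python
import torch

def standardized_self_scaling(f_s, f_v, eps=1e-6):
    """Standardized self-scaling (Eqs. 6 and 7).

    f_s: sketch features from the Sequential ControlNet, (B, C, T, H, W)
    f_v: video features at the first DiT block, same shape
    """
    mu, sigma = f_s.mean(), f_s.std()              # global statistics (assumed reduction)
    f_hat = f_s * (f_s - mu) / (sigma + eps)       # Eq. (6): scale by the standardized self
    f_tilde = f_hat - f_hat.mean() + f_v.mean()    # Eq. (7): align means with the video domain
    return f_v + f_tilde                           # injection into the first block (assumed)
```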
4.3 Latent sketch semantic interpretation
As illustrated in Fig. 2 (d), following the standardized self-scaling that provides the high-contrast structure details, we further augment the DiT backbone with the sketch attention to interpret and inject fine-grained sketch semantics within the latent space.
DiT backbone. The VIRES framework builds upon the pre-trained text-to-video generation model [52], leveraging its DiT architecture that consists of stacked transformer blocks. Each block incorporates separate spatial and temporal self-attention modules to capture intra-frame and inter-frame dependencies, respectively. In each module, Self-Attention (SA) and a Feed-Forward Network (FFN) extract contextual features, Cross-Attention (CA) injects semantics from word embeddings, and the scale and shift (S&S) and the gate modulate the latent code with timestep embeddings. We concatenate the latent code with the instance mask before feeding it to the transformer, and patchify both the latent code and the sketch sequence into non-overlapping tokens, resulting in an additional spatial downsampling.
Sketch attention. To interpret and inject sketch semantics into the latent space, we augment the DiT backbone with the sketch attention within each spatial self-attention module except for the first. The sketch attention incorporates a predefined binary matrix $M$ to indicate correspondences between the latent code and the sketch sequence:
$$\mathrm{SketchAttn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top + M}{\sqrt{d}}\right) V \quad (8)$$
where $Q$ and $K, V$ are transformed features from the video features $f_v$ and the extracted structure layouts $f_s$, respectively, and $d$ is the number of embedding channels. Sketch attention is implemented as a parallel branch, and its outputs are added with a learnable scaling parameter $\alpha$, allowing adaptive weighting of the injected sketch semantics.
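A possible PyTorch reading of the sketch-attention branch is sketched below, where the binary correspondence matrix is applied as an attention mask. The masking mechanism, projection layout, and zero initialization of the scaling parameter are assumptions, not confirmed implementation details.

```python
import math
import torch
import torch.nn as nn

class SketchAttention(nn.Module):
    """Parallel sketch-attention branch added to a spatial self-attention module.

    M is a predefined binary matrix marking which latent tokens correspond to
    which sketch tokens; alpha is the learnable scaling parameter.
    """
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.alpha = nn.Parameter(torch.zeros(1))   # starts at 0: no sketch influence

    def forward(self, f_v, f_s, M):
        # f_v: latent (video) tokens, (B, N, C); f_s: sketch tokens, (B, L, C)
        # M:   binary correspondence matrix, (N, L); 1 where attention is allowed.
        # Assumes every latent token has at least one corresponding sketch token.
        q = self.to_q(f_v)
        k, v = self.to_kv(f_s).chunk(2, dim=-1)
        attn = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        attn = attn.masked_fill(M == 0, float("-inf"))
        out = attn.softmax(dim=-1) @ v
        return f_v + self.alpha * out               # residual, adaptively weighted
```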
4.4 Visual sketch texture guidance
As illustrated in Fig. 2 (e), when decoding the latent code into a video clip, we additionally adopt a sketch-aware encoder, which provides multi-level texture features to the spatial-temporal VAE decoder and aligns the structure of the repainted results with the provided sketch sequence.
Spatial-temporal VAE. We use the pre-trained spatial-temporal VAE [52] to encode the original video clip into a latent code and subsequently decode it back into a video clip. Specifically, the spatial-temporal encoder first uses a stack of 2D vanilla convolution blocks with three downsampling layers, achieving 8× spatial compression. Following this, 3D causal convolution blocks capture spatial-temporal dependencies, with two downsampling layers applied only to the temporal dimension, resulting in 4× temporal compression. The spatial-temporal VAE decoder mirrors the architecture of the encoder, using 3D causal convolutions for spatial-temporal modeling and upsampling the temporal dimension before the spatial dimensions.
Sketch-aware encoder. When decoding the latent code into a video clip, the spatial-temporal VAE decoder can produce numerous visually pleasing results. However, lacking guidance from the sketch, many results may not align with the expected visual representations. Therefore, we introduce the sketch-aware encoder to guide this process. Specifically, the sketch-aware encoder adopts the same architecture as the VAE encoder, extracting multi-level texture features before each downsampling layer. These features are then added to the decoder at each corresponding level:
$$f_{\mathcal{D}, l} \leftarrow f_{\mathcal{D}, l} + f_{\mathcal{E}_s, l} \quad (9)$$
where the subscript $l$ denotes the feature level. We train the sketch-aware encoder by minimizing a combination of an SSIM loss [40] for structure alignment, an L1 loss for reconstruction fidelity, a perceptual loss [12] for sharp details, and a KL loss [16] for regularization. The combined loss is calculated for each frame and then summed over all frames:
$$\mathcal{L}_{\mathcal{E}_s} = \sum_{n}\Big(\mathcal{L}_{\mathrm{SSIM}}^{(n)} + \lambda_1 \mathcal{L}_{1}^{(n)} + \lambda_2 \mathcal{L}_{\mathrm{per}}^{(n)} + \lambda_3 \mathcal{L}_{\mathrm{KL}}^{(n)}\Big) \quad (10)$$
where $n$ indexes the frames, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters that are not sensitive to variations within a certain range.
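The per-frame loss of Eq. (10) can be sketched as follows, using LPIPS as a readily available VGG-based stand-in for the perceptual loss [12]. The loss weights and the exact form of the KL term are illustrative assumptions.

```python
import torch
import lpips                                     # perceptual loss (assumed dependency)
from torchmetrics.functional import structural_similarity_index_measure as ssim

lpips_fn = lpips.LPIPS(net="vgg")                # VGG-based perceptual distance

def frame_loss(pred, target, mean, logvar, lam1=1.0, lam2=1.0, lam3=1e-6):
    """Per-frame training loss of the sketch-aware encoder (Eq. 10).

    pred/target: decoded and ground-truth frames in [0, 1], (B, 3, H, W)
    mean/logvar: posterior statistics of the sketch-aware encoder for the KL term
    lam1..lam3:  hyper-parameters; the values here are placeholders
    """
    l_ssim = 1.0 - ssim(pred, target)                            # structure alignment
    l_1 = torch.nn.functional.l1_loss(pred, target)              # reconstruction fidelity
    l_per = lpips_fn(pred * 2 - 1, target * 2 - 1).mean()        # sharp details (expects [-1, 1])
    l_kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())  # regularization
    return l_ssim + lam1 * l_1 + lam2 * l_per + lam3 * l_kl

# The total loss sums frame_loss over all frames of the clip.
```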
5 Experiment
Training details. We initialize our spatial-temporal VAE and DiT backbone with pre-trained weights from OpenSora v1.2 (https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3) and train the VIRES model to repaint videos at a 512×512 resolution. The training process consists of three steps: (i) with the pre-trained spatial-temporal VAE encoder and decoder frozen, we train the sketch-aware encoder for 22K steps; (ii) with both VAEs and the pre-trained DiT backbone frozen, we train the Sequential ControlNet, the standardized self-scaling, and the output matrices of the self-attention layers for 35K steps; and (iii) after augmenting the DiT backbone with the sketch attention, we train the model for an additional 45K steps. All experiments are conducted on 8 H100 GPUs using the Adam optimizer [15] with a fixed learning rate.
Evaluation datasets. We evaluate our approach on two datasets: our collected VireSet and the widely-used DAVIS [25]. 50 videos are randomly selected from each dataset for evaluation. The condition annotation process for DAVIS is consistent with that of VireSet, and details are presented in Sec. 3.
5.1 Quantitative evaluation metrics
We quantitatively evaluate performance using five metrics: (i) the Peak Signal-to-Noise Ratio (PSNR) [11] for visual perceptual quality; (ii) the Structural Similarity Index Measure (SSIM) [40] for spatial structure consistency; (iii) the Warp Error (WE), calculated by warping repainted video clips with the estimated optical flow, following VidToMe [18], to evaluate motion accuracy; (iv) the Frame Consistency (FC), measuring cosine similarity between consecutive frames for temporal consistency using the CLIP image encoder [28]; and (v) the Text Consistency (TC), measuring cosine similarity between repainted video clips and text descriptions for text-video alignment using VideoCLIP-XL [37].
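As an illustration of the FC metric, the following sketch averages CLIP cosine similarities between consecutive frames; the specific CLIP checkpoint is an assumption and may differ from the one used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_consistency(frame_paths):
    """Average CLIP cosine similarity between consecutive frames (the FC metric)."""
    inputs = processor(images=[Image.open(p) for p in frame_paths], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)       # unit-normalize embeddings
    return (feats[:-1] * feats[1:]).sum(dim=-1).mean().item()
```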
5.2 Comparison with state-of-the-art methods
We compare our approach with state-of-the-art video editing methods for generating 51-frame video clips, including VideoComposer [38], Text2Video-Zero [14], Rerender [47], VidToMe [18], and RAVE [13], using their publicly released models with default configurations.
Qualitative comparisons. We present visual quality comparisons with the aforementioned methods [38, 14, 47, 18, 13]. As shown in Fig. 3, the left column edits instance appearance using text descriptions while preserving structure with extracted sketches. Conversely, the right column maintains the appearance using annotated text descriptions and edits structure via sketch modifications. As a result, Rerender [47] produces excessively smooth textures (e.g., Fig. 3 left, the unnatural smoothness of the horse and tree). VidToMe [18] generates a cartoonish effect (e.g., Fig. 3 left, the stylized appearance of the entire frame). Text2Video-Zero [14] presents an inaccurate color tone (e.g., Fig. 3 right, the white jersey of the player). RAVE [13] suffers from inter-frame flickering (e.g., Fig. 3 right, the player's neck flashing with frame brightness fluctuations). VideoComposer [38] struggles to follow the target structure (e.g., Fig. 3 right, the missing sleeves of the player). In contrast, our VIRES faithfully repaints the video instance with the sketch and text guidance.
Quantitative comparisons. We show comprehensive quantitative comparison results in Tab. 1, demonstrating that VIRES outperforms relevant state-of-the-art methods across all five quantitative metrics on both datasets. Specifically, VIRES achieves the best results in visual perceptual quality (PSNR), spatial structure consistency (SSIM), frame motion accuracy (WE), consecutive frame consistency (FC), and text description consistency (TC).
User study. In addition to qualitative and quantitative comparisons, we conduct two user studies to evaluate human preference for our results: (i) Visual Quality Evaluation (VQE): participants are shown the edited results produced by VIRES and relevant video editing methods, and are asked to select the most visually pleasing video clip. (ii) Text Alignment Evaluation (TAE): given a corresponding text description, participants are instructed to select the video clip from the same set of edited results that best matches the description. For each experiment, we randomly select 10 samples from each dataset and recruit 25 volunteers from Amazon Mechanical Turk (AMT) to provide independent evaluations. As shown in Tab. 1, our model achieves the highest preference scores in both experiments.
5.3 Ablation study
We discard several modules and establish four baselines to study the impact of the corresponding modules. The evaluation scores and repainted results of the ablation study are presented in Tab.1 and Fig.4, respectively.
W/o Standardized Self-Scaling (SSS). We replace the standardized self-scaling with simple addition when initially injecting the structure layouts. This leads to distorted textures in the repainted results (Fig. 4 third column, an unusual texture appears around the fish's eye).
W/o Self-Scaling (SS). We discard the self-scaling but preserve the standardization of the extracted structure layouts before adding them to the DiT backbone. As a result, the model struggles to capture structure details (Fig. 4 second column, the fish head appears less detailed).
W/o Sketch Attention (SA). We remove the sketch attention in the DiT backbone, thereby removing the third stage of training. This prevents the model from accurately interpreting fine-grained sketch semantics (Fig. 4 fourth column, abrupt yellow patches appear on the fish's body).
W/o Sketch-aware Encoder (SE). We disable the sketch-aware encoder that provides multi-level features to guide the decoding process, resulting in repainted results that are not well aligned with the expected visual representations (Fig. 4 first column, the fish body appears blurry and flat).
5.4 Application
In addition to the four application scenarios illustrated in Fig. 1 (i.e., video instance repainting, video instance replacement, custom instance generation, and specified instance removal), we present three additional applications.
Sketch-to-video generation. VIRES is capable of generating entire video clips from sketch sequences with text guidance, not limited to repainting specific instances. This is achieved by discarding the original video content and expanding the instance mask to the entire frame, enabling training-free sketch-to-video generation (e.g., Fig. 5 first row, generating a realistic lotus flower from a provided sketch sequence).
Sparse sketch guidance. To reduce user effort when repainting video instances, we incorporate sparse sketch guidance into VIRES. Instead of requiring a full sketch sequence, we fine-tune VIRES using two strategies: (i) random dropout of sketch frames, with 20% probability; and (ii) splitting the video into $k$ intervals, with 80% probability, where $k$ is a randomly selected interval index. For each interval, only the first sketch frame is provided, with missing frames replaced by black images. As a result, VIRES can repaint video instances with even a single sketch frame (e.g., Fig. 5 second row, repainting the pattern on the person's clothes with only the first sketch frame).
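The two fine-tuning strategies can be summarized in the following sketch. The per-frame drop rate inside strategy (i) and the exact interval-boundary computation are assumptions; only the 20%/80% split, the interval index $k$, and the black placeholder frames come from the text.

```python
import random

def sparsify_sketch(sketch_frames, blank_frame, drop_p=0.2):
    """Sparse-sketch fine-tuning: with probability drop_p, randomly drop individual
    sketch frames; otherwise split the clip into k intervals and keep only the first
    sketch frame of each interval (missing frames become blank placeholders)."""
    n = len(sketch_frames)
    out = list(sketch_frames)
    if random.random() < drop_p:                     # strategy (i): random dropout
        for i in range(n):
            if random.random() < 0.5:                # per-frame drop rate is an assumption
                out[i] = blank_frame
    else:                                            # strategy (ii): interval splitting
        k = random.randint(1, n)                     # randomly selected interval index
        starts = [round(j * n / k) for j in range(k)]
        out = [blank_frame] * n
        for s in starts:
            out[s] = sketch_frames[s]                # keep the first frame of each interval
    return out
```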
Long-duration video repainting. Due to computational resource limitations, processing a full long-duration video at once is challenging. However, VIRES can effectively repaint long-duration videos by randomly unmasking frames during training (e.g., the leading frames of a clip). Specifically, we initially repaint the first 51-frame video clip, then leverage the last 17 frames of this clip as a hint to iteratively repaint the subsequent 34-frame clips until the entire video is processed. We present the results in the Supp.
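A sketch of this iterative scheme under the stated 51/17/34 frame budget is shown below; `repaint_fn` is a hypothetical stand-in for one VIRES inference call, and handling of any leftover tail frames is omitted.

```python
def repaint_long_video(repaint_fn, sketches, clip_len=51, hint_len=17):
    """Iteratively repaint a long video: the first window repaints 51 frames; each
    subsequent window reuses the last 17 repainted frames as an unmasked hint and
    produces 34 new frames. repaint_fn returns a list of repainted frames."""
    out = list(repaint_fn(sketches[:clip_len], hint_frames=None))   # first 51 frames
    step = clip_len - hint_len                                      # 34 new frames per window
    start = step
    while start + clip_len <= len(sketches):
        window = repaint_fn(sketches[start:start + clip_len],
                            hint_frames=out[-hint_len:])            # condition on last 17 frames
        out.extend(window[hint_len:])                               # keep only the new frames
        start += step
    return out
```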
6 Conclusion
In this paper, we present VIRES, a Video Instance REpainting method with Sketch and text guidance. We build VIRES upon a pre-trained text-to-video generation model to maintain temporal consistency. We propose the Sequential ControlNet with the standardized self-scaling to effectively extract structure layouts and adaptively capture high-contrast sketch details. We further augment the DiT backbone with the sketch attention to interpret and inject fine-grained sketch semantics, and introduce a sketch-aware encoder for structure alignment during decoding. For training and evaluating video editing methods, we contribute VireSet. In addition to video instance repainting, replacement, generation, and removal, VIRES is applicable to sketch-to-video generation, sparse sketch guidance, and long-duration video repainting. Extensive experiments demonstrate that VIRES outperforms existing video editing methods and receives higher ratings from human evaluators.
Limitation. Our VIRES is based on the DiT architecture [52], which leads to higher computational requirements. For example, editing a 51-frame video at 512×512 resolution takes approximately 323 seconds on a single A6000 GPU, which is about 1.6 times slower than other diffusion-based methods [14, 38]. We believe faster sampling methods will reduce the inference time in the future.
7 Appendix
7.1 Variations of Sequential ControlNet
We present four typical architectures of the Sequential ControlNet in Tab. 2, where "Conv" denotes a convolutional layer, "Block" denotes a residual block, and "Down" denotes a downsampling layer. Numbers in brackets indicate the input and output channel dimensions, respectively. To determine the optimal architecture under constrained computational resources, we train these variations for 10K steps, excluding the sketch attention and the sketch-aware encoder. Quantitative results on VireSet are shown in Tab. 3, and the best-performing architecture is selected for our model.
7.2 Validation of additional conditions
VIRES is designed specifically for sketch-based video instance repainting, adopting the Standardized Self-Scaling (SSS) to extract condition features. To explore its generalization to other conditioning signals, we augment our dataset with edge maps [4] and depth maps [46] to provide the structure guidance. We then retrain VIRES from scratch on these conditions. Due to limited computational resources, we train for 10K steps with edge maps as a preliminary investigation, and for 30K steps with depth maps due to their slower convergence. Finally, we compare two VIRES variants: one using SSS and the other using simple addition for feature extraction. As shown in Tab. 4, SSS provides considerable advantages for edge maps, but only marginal improvements for depth maps. We suggest that this difference arises because both sketches and edge maps have high-contrast transitions between black lines and the white background, allowing self-scaling to effectively capture structure details. In contrast, depth maps are smoother and lack such sharp transitions, limiting the benefits of the self-scaling operation. We further present qualitative results in Fig. 6.
7.3 Robustness of sparse conditions
VIRES allows users to provide sparse sketch guidance for video instance repainting, minimizing user effort. To investigate the impact of sketch guidance sparsity on repainting performance, we evaluate VIRES on VireSet using varying interval indices $k$, corresponding to $k$ sketch frames, as well as the full sketch sequence. As shown in Tab. 5, even with a single sketch frame, VIRES produces high-quality results (PSNR and SSIM) with robust temporal consistency (WE and FC).
7.4 Compatibility with DiT backbone
In this paper, we build VIRES upon the pre-trained OpenSora v1.2 [52]. Given that many text-to-video models [53, 48] share a similar DiT backbone architecture, our proposed modules offer potential compatibility with these approaches, including the Sequential ControlNet for layout extraction (Sec. 4.2), the standardized self-scaling for detail capture (Sec. 4.2), the sketch attention for semantic injection (Sec. 4.3), and the sketch-aware encoder for structure alignment (Sec. 4.4). We believe our work will inspire further research on guiding pre-trained text-to-video models and open new avenues for conditional video repainting.
7.5 Organization of supplementary video
We provide a supplementary video to dynamically showcase our repainting results. The video is structured as follows: (i) Typical application scenarios: we demonstrate four typical repainting scenarios and compare our results with relevant methods [47, 18, 14, 13, 38]. Instance repainting/replacement results are shown in Fig. 7, and instance generation/removal results in Fig. 8. (ii) Sketch-to-video generation and inpainting: we demonstrate sketch-to-video generation and conditional video inpainting, comparing our results with VideoComposer [38], as it is the only relevant method that supports this functionality. Results are shown in Fig. 9. (iii) Sparse sketch guidance: we showcase sparse sketch guidance, repainting two distinct variations of the same video using only two different first sketch frames. This functionality is not supported by existing methods. Results are shown in Fig. 10. (iv) Long-duration video repainting: we demonstrate repainting on a long-duration (13-second) video, with representative frames presented in Fig. 11. (v) Comparison and ablation study: finally, the video includes comparisons with other methods (Sec. 5.2) and additional ablation studies (Sec. 5.3). To improve visual clarity and facilitate detailed comparison, the video playback speed is halved (2× slower).
References
- Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
- Bertasius et al. [2021] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.
- Elfwing et al. [2018] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2018.
- Gonzales and Wintz [1987] Rafael C. Gonzales and Paul Wintz. Digital image processing. Addison-Wesley Longman Publishing Co., Inc., 1987.
- He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
- Ho et al. [2022] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In NeurIPS, 2022.
- Hong et al. [2024] Yuchen Hong, Haofeng Zhong, Shuchen Weng, Jinxiu Liang, and Boxin Shi. L-DiffER: Single image reflection removal with language-based diffusion model. In ECCV, 2024.
- Huang and Belongie [2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
- Huang et al. [2024] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.
- Huynh-Thu and Ghanbari [2008] Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of PSNR in image/video quality assessment. Electronics Letters, 2008.
- Johnson et al. [2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
- Kara et al. [2024] Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, and Pinar Yanardag. RAVE: Randomized noise shuffling for fast and consistent video editing with diffusion models. In CVPR, 2024.
- Khachatryan et al. [2023] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. In ICCV, 2023.
- Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kullback and Leibler [1951] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 1951.
- Kumari et al. [2023] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
- Li et al. [2024] Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. In CVPR, 2024.
- Lipman et al. [2023] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
- Liu et al. [2024] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
- Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2021.
- Mou et al. [2024] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024.
- Park et al. [2019] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
- Parmar et al. [2023] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In SIGGRAPH, 2023.
- Perazzi et al. [2016] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
- Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
- Qi et al. [2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing attentions for zero-shot text-based video editing. In ICCV, 2023.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.
- Ravi et al. [2024] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
- Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
- Kim and Reiter [2017] Tae Soo Kim and Austin Reiter. Interpretable 3D human action analysis with temporal convolutional networks. In CVPR Workshops, 2017.
- Wang et al. [2023] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-L-Video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023.
- Wang et al. [2024a] Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, and Lianwen Jin. VideoCLIP-XL: Advancing long description understanding for video CLIP models. arXiv preprint arXiv:2410.00741, 2024a.
- Wang et al. [2024b] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. VideoComposer: Compositional video synthesis with motion controllability. In NeurIPS, 2024b.
- Wang et al. [2024c] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, and Bihan Wen. SinSR: Diffusion-based image super-resolution in a single step. In CVPR, 2024c.
- Wang et al. [2004] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. TIP, 2004.
- Weng et al. [2024] Shuchen Weng, Peixuan Zhang, Yu Li, Si Li, Boxin Shi, et al. L-CAD: Language-based colorization with any-level descriptions using diffusion priors. In NeurIPS, 2024.
- Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.
- Wu and He [2018] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
- Xie and Tu [2015] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In ICCV, 2015.
- Xu et al. [2024] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See-Kiong Ng, and Jiashi Feng. PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.
- Yang et al. [2024a] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024a.
- Yang et al. [2023] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender A Video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia, 2023.
- Yang et al. [2024b] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024b.
- Yu et al. [2024] Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, et al. Language model beats diffusion: Tokenizer is key to visual generation. In ICLR, 2024.
- Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
- Zhang et al. [2024] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation. In ICLR, 2024.
- Zheng et al. [2024] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all, 2024.
- Zhou et al. [2024] Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458, 2024.