Each column corresponds to a different input video unseen during training, and each row corresponds to a different reference effect video. Our outputs are shown in the grid above.
We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target
video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned
editing, they struggle with temporal effects such as dynamic lighting changes or character transformations,
which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model
must integrate the new temporal dynamics with the input video’s existing motion and appearance. To address this,
we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image
or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially
the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated
pipeline that creates high-quality paired videos designed to preserve the input’s motion and structure while
transforming it according to a fixed, repeatable effect. We then augment this data with image-to-video effects derived
from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our
new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results
demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect
categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
TL;DR: We train a model that takes in a reference effect video and an input video or image, and transfers the effect from the reference video onto the input. To do this, we create a novel dataset of (reference effect video, input video or image, output video) triplets.
We illustrate the dataset structure and a subset of representative effects. Neural Ref Video + First Frame → Video denotes LoRA-based image-to-video effects, where videos sharing an effect are generated by applying the same LoRA adapter to multiple input images, creating (Reference Effect, Input Image, Target Video) triplets. Neural Ref Video + Video → Video corresponds to our scalable reference–input video pipeline, for which a subset of abbreviated tasks is displayed, creating (Reference Effect, Input Video, Target Video) triplets. Code-Based Ref Video + Input Video → Video represents procedurally generated effects, where each combination of effect type (e.g., Glow) and transition pattern (e.g., Fade In) is instantiated with multiple random hyperparameters, each forming a unique effect and creating additional (Reference Effect, Input Video, Target Video) triplets. We train our model on this dataset with the architecture below to produce a video editor that takes in both a reference effect video and an input video and outputs a new video in which the effect is applied to the input video.
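As a concrete illustration of how these triplets could be organized, here is a minimal Python sketch of one possible record layout; the class and field names are ours for illustration and do not correspond to a released dataset schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class EffectSource(Enum):
    LORA = "lora"              # Neural Ref Video + First Frame -> Video
    NEURAL_V2V = "neural_v2v"  # Neural Ref Video + Video -> Video
    CODE = "code"              # Code-Based Ref Video + Input Video -> Video

@dataclass
class EffectTriplet:
    reference_effect_path: str         # video showing the effect applied to other content
    input_path: str                    # input image (LoRA branch) or input video
    target_path: str                   # output video with the effect applied to the input
    source: EffectSource
    effect_name: Optional[str] = None  # e.g. "glow_fade_in" for a code-based effect
```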
(left) Standard Wan Video First–Last Frame to Video (FLF2V) architecture: noisy spatio-temporal latents are channel-wise concatenated with conditioning inputs and a mask, then patchified, embedded, and processed by the diffusion transformer to predict velocity. (right) In our setup, input video latents are used as conditioning for the noisy latents, while reference video latents are concatenated width-wise to both. The latent mask is set to 1 for frames preserved exactly in the output and 0.5 for those to be modified; the reference latent mask is all ones. This design doubles the token count relative to base generation while conditioning on both the reference effect video and input video. Since all three inputs are channel-concatenated before patchification, repeated clean reference latents are merged channel-wise before embedding, ensuring no redundant reference information across tokens.
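To make the latent packing concrete, the sketch below shows one way the width-wise and channel-wise concatenations described above could be implemented in PyTorch; the tensor names, shapes, and function interface are our assumptions based on this caption, not the actual model code.

```python
import torch

def pack_latents(noisy, cond, ref, preserve_mask):
    """Sketch of the latent packing described above (tensor names are illustrative).

    noisy, cond, ref: latent tensors of shape (B, C, T, H, W)
    preserve_mask:    (B, 1, T, H, W), 1.0 for frames kept exactly, 0.5 for frames to edit
    """
    # Width-wise concatenation: clean reference latents sit next to both the noisy
    # latents and the conditioning latents, doubling the token count.
    noisy_wide = torch.cat([noisy, ref], dim=-1)   # (B, C, T, H, 2W)
    cond_wide = torch.cat([cond, ref], dim=-1)     # repeated clean reference latents

    # Mask is 1 / 0.5 on the input half and all ones on the reference half.
    ref_mask = torch.ones_like(preserve_mask)
    mask_wide = torch.cat([preserve_mask, ref_mask], dim=-1)

    # Channel-wise concatenation before patchification, as in the FLF2V setup;
    # the repeated reference copies end up merged into the same tokens.
    return torch.cat([noisy_wide, cond_wide, mask_wide], dim=1)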
We compare our reference video + input video → output video method against baselines that take only the input video and a text prompt. Each column corresponds to a different baseline, and each row corresponds to a different reference effect video unseen during training.
We compare our reference video + first frame → output video method against I2V baselines that take only the first frame and a text prompt. Each column corresponds to a different I2V baseline, and each row corresponds to a different reference effect video unseen during training.
Given a reference effect video and an input video, we can apply variants of classifier-free guidance in the direction of (left to right of the video grid) the input video or (top to bottom of the video grid) the reference effect video to steer the editing of real videos. As more input-video guidance is applied, the output more closely resembles the input video (the top-right corner of the grid). As more reference-effect guidance is applied, the output more closely resembles the reference effect video (the bottom-left corner of the grid; notice the hair change and the sweater from the reference effect video). Interpolating between the two gives users finer control over the video editing process.
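For intuition, the sketch below shows one common way to combine two classifier-free guidance directions, one toward the input video and one toward the reference effect video; the exact decomposition used by RefVFX may differ, and the `model` interface here is hypothetical.

```python
def dual_guidance(model, z_t, t, input_vid, ref_vid, w_input, w_ref):
    """Two-axis classifier-free guidance sketch (a common formulation; the
    paper's exact decomposition may differ). w_input scales guidance toward
    the input video, w_ref toward the reference effect video."""
    v_uncond = model(z_t, t, input_cond=None, ref_cond=None)
    v_input = model(z_t, t, input_cond=input_vid, ref_cond=None)
    v_full = model(z_t, t, input_cond=input_vid, ref_cond=ref_vid)

    # Move from the unconditional prediction toward the input-video condition,
    # then from the input-only prediction toward the full (input + reference) condition.
    return (v_uncond
            + w_input * (v_input - v_uncond)
            + w_ref * (v_full - v_input))
```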
In order to train RefVFX, we create a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. We create this dataset using three different methods: a scalable automated pipeline for novel video-to-video effect triplets, programmatic composition of code-based temporal effects, and LoRA-based image-to-video effects.
We show examples of the novel video-to-video effect triplets created with our custom, scalable pipeline. To create a (Reference Effect Video, Input Video, Output Video) triplet, we run the above algorithm twice with the same reference style text prompt (e.g., "make it a painting") and different initial photos and poses, producing two video pairs (V_input_1, V_effect_1) and (V_input_2, V_effect_2). We then discard the input video from the first pair, leaving the triplet (V_effect_1, V_input_2, V_effect_2), i.e., (Reference Effect Video, Input Video, Output Video). Each row corresponds to a different novel video-to-video effect triplet from our dataset.
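A minimal sketch of this pairing step, assuming the pipeline's runs are grouped by effect prompt; the helper name and data layout are illustrative, not the released pipeline code.

```python
import itertools

def build_v2v_triplets(pairs_by_prompt):
    """pairs_by_prompt maps an effect prompt (e.g. "make it a painting") to a
    list of (input_video, effect_video) pairs produced from different initial
    photos and poses."""
    triplets = []
    for prompt, pairs in pairs_by_prompt.items():
        # Any two distinct runs of the same effect can form a triplet: the
        # effect video of run A becomes the reference, while run B supplies
        # the input video and the target (output) video.
        for (_in_a, fx_a), (in_b, fx_b) in itertools.permutations(pairs, 2):
            triplets.append({
                "reference_effect": fx_a,  # input video of run A is discarded
                "input_video": in_b,
                "output_video": fx_b,
                "prompt": prompt,
            })
    return triplets
```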
We show examples of the code-based effect triplets from our dataset. For each triplet, we apply the same code to transform two input videos into videos with the same effect. One transformed video serves as the reference effect video, and the other input–output pair serves as the input video and target video, respectively. Each row corresponds to a different code-based effect triplet from our dataset.
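As a toy example of such a code-based effect, the sketch below implements a "glow" with a "fade in" transition whose hyperparameters can be randomized per effect instance; it stands in for the actual, richer effect library.

```python
import numpy as np

def fade_in_glow(frames, peak_strength, fade_frames):
    """Toy code-based effect: a glow whose strength fades in over time.
    frames: float32 array of shape (T, H, W, 3) with values in [0, 1].
    peak_strength, fade_frames: hyperparameters randomized per effect instance."""
    out = []
    for t, frame in enumerate(frames):
        # Transition pattern: linearly fade the effect in over `fade_frames` frames.
        strength = peak_strength * min(1.0, t / max(fade_frames, 1))
        # Effect: push pixel values toward white as a crude "glow".
        glow = frame + strength * (1.0 - frame)
        out.append(np.clip(glow, 0.0, 1.0))
    return np.stack(out)

# Applying the same randomized effect instance to two different input videos
# yields the reference effect video and the input/target pair of one triplet.
```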
We show examples of the LoRA-based effect triplets from our dataset. For each triplet, we use the same LoRA adapter to generate the reference effect video and the target video from their corresponding input frames. Each row corresponds to a different LoRA-based effect triplet from our dataset.
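A minimal sketch of this generation step; `i2v_generate` is a hypothetical stand-in for an image-to-video backbone with the effect LoRA loaded, and the interface is ours for illustration.

```python
def make_lora_triplet(i2v_generate, lora_adapter, ref_frame, input_frame, prompt):
    """Build one LoRA-based triplet. The same effect LoRA drives both
    generations, so the two videos share the effect while differing in content."""
    reference_effect_video = i2v_generate(ref_frame, prompt, lora=lora_adapter)
    target_video = i2v_generate(input_frame, prompt, lora=lora_adapter)

    # Triplet: (reference effect video, input image, output video).
    return reference_effect_video, input_frame, target_video
```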