Tuning-free Visual Effect Transfer across Videos


See prompt
The video opens on a {subject}. A knife, held by a hand, is coming into frame and hovering over the {subject}. The knife then begins cutting into the {subject} to c4k3 cakeify it. As the knife slices the {subject} open, the inside of the {subject} is revealed to be cake with chocolate layers. The knife cuts through and the contents of the {subject} are revealed.
See prompt
Make it so that the beginning of the scene is unchanged, but during the video a dense, spooky green fog begins to slowly creep in at ground level on the bottom right, spreading outward and enveloping the scene in an eerie, unsettling atmosphere. the fog appears thick and unnatural, with a faint glow that adds to its ominous presence.
See prompt
The video begins with a close-up portrait of the asian man, the pu11y puppy effect then begins, as puppies begin to gather and surround the asian man. The asian man interacts with the puppies.



Each column corresponds to a different input video unseen during training, and each row corresponds to a different reference effect video. Our outputs are shown in the grid above.


Abstract

We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video’s existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input’s motion and structure while transforming it based on a fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.

tldr; We create a model that takes in a reference effect video and an input video or image, and transfers the effect from the reference video to the input. We create a novel dataset of (reference effect video, input video or image, output video) triplets to do this.

Method (Synthetic Triplet Dataset Overview)

We illustrate the dataset structure and a subset of representative effects. Neural Ref Video + First Frame → Video denotes LoRA-based image-to-video effects, where videos with the same effect are generated by using the same LoRA adapter across multiple input images to create (Reference Effect, Input Image, Target Video) triplets. Neural Ref Video + Video → Video corresponds to our scalable reference–input video pipeline, for which a subset of abbreviated tasks is displayed, creating (Reference Effect, Input Video, Target Video) triplets. Code-Based Ref Video + Input Video → Video represents procedurally generated effects, where each combination of effect type (e.g., Glow) and transition pattern (e.g., Fade In) is instantiated with multiple random hyperparameters, each forming a unique effect, creating more (Reference Effect, Input Video, Target Video) triplets. We train our model with this dataset and the architecture below to produce a video editor capable of taking in both a reference effect video and an input video, and outputting a new video where the effect is applied to the input video.
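All three data sources reduce to the same record type. The sketch below is a hypothetical schema (class and field names are ours, not the paper's) showing how one such triplet might be stored:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EffectTriplet:
    """One training example; hypothetical schema for illustration only."""
    reference_video: str           # path to the reference effect video
    input_video: Optional[str]     # input video (None for image-to-video triplets)
    input_image: Optional[str]     # input first frame (None for video-to-video triplets)
    target_video: str              # output video with the effect transferred
    source: str                    # "lora_i2v", "neural_v2v", or "code_v2v" (illustrative tags)
```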


Method (Architecture)

(left) Standard Wan Video First–Last Frame to Video (FLF2V) architecture: noisy spatio-temporal latents are channel-wise concatenated with conditioning inputs and a mask, then patchified, embedded, and processed by the diffusion transformer to predict velocity. (right) In our setup, input video latents are used as conditioning for the noisy latents, while reference video latents are concatenated width-wise to both. The latent mask is set to 1 for frames preserved exactly in the output and 0.5 for those to be modified; the reference latent mask is all ones. This design doubles the token count relative to base generation while conditioning on both the reference effect video and input video. Since all three inputs are channel-concatenated before patchification, repeated clean reference latents are merged channel-wise before embedding, ensuring no redundant reference information across tokens.
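To make the latent layout concrete, here is a minimal sketch, assuming Wan-style (B, C, T, H, W) video latents. The function name, shapes, and the zero-padding used to align channel counts are our own illustrative choices, not the exact implementation:

```python
import torch

def build_conditioned_latents(noisy_latents, input_latents, ref_latents, preserved_frame_mask):
    """Sketch of the conditioning layout: channel-concat for the input video,
    width-wise concat for the reference effect video."""
    B, C, T, H, W = noisy_latents.shape
    device = noisy_latents.device

    # Latent mask: 1.0 for frames preserved exactly in the output, 0.5 for frames to modify.
    frame_mask = 0.5 + 0.5 * preserved_frame_mask.float().to(device)        # (T,)
    mask = frame_mask.view(1, 1, T, 1, 1).expand(B, 1, T, H, W)

    # The reference latent mask is all ones.
    ref_mask = torch.ones(B, 1, T, H, W, device=device)

    # Channel-wise concatenation of noisy latents, input-video conditioning, and mask,
    # as in the FLF2V-style conditioning.
    main = torch.cat([noisy_latents, input_latents, mask], dim=1)           # (B, 2C+1, T, H, W)

    # The reference video is always clean, so its "noisy" and "conditioning" slots would
    # duplicate each other; keep a single copy and pad channels with zeros to match
    # (one possible choice -- the actual merging scheme may differ).
    ref = torch.cat([ref_latents, ref_mask], dim=1)                         # (B, C+1, T, H, W)
    pad = torch.zeros(B, main.shape[1] - ref.shape[1], T, H, W, device=device)
    ref = torch.cat([ref, pad], dim=1)                                      # (B, 2C+1, T, H, W)

    # Width-wise concatenation: the reference sits next to the main latents,
    # doubling the token count after patchification.
    return torch.cat([main, ref], dim=-1)                                   # (B, 2C+1, T, H, 2W)
```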

Architecture overview

Reference Video + Input Video to Video Baseline Comparisons

We compare our method for reference video + input video to output video against baselines that only take in the input video and a text prompt. Each column corresponds to a different baseline, and each row corresponds to a different reference effect video unseen during training.

Reference Effect
Initial Video
Our Method
(Click < or > to see more baselines) Wan VACE First Frame + Pose
See prompt
The video starts with a {subject}. The camera does a slight pan forward, and as the camera is panning, the {subject} starts to have an acid effect applied to them, where the screen shimmers and waves and shows multiple separate versions of the scene in reds and blues, overlapping with one another in a wavelike manner.
See prompt
Make it so that the beginning of the scene is unchanged, but during the video The person's body gradually takes on the appearance of carved marble, complete with fine cracks and subtle surface veining.
See prompt
Make it so that the beginning of the scene is unchanged, but during the video A miniature red and white hot air balloon drifts lazily across the upper background, tethered to nothing.
See prompt
Transition between the input woman and an edited version with the following edit: Create a motion blur effect on the woman at an angle of 49 degrees with a strength of 0.51. Transition between frames 11 and 30 using: top right to bottom left diagonal wipe

Reference Video + Image -> Video Baseline Comparisons

We compare our method for reference video + first frame to output video against baselines that only take in the input image and a text prompt. Each column corresponds to a different I2V baseline, and each row corresponds to a different reference effect video unseen during training.

Reference Effect
Initial Image
Our Method
(Click < or > to see more baselines) Wan 2.1 I2V
See prompt
a photo of a {subject}. The {subject} slowly morphs into a younger version of themselves, and a wooden carousel horse appears underneath the {subject}, and the {subject} is riding it. At the same time, the background morphs into the backdrop of a spinning carousel in a fair.
See prompt
The video starts with a {subject} with glasses. The camera then slowly zooms out, and many hands holding microphones enter the frame from both sides, pointed towards the person. The hands with microphones completely fill the frame as the camera continues to zoom out. More people then appear behind the person, walking into the shot from both sides, as if they are a security detail. The new people are dressed in formal wear, and the hands with microphones stay in the foreground as the video ends.
See prompt
The video starts with a {subject}. As the camera slowly pulls back, the background changes to a colorful, surreal landscape filled with large ice cream cone sculptures. Simultaneously, the {subject} gets covered in a dripping effect from the top down that resembles melting ice cream.
See prompt
a photo of a {subject}. The {subject} has a shocked expression on their face, as their backdrop falls away, revealing a jungle background with a dinosaur chasing the {subject} as they run away

Real Video Editing with Input Video Guidance and Reference Effect Video Guidance

Given a reference effect video and an input video, we can apply variants of classifier-free guidance toward the input video (left to right in the video grid) or toward the reference effect video (top to bottom in the video grid) to steer the editing of real videos. As more input video guidance is applied, the output more closely resembles the input video (i.e., the top right corner of the grid). As more reference effect video guidance is applied, the output more closely resembles the reference effect video (i.e., the bottom left corner of the grid; notice the hair change and the sweater from the reference effect video). By interpolating between these two, we give users finer control over the video editing process.
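As a rough sketch of how two-axis guidance of this kind can be implemented (following a nested, InstructPix2Pix-style formulation; the paper's exact variant may differ, and the function and argument names below are hypothetical):

```python
def dual_guidance_velocity(model, x_t, t, input_cond, ref_cond, w_input=2.0, w_ref=2.0):
    """Classifier-free guidance with separate scales for the input video and the
    reference effect video. `model` is assumed to predict velocity from the noisy
    latents x_t, the timestep t, and (possibly null) conditions."""
    v_uncond = model(x_t, t, input_cond=None, ref_cond=None)
    v_input  = model(x_t, t, input_cond=input_cond, ref_cond=None)
    v_full   = model(x_t, t, input_cond=input_cond, ref_cond=ref_cond)

    # Increasing w_input pulls the result toward the input video;
    # increasing w_ref pulls it toward the reference effect video.
    return (v_uncond
            + w_input * (v_input - v_uncond)
            + w_ref * (v_full - v_input))
```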

Reference Effect Video
Input Video

Dataset Samples

In order to train RefVFX, we create a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. We create this dataset using three different methods:

  • Video-to-video effect triplets, which do not exist naturally: We create a custom pipeline for generating text-guided reference video + input video to video effects. For more details, see Figure 4 of the paper.
  • Code-based temporal effects generated through programmatic composition: We generate a large-scale set of synthetic reference video + input video to video effect triplets by curating code pipelines that each take in an input video and output a video. Armed with a specific code-based effect and a fixed set of hyperparameters, we can apply the exact same effect to an arbitrary number of input videos (a toy sketch follows this list).
  • Image-to-video effects derived from LoRA adapters: We curate a reference video + image to video dataset by collecting Low-Rank Adapters (LoRAs) for different image-to-video effects online. For each effect, we apply its corresponding LoRA to two separate images to create a triplet.
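To make the code-based bullet concrete, here is a toy, hypothetical effect (a brightness "glow" with a fade-in transition). The real effects in the dataset differ, but the key property is the same: with the hyperparameters fixed, the identical effect can be applied to any number of input videos.

```python
import numpy as np

def glow_fade_in(frames, intensity=0.6, start_frame=8, end_frame=24):
    """Toy code-based effect: a glow that fades in over time.

    `frames` is a sequence of HxWx3 uint8 frames. Hyperparameters are fixed,
    so the same effect can be reapplied to arbitrary input videos.
    (Illustrative only; not one of the actual dataset effects.)
    """
    out = []
    for i, frame in enumerate(frames):
        # Transition pattern "fade in": ramp the effect strength from 0 to 1.
        alpha = float(np.clip((i - start_frame) / max(end_frame - start_frame, 1), 0.0, 1.0))
        f = frame.astype(np.float32)
        glow = np.clip(f * (1.0 + intensity), 0, 255)   # simple brightness-based glow
        out.append((f * (1 - alpha) + glow * alpha).astype(np.uint8))
    return out

# The same fixed effect applied to two different videos yields a triplet:
#   reference_effect_video = glow_fade_in(video_a)
#   input_video            = video_b
#   target_video           = glow_fade_in(video_b)
```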


Training Data: Novel Video-to-Video Effect Synthetic Pair Algorithm

We present a method to generate a pair of videos V and V′ from an effect prompt E, where V is an initial video and V′ is the same video with effect E applied. First, an image generation model creates an initial image. Next, an image editing model changes the pose, camera angle, and facial expression of the image. Finally, this edited image is edited again to add effect E. The first and second generated images are passed to a first–last-frame (FLF2V) model to produce video V. Then, a conditional video model, conditioned on the original first frame, the effect-edited last frame, and the intermediate poses extracted from V, produces video V′.
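The sketch below restates this pipeline as pseudocode; all callables (generate_image, edit_image, flf2v, pose_conditioned_video, extract_poses) are hypothetical placeholders for the underlying models, not a specific API:

```python
def make_effect_pair(effect_prompt, scene_prompt,
                     generate_image, edit_image, flf2v,
                     pose_conditioned_video, extract_poses):
    """Sketch of the synthetic pair algorithm: returns (V, V_prime)."""
    # 1. Generate an initial image of the scene.
    first_frame = generate_image(scene_prompt)

    # 2. Edit pose, camera angle, and expression to obtain a plausible last frame.
    last_frame = edit_image(first_frame, "change pose, camera angle, and expression")

    # 3. Edit the last frame again to apply the effect E.
    last_frame_fx = edit_image(last_frame, effect_prompt)

    # 4. A first-last-frame model produces the clean input video V.
    V = flf2v(first_frame, last_frame)

    # 5. A conditional video model, given the original first frame, the
    #    effect-edited last frame, and V's intermediate poses, produces V'.
    V_prime = pose_conditioned_video(first_frame, last_frame_fx, extract_poses(V))

    return V, V_prime
```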



Training Data - From Synthetic Pairs to Triplets: Novel Video-to-Video Effect Triplet Examples

We show examples of the novel video-to-video effect triplets created with our custom, scalable pipeline. To create (Reference Effect Video, Input Video, Output Video) triplets, we run the above algorithm twice with the same effect text prompt (e.g., "make it a painting") but different initial photos and poses. We take the two resulting video pairs (V1, V1′) and (V2, V2′), where the primed videos have the effect applied, and discard V1 from the first pair, so that V1′ serves as the reference effect video and (V2, V2′) serve as the input and output videos of the triplet. Each row corresponds to a different novel video-to-video effect triplet from our dataset.
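A minimal sketch of this pair-to-triplet step, reusing the hypothetical make_effect_pair from the previous section (variable names are ours):

```python
# Run the pair algorithm twice with the same effect prompt but different
# initial photos and poses ("models" stands for the placeholder callables above).
V1, V1_fx = make_effect_pair(effect_prompt, scene_prompt_a, *models)
V2, V2_fx = make_effect_pair(effect_prompt, scene_prompt_b, *models)

# Discard V1: its effect video becomes the reference, and the second pair
# supplies the input and target videos.
triplet = {
    "reference_effect_video": V1_fx,
    "input_video": V2,
    "target_video": V2_fx,
}
```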

Reference Effect
Input Video
Target Video
See prompt
Make it so that the beginning of the scene is unchanged, but during the video a business suit slowly catches fire, flames licking at the fabric, but the person remains perfectly calm, standing composed as the fire continues to spread across their clothing.
See prompt
Make it so that the beginning of the scene is unchanged, but during the video a long, dramatic opera cape made entirely of pizza slices slowly materializes around their shoulders, flowing dramatically as if it were a traditional cloak, transforming their appearance into a whimsical and surreal fashion statement.
See prompt
Make it so that the beginning of the scene is unchanged, but during the video a full-blown blizzard erupts, with strong winds sweeping heavy snow across the scene, drastically reducing visibility and creating a chaotic, white-out effect.
See prompt
Make it so that the beginning of the scene is unchanged, but during the video the environment slowly begins to melt, transforming into a surreal, dreamlike landscape reminiscent of a dali painting, with soft, flowing edges and distorted, melting forms.

Training Data: Code-Based Reference Video + Input Video to Video Triplet Examples

We show examples of the code-based effect triplets from our dataset. For each triplet, we use the same code to transform two input videos into videos with the same effect. We use one of the transformed outputs as the reference effect video, and the other input-output pair as the input video and target video, respectively. Each row corresponds to a different code-based effect triplet from our dataset.

Reference Effect
Input Video
Target Video
See prompt
Keep the woman unchanged. Transition between the input video and an edited version of the input video with the following editing instruction: Posterize the rest of the video with a color palette of Rich Loam, Burnt Sienna, Goldenrod, Parchment. Transition between frames 2 and 11 using the following temporal effect: bottom right to top left diagonal wipe
See prompt
Transition between the input person and an edited version of the input person with the following editing instruction: Create a photocopy effect on the person with a contrast of 3.33 and a strength of 0.74. Transition between frames 5 and 25 using the following temporal effect: diamond out with center (0.25, 0.74)
See prompt
Transition between the input person and an edited version of the input person with the following editing instruction: Create a retro, dithered effect on the person with a pixel block size of 16 and using 2 color steps per channel. Transition between frames 9 and 25 using the following temporal effect: bottom right to top left diagonal wipe
See prompt
Keep the woman unchanged. Transition between the input video and an edited version of the input video with the following editing instruction: Create a CC Ball Action effect on the rest of the video with a grid spacing of 10 and ball color of light Red. Transition between frames 22 and 30 using the following temporal effect: circle out with center (0.49, 0.13)

Training Data: LoRA-Based Reference Video + First Frame to Video Triplet Examples

We show examples of the LoRA-based effect triplets from our dataset. For each triplet, we use the same LoRA adapter to generate the reference effect video and the target video from their corresponding input frames. Each row corresponds to a different LoRA-based effect triplet from our dataset.

Reference Effect
Input Image
Target Video
See prompt
The video starts with an image of a black man. The m0n4 Mona Lisa transformation begins as a dark sheet seems to wrap around the black man, and when the image resolves, the black man is depicted as a Mona Lisa version of itself. The Mona Lisa version sits in a chair with a backdrop featuring a landscape painting.
See prompt
The video begins with a black man. The black man begins the 54mur41 samurai transformation, and becomes a samurai. The black man is wearing a traditional samurai outfit, and is holding a katana. The background behind the black man is a misty mountainous landscape.
See prompt
In the video, a miniature black man is presented. The black man is held in a person's hands. The person then presses on the black man, causing a sq41sh squish effect. The person keeps pressing down on the black man, further showing the sq41sh squish effect.
See prompt
A latino man in a natural setting. The r8b8t1c robotic face reveal starts with subtle movements, then lines appear on their face, their expression shifting as the metallic, robotic visage takes form, revealing the interior mechanism.