FastVMT⚡: Eliminating Redundancy in
Video Motion Transfer

ICLR 2026
Yue Ma2,† Zhikai Wang1,† Tianhao Ren1,† Mingzhe Zheng2 Hongyu Liu2 Jiayi Guo3 Mark Fong Yuxuan Xue4 Zixiang Zhao5 Konrad Schindler5 Qifeng Chen2 Linfeng Zhang1,✉
1 EPIC Lab, SJTU 2 HKUST 3 THU 4 Meta 5 ETH Zürich
† Equal contribution ✉ Corresponding Author
Abstract

Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we restrict attention in the corresponding layers to a local neighborhood, so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43× speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
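
To make the local-attention idea concrete, the sketch below is a simplified Python/PyTorch illustration, not the released implementation: it assumes a single-frame token grid of size height × width and a hypothetical window radius hyperparameter, and shows how a window mask can confine cross-frame attention so that similarity weights for distant regions are never evaluated.

import torch

def local_window_mask(height, width, radius):
    # Boolean mask of shape (height*width, height*width); True = attention allowed.
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (N, 2) grid positions
    offsets = (coords[:, None, :] - coords[None, :, :]).abs()    # pairwise |dy|, |dx|
    return (offsets <= radius).all(dim=-1)                       # keep only local pairs

def masked_cross_frame_attention(q, k, v, mask):
    # q: tokens of frame t, k/v: tokens of frame t+1, each of shape (N, dim).
    logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    logits = logits.masked_fill(~mask, float("-inf"))            # never score distant regions
    return torch.softmax(logits, dim=-1) @ v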

Motivation

Motivation of our method. Training-free video motion transfer can benefit from exploiting redundancies, both at the level of the DiT architecture and of the iterative diffusion process. (a) Motion redundancy: Video motion is small and locally consistent, so a motion token in one frame only ever matches tokens in the next frame within a local neighborhood. (b) Gradient redundancy: Gradient updates in consecutive optimization steps are largely similar (visualized here with PCA), so there is no need to recompute them at every single step.

Motivation figure
Method

Overview of our method. Left: Given a reference video, we use a sliding-window strategy to extract motion embeddings from attention during the inversion stage. At the denoising stage, we compute the total loss and apply step-skipping gradient optimization to guide video generation. Right: We visualize the effectiveness of the two core designs. Top-Right: The sliding window eliminates incorrect global correspondences (e.g., matching the person to the background), ensuring precise motion alignment. Bottom-Right: The PCA analysis shows that the optimization trajectory remains stable even when gradient steps are skipped, confirming that our method reduces redundancy without degrading motion transfer performance.

Method overview
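
The step-skipping gradient optimization can be pictured with the following minimal sketch. It is an assumption about the loop structure rather than the released code; denoise_step, motion_loss, guidance_scale, and skip_interval are hypothetical placeholders for the denoiser call, the total loss, and the guidance hyperparameters.

import torch

def guided_denoising(latents, timesteps, denoise_step, motion_loss,
                     guidance_scale=1.0, skip_interval=3):
    cached_grad = None
    for i, t in enumerate(timesteps):
        if cached_grad is None or i % skip_interval == 0:
            # Full step: back-propagate the motion loss to obtain a fresh gradient.
            latents = latents.detach().requires_grad_(True)
            loss = motion_loss(latents, t)
            cached_grad = torch.autograd.grad(loss, latents)[0]
        # Skipped steps reuse the cached gradient instead of recomputing it,
        # exploiting the slow change of gradients along the diffusion trajectory.
        latents = latents.detach() - guidance_scale * cached_grad
        latents = denoise_step(latents, t)
    return latents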
Corresponding window
Evaluation
Visual Results
Given reference videos with distinct motion dynamics, our model generates videos that faithfully transfer the motion patterns while aligning with different text prompts.
BibTeX
@inproceedings{fastvmt2026,
  title     = {FastVMT: Eliminating Redundancy in Video Motion Transfer},
  author    = {Ma, Yue and Wang, Zhikai and Ren, Tianhao and Zheng, Mingzhe and Liu, Hongyu and Guo, Jiayi and Feng, Kunyu and Xue, Yuxuan and Zhao, Zixiang and Schindler, Konrad and Chen, Qifeng and Zhang, Linfeng},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}