Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA (CVPR 2026 Findings)
Customizing a dedicated semantic LoRA for each reference video.
arXiv | Project Page (coming soon)
Official implementation of the paper "Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA".
Video2LoRA enables semantic video generation by dynamically predicting lightweight LoRA adapters from reference videos using a HyperNetwork, without requiring per-condition fine-tuning.
Highlights

Video2LoRA introduces a new paradigm for semantic-controlled video generation.
Instead of training separate models or LoRA adapters for each semantic condition (e.g., visual effects, camera motion, style), our framework predicts semantic-specific LoRA weights directly from a reference video.
Key features:
- Reference-driven semantic video generation
- Ultra-lightweight LoRA (<50 KB per semantic condition)
- Transformer-based HyperNetwork for LoRA prediction
- Strong zero-shot generalization
- Unified framework across heterogeneous semantic controls
Method Overview
Video2LoRA consists of three key components:
1. LightLoRA Representation
We introduce LightLoRA, a compact formulation that decomposes the standard LoRA matrices into:

- $A_{\text{aux}}, B_{\text{aux}}$: trainable auxiliary matrices
- $A_{\text{pred}}, B_{\text{pred}}$: matrices predicted by the HyperNetwork

This design significantly reduces the parameter count while preserving semantic adaptability: each semantic condition requires less than 50 KB of parameters.
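The decomposition can be illustrated with a toy numerical sketch. The dimensions, ranks, and the exact factorization $\Delta W = B_{\text{aux}} B_{\text{pred}} A_{\text{pred}} A_{\text{aux}}$ used here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Hypothetical LightLoRA sketch: the standard LoRA update delta_W = B @ A is
# split so that small "pred" factors come from the HyperNetwork while the
# "aux" factors are trained directly and shared across conditions.
d, k = 256, 256        # assumed hidden dimensions of one projection layer
r, r_pred = 16, 4      # assumed LoRA rank and smaller predicted inner rank

rng = np.random.default_rng(0)
A_aux = rng.normal(size=(r, k)) * 0.01   # trainable, shared across conditions
B_aux = rng.normal(size=(d, r)) * 0.01   # trainable, shared across conditions
A_pred = rng.normal(size=(r_pred, r))    # predicted per reference video
B_pred = rng.normal(size=(r, r_pred))    # predicted per reference video

# Composite low-rank update added to the frozen weight W.
delta_W = B_aux @ (B_pred @ A_pred) @ A_aux   # shape (d, k), rank <= r_pred

# Only the predicted factors are stored per semantic condition.
per_condition_params = A_pred.size + B_pred.size
print(per_condition_params)   # 2 * r * r_pred = 128 floats for this one layer
```

Because only the `pred` factors vary per condition, the per-condition storage stays tiny even when the update is applied across many backbone layers.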
2. HyperNetwork for LoRA Prediction
A Transformer-based HyperNetwork predicts semantic-specific LoRA weights conditioned on a reference video.
Pipeline:
Reference Video
↓
3D VAE Encoder
↓
Spatio-temporal features
↓
Transformer Decoder
↓
Predicted LoRA weights
These predicted LoRA modules are injected into the frozen diffusion backbone.
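The pipeline above can be sketched numerically as follows. All sizes, the single-head cross-attention, and the shared output head are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

# Toy sketch of the prediction step: spatio-temporal features from the 3D VAE
# encoder are cross-attended by learned query tokens, and a linear head maps
# each query to the flattened LoRA factors for one target layer.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, D = 64, 128                   # assumed feature tokens and channel width
n_layers, r, r_pred = 8, 16, 4   # assumed target layers and LightLoRA ranks

rng = np.random.default_rng(0)
feats = rng.normal(size=(T, D))            # stand-in for 3D VAE features
queries = rng.normal(size=(n_layers, D))   # one learned query per target layer
W_head = rng.normal(size=(D, 2 * r * r_pred)) * 0.01   # shared output head

# One simplified cross-attention step of the Transformer decoder.
attn = softmax(queries @ feats.T / np.sqrt(D))   # (n_layers, T)
ctx = attn @ feats                               # (n_layers, D)

flat = ctx @ W_head                              # (n_layers, 2 * r * r_pred)
A_pred = flat[:, : r * r_pred].reshape(n_layers, r_pred, r)
B_pred = flat[:, r * r_pred :].reshape(n_layers, r, r_pred)
print(A_pred.shape, B_pred.shape)   # (8, 4, 16) (8, 16, 4)
```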
3. End-to-End Diffusion Training
Unlike prior methods that require:
- pretrained semantic LoRA weights
- multi-stage training pipelines
Video2LoRA is trained end-to-end using only the standard diffusion objective.
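The single-stage objective can be illustrated with a minimal sketch (toy shapes, a toy noise schedule, and a stand-in backbone; not the repository's actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(x_t, t, delta_W):
    # Stand-in for the frozen diffusion transformer; the predicted LoRA
    # enters as an additive low-rank update on one frozen projection W.
    W = np.eye(x_t.shape[-1])
    return x_t @ (W + delta_W).T * (1.0 - t)   # toy epsilon prediction

def diffusion_loss(x0, delta_W):
    t = rng.uniform()                    # random timestep in [0, 1)
    eps = rng.normal(size=x0.shape)      # target noise
    x_t = np.sqrt(1 - t) * x0 + np.sqrt(t) * eps   # toy forward process
    eps_hat = frozen_backbone(x_t, t, delta_W)
    return np.mean((eps_hat - eps) ** 2)           # standard epsilon-MSE

x0 = rng.normal(size=(4, 8))                   # stand-in for video latents
delta_W = 0.01 * rng.normal(size=(8, 8))       # would come from the HyperNetwork
loss = diffusion_loss(x0, delta_W)
```

In the real system the gradient of this loss flows through `delta_W` back into the HyperNetwork, so no pretrained per-condition LoRA weights or separate training stages are needed.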
Zero-Shot Semantic Generation
Video2LoRA generalizes well to unseen semantic conditions.
Even when encountering out-of-domain visual effects, the model can generate semantically aligned videos based on reference videos.
Example semantic controls include:
- visual effects (VFX)
- camera motion
- object stylization
- character transformations
- artistic styles
Dataset
Video2LoRA follows the dataset format used in VideoX-Fun, which supports mixed image and video training with text descriptions.
Organize your dataset in the following structure:
```
project/
└── datasets/
    └── internal_datasets/
        ├── train/
        │   ├── 00000001.mp4
        │   ├── 00000002.jpg
        │   ├── 00000003.mp4
        │   └── ...
        └── json_of_internal_datasets.json
```
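A small helper (hypothetical, not part of the repo) can generate the annotation file for this layout; captions must still be supplied by hand or by a captioning model:

```python
import json
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mov"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def build_annotations(dataset_root):
    """Scan <dataset_root>/train and emit entries in the annotation format."""
    root = Path(dataset_root)
    entries = []
    for path in sorted((root / "train").iterdir()):
        ext = path.suffix.lower()
        if ext in VIDEO_EXTS:
            kind = "video"
        elif ext in IMAGE_EXTS:
            kind = "image"
        else:
            continue  # skip files that are neither videos nor images
        entries.append({
            "file_path": str(path.relative_to(root)),
            "text": "",        # fill in the caption for this sample
            "type": kind,
        })
    return entries

# Example usage:
# ann = build_annotations("datasets/internal_datasets")
# with open("datasets/internal_datasets/json_of_internal_datasets.json", "w") as f:
#     json.dump(ann, f, indent=2)
```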
JSON Annotation Format
```json
[
  {
    "file_path": "train/00000001.mp4",
    "text": "A group of young men in suits and sunglasses walking down a city street.",
    "type": "video"
  },
  {
    "file_path": "train/00000002.jpg",
    "text": "A group of young men in suits and sunglasses walking down a city street.",
    "type": "image"
  }
]
```

Installation
Clone the repository:

```shell
git clone https://github.com/BerserkerVV/Video2LoRA.git
cd Video2LoRA
```

Create the environment:

```shell
conda create -n video2lora python=3.10
conda activate video2lora
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Training
Train Video2LoRA:
```shell
bash scripts/cogvideoxfun/train_lora.sh
```

Training setup:
| Item | Value |
|---|---|
| Backbone | CogVideoX-Fun-V1.1-5b-InP |
| GPUs | 8 × NVIDIA A800 |
| Iterations | 20K |
| Frames | 49 |
| FPS | 8 |
| Resolution | 512, 768, 1024, 1280 |
Inference
Generate a video using a reference video:
```shell
bash examples/cogvideox_fun/run_predict_i2v.sh
```

Citation
If you find our work useful, please cite:
```bibtex
@misc{wu2026video2loraunifiedsemanticcontrolledvideo,
  title={Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA},
  author={Zexi Wu and Qinghe Wang and Jing Dai and Baolu Li and Yiming Zhang and Yue Ma and Xu Jia and Hongming Xu},
  year={2026},
  eprint={2603.08210},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08210},
}
```

If you find this project useful, please consider starring the repository to support our work.
