VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Sixiao Zheng1,2 Minghao Yin3 Wenbo Hu4† Xiaoyu Li4 Ying Shan4 Yanwei Fu1,2†
1Fudan University 2Shanghai Innovation Institute 3HKU 4ARC Lab, Tencent PCG
†Corresponding authors
CVPR 2026
✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.
🔥 News
- [Feb 21, 2026] 🎉 VerseCrafter is accepted to CVPR 2026!
- [Jan 9, 2026] 🚀 VerseCrafter is released! We publish the arXiv preprint, inference code, and model checkpoints.
✅ TODO
- Inference code
- Training code
- Data processing code
TL;DR
- Dynamic Realistic Video World Model: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
- 4D Geometric Control: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
- Frozen Video Prior + GeoAdapter: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
- VerseControl4D Dataset: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.
Installation
1. Clone the repository:

```bash
git clone --recursive https://github.com/TencentARC/VerseCrafter.git
cd VerseCrafter
# If you have already cloned the repo, you can update the submodules manually:
git submodule update --init --recursive
```

2. Create and activate the Conda environment:

```bash
conda create -n versecrafter python=3.11 -y
conda activate versecrafter

# Install PyTorch
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install Python dependencies
pip install -r requirements.txt

# Install MoGe
pip install git+https://github.com/microsoft/MoGe.git

# Install Grounded-SAM-2
cd third_party/Grounded-SAM-2
pip install -e .
pip install --no-build-isolation -e grounding_dino

# Install flash attention
pip install flash-attn --no-build-isolation

# Install pytorch3d
cd ../../
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
pip install --no-build-isolation .
cd ..
```
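After installation, you can quickly confirm that the key dependencies resolved. This optional helper is a sketch, not part of the repo; the module names checked (e.g. `moge` for the MoGe package) are best-effort guesses.

```python
# Optional post-install sanity check (sketch, not part of the repo).
# Module names below are best-effort guesses for each installed package.
import importlib.util


def check_packages(names):
    """Return {module_name: bool} indicating whether each module is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_packages(["torch", "torchvision", "flash_attn", "pytorch3d", "moge"])
    for name, ok in status.items():
        print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```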
Download Checkpoints
1. Download VerseCrafter and Wan2.1 models:

```bash
pip install --upgrade huggingface_hub
mkdir -p model
hf download --local-dir model/VerseCrafter sxzheng/VerseCrafter
hf download --local-dir model/Wan2.1-T2V-14B Wan-AI/Wan2.1-T2V-14B
```

2. Download Grounded-SAM-2 and Grounding DINO checkpoints:

```bash
cd third_party/Grounded-SAM-2/checkpoints
bash download_ckpts.sh
cd ../gdino_checkpoints
bash download_ckpts.sh
cd ../../../
```
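Before running inference, you can verify that all expected checkpoint directories are in place. This is an optional sketch; the paths simply mirror the download commands above.

```python
# Optional sketch: check that the expected checkpoint directories exist.
# Paths mirror the download commands in this README.
from pathlib import Path

EXPECTED = [
    "model/VerseCrafter",
    "model/Wan2.1-T2V-14B",
    "third_party/Grounded-SAM-2/checkpoints",
    "third_party/Grounded-SAM-2/gdino_checkpoints",
]


def missing_paths(root="."):
    """Return the subset of EXPECTED paths that do not exist under root."""
    return [p for p in EXPECTED if not (Path(root) / p).is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoint directories found.")
```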
Usage
We provide two ways to use VerseCrafter:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Blender Addon | Deploy API server on GPU machine, call models directly from Blender | One-stop workflow, no context switching, visual trajectory editing | Requires network access to GPU server |
| Script Pipeline | Run each step manually via command line | Works offline, full control over each step | Requires manual switching between terminal and Blender |
💡 Tip: We recommend the Blender Addon for most users. It supports proxy authentication for secure server access. If you cannot connect to a remote GPU server, use the Script Pipeline instead.
Option 1: Blender Addon (Recommended)
For detailed instructions, see README_BLENDER.md.
Prerequisites
- Blender 4.0+ (4.5+ recommended)
- A remote GPU server running the VerseCrafter API
Quick Start
1. Install the addon:

```bash
cd VerseCrafter
zip -r blender_addon.zip blender_addon/
```

In Blender: Edit → Preferences → Add-ons → ↓ → Install from Disk... → select `blender_addon.zip` → enable "VerseCrafter Workflow".

2. Start the API server (on the GPU server):

```bash
python api_server.py --port 8188 --num_gpus 8
```

3. Configure the connection in Blender:
   - Press `N` to open the sidebar → VerseCrafter tab.
   - Set Server URL (e.g., `http://<server-ip>:8188`).
   - Click Test Connection.

4. Run the workflow:
   - Step 1: Select the input image, set the workflow directory, enter an object prompt (e.g., "person . car ."), and click "Run Preprocessing".
   - Step 2: Edit camera and object trajectories visually, then click "Export Trajectories".
   - Step 3: Enter a video prompt and click "Generate Video".
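You can also probe the server from the command line before configuring Blender. This is a minimal sketch; the `/health` path is an assumption for illustration, and the actual routes are defined in `api_server.py`.

```python
# Minimal connectivity probe for the API server (sketch).
# The "/health" endpoint path is an assumption; check api_server.py for the
# actual routes exposed by the server.
import urllib.error
import urllib.request


def probe(server_url: str, path: str = "/health", timeout: float = 5.0) -> bool:
    """Return True if the server answers the given path with an HTTP 2xx status."""
    url = server_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("server reachable:", probe("http://localhost:8188"))
```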
Option 2: Script Pipeline
The `inference.sh` script provides a complete pipeline for generating videos. You can run the steps individually or use the script as a reference.
1. Configuration
Edit `inference.sh` to set your input image, output directory, and prompt.

```bash
INPUT_IMAGE=demo_data/y57HgqX1uGc_0039750_0041550_0000635_0000716/0001.jpg
OUTPUT_DIR=demo_data/y57HgqX1uGc_0039750_0041550_0000635_0000716
MODEL_PATH="model/VerseCrafter"
```

2. Run the Pipeline
The pipeline consists of the following steps:
Step 1: Depth Estimation
Generate depth maps using MoGe-V2.

```bash
python inference/moge-v2_infer.py -i $INPUT_IMAGE -o $OUTPUT_DIR/estimated_depth --maps
```

Step 2: Segmentation
Segment objects using Grounded-SAM-2.
```bash
python inference/grounded_sam2_infer.py \
    --image_path "$INPUT_IMAGE" \
    --text_prompt "person . car ." \
    --output_dir "$OUTPUT_DIR/object_mask" \
    --min_area_ratio 0.003 \
    --max_area_ratio 0.2
```

Step 3: Fit 3D Gaussian
Fit 3D Gaussians to the segmented objects.
```bash
python inference/fit_3D_gaussian.py \
    --image_path $INPUT_IMAGE \
    --npz_path $OUTPUT_DIR/estimated_depth/depth_intrinsics.npz \
    --masks_dir $OUTPUT_DIR/object_mask/masks \
    --output_dir $OUTPUT_DIR/fitted_3D_gaussian
```

The following are the input image and its corresponding results:
| Input Image | Depth Map | Segmentation Mask | 3D Gaussian |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
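To build intuition for Step 3, the sketch below back-projects masked depth pixels to camera space with pinhole intrinsics and summarizes the object as a 3D Gaussian (mean + covariance). The real fitting lives in `inference/fit_3D_gaussian.py`; this pure-Python version is for illustration only, and the pixel/intrinsic values are made up.

```python
# Illustrative sketch of Step 3 (not the repo's implementation):
# back-project masked depth pixels to 3D, then fit a 3D Gaussian.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth to camera space."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def fit_gaussian(points):
    """Mean and 3x3 covariance of a list of 3D points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    return mean, cov


if __name__ == "__main__":
    # Hypothetical masked pixels (u, v, depth) and intrinsics.
    pixels = [(310, 200, 4.0), (330, 210, 4.2), (320, 220, 4.1)]
    pts = [backproject(u, v, d, fx=700, fy=700, cx=320, cy=240) for u, v, d in pixels]
    mean, cov = fit_gaussian(pts)
    print("center:", mean)
```

The mean gives the ellipsoid center and the covariance eigenvectors give its axes, which is what the exported ellipsoid in the Blender scene represents.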
Step 4: Customize Trajectory (Manual Operation in Blender)
This step requires Blender to interactively edit the 4D control scene. We also provide a demonstration video that shows step-by-step Blender operations for this process:
Watch the Blender operation video here
1. Prepare Scripts:
   - Open `inference/blender_script/build_4d_control_scene.py` and `inference/blender_script/export_blender_custom_trajectories.py`.
   - Crucial: Update the `ROOT_DIR` variable in both scripts to the absolute path of your input directory (e.g., `/absolute/path/to/demo_data/your_folder`).

2. Build Scene:
   - Open Blender.
   - Go to the Scripting tab.
   - Open or paste the content of `build_4d_control_scene.py`.
   - Run the script to load the scene (point cloud, camera, objects).

3. Customize Trajectories:
   - Switch to the Layout tab.
   - Camera Trajectory:
     - Create a curve (e.g., `Shift+A` → Curve → Bezier).
     - Switch to Edit Mode to draw or adjust the curve.
     - Select the Camera and add a Follow Path constraint targeting the curve.
     - Check Fixed Position.
     - Set the animation duration to 81 frames.
   - 3D Gaussian (Object) Trajectory:
     - Select the object (Ellipsoid).
     - Use the same Follow Path method as the camera, or insert keyframes (`I` key) for location/rotation/scale.

4. Export Trajectories:
   - Go back to the Scripting tab.
   - Open or paste the content of `export_blender_custom_trajectories.py`.
   - Run the script to export `custom_camera_trajectory.npz` and `custom_3D_gaussian_trajectory.json`.
This is an animation of custom trajectories in Blender:
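If you prefer to script object trajectories instead of animating them in Blender, something like the sketch below could produce a trajectory file. The exact schema of `custom_3D_gaussian_trajectory.json` is defined by `export_blender_custom_trajectories.py`; the keys used here (`frames`, `center`, `scale`, `rotation`) and the 81-frame count are illustrative assumptions only.

```python
# Sketch: generate an 81-frame linear object trajectory programmatically.
# The JSON keys below are assumptions -- consult
# export_blender_custom_trajectories.py for the real schema.
import json


def linear_trajectory(start, end, num_frames=81):
    """Per-frame ellipsoid states moving linearly from start to end."""
    frames = []
    for t in range(num_frames):
        a = t / (num_frames - 1)
        center = [s + a * (e - s) for s, e in zip(start, end)]
        frames.append({
            "frame": t,
            "center": center,
            "scale": [0.5, 0.5, 1.0],      # fixed ellipsoid extents (placeholder)
            "rotation": [0.0, 0.0, 0.0],   # Euler angles (placeholder)
        })
    return {"object_0": {"frames": frames}}


if __name__ == "__main__":
    traj = linear_trajectory([0.0, 0.0, 5.0], [2.0, 0.0, 5.0])
    print(json.dumps(traj["object_0"]["frames"][0], indent=2))
```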
Step 5: Render 4D Control Maps
Render the 4D control maps for the diffusion model.
```bash
python inference/rendering_4D_control_maps.py \
    --png_path $INPUT_IMAGE \
    --npz_path $OUTPUT_DIR/estimated_depth/depth_intrinsics.npz \
    --mask_dir $OUTPUT_DIR/object_mask/masks \
    --trajectory_npz $OUTPUT_DIR/camera_object_0/custom_camera_trajectory.npz \
    --ellipsoid_json $OUTPUT_DIR/camera_object_0/custom_3D_gaussian_trajectory.json \
    --output_dir $OUTPUT_DIR/camera_object_0/rendering_4D_maps
```

The following are the 4D control maps rendered in this step:
| Background RGB | Background Depth | 3D Gaussian RGB | 3D Gaussian Depth | Merged Mask |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
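Conceptually, rendering the control maps reprojects the scene's 3D points through each frame's camera. The sketch below shows that core pinhole projection (world point → pixel + depth) under a simple world-to-camera pose `(R, t)`; it is an intuition aid, not the repo's renderer.

```python
# Intuition sketch (not the repo's renderer): project a world-space 3D point
# through a pinhole camera with world-to-camera pose (R, t).

def project(point_w, R, t, fx, fy, cx, cy):
    """Project a world-space 3D point to (u, v, depth) in the image plane."""
    # World -> camera: X_c = R @ X_w + t (row-major 3x3 R, length-3 t).
    xc = [sum(R[i][k] * point_w[k] for k in range(3)) + t[i] for i in range(3)]
    z = xc[2]
    u = fx * xc[0] / z + cx
    v = fy * xc[1] / z + cy
    return u, v, z


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    # A point on the optical axis projects to the principal point (cx, cy).
    print(project([0.0, 0.0, 4.0], identity, [0.0, 0.0, 0.0],
                  fx=700, fy=700, cx=320, cy=240))
```

Per-frame depth maps and masks then come from keeping, at each pixel, the nearest projected point's color, depth, and object label.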
Step 6: VerseCrafter Inference
Generate the final video.
```bash
torchrun --nproc-per-node=8 inference/versecrafter_inference.py \
    --transformer_path $MODEL_PATH \
    --num_inference_steps 30 \
    --sample_size "720,1280" \
    --ulysses_degree 2 \
    --ring_degree 4 \
    --prompt "A sun-drenched street in Valletta, Malta, showcasing towering honey-colored limestone buildings adorned with traditional wrought-iron balconies and arched doorways. On the left-hand sidewalk, a man in a bright orange T-shirt and a woman in a beige summer dress walk side-by-side. Several cars are parked in the distance. The vibrant Mediterranean sunlight casts soft shadows, illuminating the weathered textures of the ancient architecture, which stretches towards distant city fortifications under a clear, pale blue sky." \
    --input_image_path $INPUT_IMAGE \
    --save_path $OUTPUT_DIR/camera_object_0 \
    --rendering_maps_path $OUTPUT_DIR/camera_object_0/rendering_4D_maps
```

Note that `--ulysses_degree` × `--ring_degree` (here 2 × 4) matches the 8 GPUs requested via `--nproc-per-node`.

Acknowledgements
Our code builds upon MoGe, Grounded-SAM-2, VideoX-Fun, Wan2.1, and diffusers.
License
This project is released under the VerseCrafter License. It is intended for academic/research purposes only and commercial use is not permitted.
Star History
Citation
If you find this work useful, please consider citing:
@article{zheng2026versecrafter,
title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
journal={arXiv preprint arXiv:2601.05138},
year={2026}
}












