
[ICCV 2025] LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models


(Demo video: performance.mp4)

We present LeanVAE, a lightweight Video VAE designed for ultra-efficient video compression and scalable generation in Latent Video Diffusion Models (LVDMs).

  • Lightweight & Efficient: only 40M parameters, drastically reducing FLOPs, inference time, and memory usage 📉
  • Optimized for High-Resolution Videos: encodes/decodes a 17-frame 1080p video in 0.9 s / 3.0 s with 6 GB / 15 GB of GPU memory (under 4×8×8 / 1×8×8 compression ratios) 🎯
  • State-of-the-Art Video Reconstruction: competes with leading Video VAEs 🏆
  • Robust Support for Long Videos: ensures temporal consistency across varying frame lengths and implements lossless temporal tiling inference for flexible processing of long sequences ⏱️
  • Versatile: supports both images and videos, preserving causality in the latent space 📽️
  • Evidenced by Diffusion Models: enhances visual quality in video generation ✨
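As a quick sanity check on the compression ratios above, the latent grid for a clip can be worked out directly. This is a back-of-the-envelope sketch (the `latent_shape` helper is illustrative, not part of the LeanVAE API); `d` is the latent channel count of the released 4- and 16-channel variants:

```python
def latent_shape(frames: int, height: int, width: int, d: int) -> tuple:
    """Latent grid under 4x temporal and 8x8 spatial compression.

    Frame counts must be 4n + 1; the first frame maps to its own
    latent step (causal), so T frames -> (T - 1) // 4 + 1 latent steps.
    """
    assert (frames - 1) % 4 == 0, "frame count must be 4n + 1"
    return (d, (frames - 1) // 4 + 1, height // 8, width // 8)

# A 17-frame 1080p clip: 5 latent steps on a 135 x 240 spatial grid.
print(latent_shape(17, 1080, 1920, d=4))  # (4, 5, 135, 240)
```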

πŸ› οΈ Installation

Clone the repository and install dependencies:

git clone https://github.com/westlake-repl/LeanVAE
cd LeanVAE
pip install -r requirements.txt

🎯 Quick Start

Train LeanVAE

bash scripts/train.sh

You can use pl_ckpt_inference.py to evaluate checkpoints saved during training. See this discussion.

Run Video Reconstruction

bash scripts/inference.sh

Evaluate Reconstruction Quality

bash scripts/eval.sh
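For intuition about the PSNR numbers reported in the tables below, here is a standalone sketch of the metric. This is not the project's evaluation code (use scripts/eval.sh for reported numbers); it is a minimal NumPy illustration assuming pixel values in [0, 1]:

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio, in dB, for arrays scaled to [0, max_val]."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs: infinite PSNR
    return 10.0 * np.log10(max_val ** 2 / mse)

# Small reconstruction noise -> high (but finite) PSNR.
frame = np.random.rand(64, 64, 3)
noisy = np.clip(frame + np.random.normal(0.0, 0.01, frame.shape), 0.0, 1.0)
print(psnr(frame, frame))   # inf
print(psnr(frame, noisy))   # roughly 40 dB for sigma = 0.01
```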

📜 Pretrained Models

Video VAE Model:

| Model | PSNR ⬆️ | LPIPS ⬇️ | Params 📦 | TFLOPs ⚡ | Checkpoint 📥 |
|---|---|---|---|---|---|
| LeanVAE-4ch | 26.04 | 0.0899 | 39.8M | 0.203 | LeanVAE-chn4.ckpt |
| LeanVAE-16ch | 30.15 | 0.0461 | 39.8M | 0.203 | LeanVAE-chn16.ckpt |

Latte Model:

You can find the video generation code in the generation folder.

| Model | Dataset | FVD ⬇️ | Checkpoint 📥 |
|---|---|---|---|
| Latte + LeanVAE-chn4 | SkyTimelapse | 49.59 | sky-chn4.ckpt |
| Latte + LeanVAE-chn4 | UCF101 | 164.45 | ucf-chn4.ckpt |
| Latte + LeanVAE-chn16 | SkyTimelapse | 95.15 | sky-chn16.ckpt |
| Latte + LeanVAE-chn16 | UCF101 | 175.33 | ucf-chn16.ckpt |

🔧 Using LeanVAE in Your Project

from LeanVAE import LeanVAE

# Load pretrained model
model = LeanVAE.load_from_checkpoint("path/to/ckpt", strict=False)

# 🔄 Encode & decode an image
image, image_rec = model.inference(image)

# 🖼️ Encode an image → get its latent
latent = model.encode(image)  # (B, C, H, W) → (B, d, 1, H/8, W/8), where d = 4 or 16

# 🖼️ Decode the latent → reconstruct the image
image = model.decode(latent, is_image=True)  # (B, d, 1, H/8, W/8) → (B, C, H, W)


# 🔄 Encode & decode a video
video, video_rec = model.inference(video)  # frame count must be 4n+1 (e.g., 5, 9, 13, 17, ...)

# 🎞️ Encode a video → get its latent
latent = model.encode(video)  # (B, C, T+1, H, W) → (B, d, T/4+1, H/8, W/8), where d = 4 or 16

# 🎞️ Decode the latent → reconstruct the video
video = model.decode(latent)  # (B, d, T/4+1, H/8, W/8) → (B, C, T+1, H, W)

# ⚡ Enable temporal tiling inference for long videos
model.set_tile_inference(True)
model.chunksize_enc = 5
model.chunksize_dec = 5
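If a clip does not satisfy the 4n+1 frame-count rule, it can be padded before encoding. The sketch below is a hypothetical helper, not part of the LeanVAE API; it repeats the last frame until the count is valid, shown with NumPy arrays in the same (B, C, T, H, W) layout (the same indexing works on torch tensors):

```python
import numpy as np

def pad_to_4n_plus_1(video: np.ndarray) -> np.ndarray:
    """Repeat the last frame of a (B, C, T, H, W) clip until T == 4n + 1."""
    t = video.shape[2]
    remainder = (t - 1) % 4
    if remainder == 0:
        return video  # already a valid frame count
    pad = 4 - remainder
    last = np.repeat(video[:, :, -1:], pad, axis=2)  # copies of the final frame
    return np.concatenate([video, last], axis=2)

clip = np.random.rand(1, 3, 14, 64, 64)
print(pad_to_4n_plus_1(clip).shape)  # (1, 3, 17, 64, 64)
```

Repeating the final frame keeps the padded region static, so it is easy to crop away after decoding.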

🔬 Further Explorations

📂 Preparing Data for Training

To train LeanVAE, you need to create metadata files listing the video paths, grouped by resolution. Each file contains paths to videos of the same resolution.

📂 data_list
 ├── 📄 96x128.txt  📜  # Contains paths to all 96×128 videos
 │   ├── /path/to/video_1.mp4
 │   ├── /path/to/video_2.mp4
 │   ├── ...
 ├── 📄 256x256.txt  📜  # Contains paths to all 256×256 videos
 │   ├── /path/to/video_3.mp4
 │   ├── /path/to/video_4.mp4
 │   ├── ...
 ├── 📄 352x288.txt  📜  # Contains paths to all 352×288 videos
 │   ├── /path/to/video_5.mp4
 │   ├── /path/to/video_6.mp4
 │   ├── ...

📌 Each text file lists video paths corresponding to a specific resolution. Set args.train_datalist to the folder containing these files.
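Building these metadata files can be scripted. The sketch below is illustrative, not the project's tooling: `build_datalist` is a hypothetical helper, and the resolution `probe` is left as a callback (in practice you might read it with OpenCV's `cv2.VideoCapture`). It groups video paths by resolution and writes one `<H>x<W>.txt` per group:

```python
from collections import defaultdict
from pathlib import Path

def build_datalist(videos, probe, out_dir):
    """Group video paths by the (height, width) returned by `probe`
    and write one <H>x<W>.txt file per resolution under `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    groups = defaultdict(list)
    for path in videos:
        h, w = probe(path)
        groups[(h, w)].append(str(path))
    for (h, w), paths in groups.items():
        (out / f"{h}x{w}.txt").write_text("\n".join(paths) + "\n")
    return sorted(p.name for p in out.glob("*.txt"))

# Example with a stub probe that looks resolutions up in a dict.
sizes = {"a.mp4": (96, 128), "b.mp4": (96, 128), "c.mp4": (256, 256)}
print(build_datalist(sizes, lambda p: sizes[p], "data_list"))
# ['256x256.txt', '96x128.txt']
```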


📜 License

This project is released under the MIT License. See the LICENSE file for details.

🔥 Why Choose LeanVAE?

LeanVAE is fast, lightweight, and powerful, enabling high-quality video compression and generation at minimal computational cost.

If you find this work useful, consider starring ⭐ the repository and citing our paper!


πŸ“ Cite Us

@misc{cheng2025leanvaeultraefficientreconstructionvae,
      title={LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models}, 
      author={Yu Cheng and Fajie Yuan},
      year={2025},
      eprint={2503.14325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.14325}, 
}

πŸ‘ Acknowledgement

Our work benefits from the contributions of several open-source projects, including OmniTokenizer, Open-Sora-Plan, VidTok, and Latte. We sincerely appreciate their efforts in advancing research and open-source collaboration!
