The official PyTorch code for MoRo: Masked Modeling for Human Motion Recovery Under Occlusions
Masked Modeling for Human Motion Recovery Under Occlusions
Project Page | Paper
Zhiyin Qian,
Siwei Zhang,
Bharat Lal Bhatnagar,
Federica Bogo,
Siyu Tang
3DV 2026
Installation
git clone https://github.com/mikeqzy/MoRo
conda env create -f environment.yml
conda activate moro
Data preparation
SMPL(-X) body model
We use smplfitter to fit the non-parametric mesh to SMPL-X parameters. Please follow their provided script to download the body model files and put them under body_models.
Additionally, download the mesh connection matrices for the SMPL-X topology (used by the fully convolutional mesh autoencoder in Mesh-VQ-VAE) and other regressors used for evaluation here. Please also extract and put them under body_models.
Tokenization
We train the tokenizer on AMASS, MOYO, and BEDLAM. Download the SMPL-X neutral annotations from their official project pages and unzip the files.
We preprocess the datasets with the scripts at models/mesh_vq_vae/data/preprocess; please change the dataset paths accordingly.
MoRo
MoRo is trained on a mixture of datasets prepared as follows:
- AMASS: We preprocess AMASS into 30 fps sequences following RoHM; please refer to the instructions here.
- MPII, Human3.6M, MPI-INF-3DHP, COCO: The training dataset uses the SMPL annotations from BEDLAM. Follow the instructions here in the section "Training CLIFF model with real images" to obtain the required training images and annotations. We further convert the SMPL annotations to SMPL-X using the provided script at scripts/preprocess/process_hmr_smplx.py.
- BEDLAM: Download the BEDLAM dataset from their official project page. We use the SMPL-X neutral annotations for training.
- EgoBody: First, download the EgoBody dataset from the official project page. Additionally, download keypoints_cleaned and mask_joint from here and egobody_occ_info.csv from here, then place them under the dataset directory. Finally, run the provided preprocessing script at scripts/preprocess/process_egobody_bbox.py to generate the bounding box files.
Testing data
Additionally, we test our method on PROX and RICH. Follow the dataset-specific steps below to place files in the expected locations.
- RICH: Download the RICH dataset from their official project page. The preprocessed annotations can be downloaded from the GVHMR repo. Put the hmr4d_support folder under the RICH dataset directory.
- PROX: First, download the PROX dataset from their official project page. Additionally, download keypoints_openpose and mask_joint from here and place them under the dataset directory. Finally, run the provided preprocessing script at scripts/preprocess/process_prox_bbox.py to generate the bounding box files.
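The bounding-box preprocessing scripts are provided in the repo; as a rough, hypothetical illustration of what deriving a box from 2D keypoints typically involves (the function name, confidence threshold, and padding factor below are assumptions, not the repo's actual implementation):

```python
import numpy as np

def bbox_from_keypoints(keypoints, conf_thresh=0.3, scale=1.2):
    """Compute a padded square bounding box (cx, cy, size) from
    OpenPose-style keypoints of shape (N, 3) = (x, y, confidence).
    Returns None if no keypoint passes the confidence threshold."""
    valid = keypoints[keypoints[:, 2] > conf_thresh]
    if len(valid) == 0:
        return None
    x_min, y_min = valid[:, :2].min(axis=0)
    x_max, y_max = valid[:, :2].max(axis=0)
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Pad the tight box so the full body stays inside the crop.
    size = max(x_max - x_min, y_max - y_min) * scale
    return cx, cy, size
```

Low-confidence joints (e.g. occluded limbs) are excluded so they do not distort the crop.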
Checkpoints
Make sure the required pretrained weights and released checkpoints are placed at the exact paths below.
The checkpoint for Mesh-VQ-VAE and MoRo can be downloaded here. Place the tokenizer checkpoint at ckpt/tokenizer/tokenizer.ckpt and MoRo checkpoint at exp/mask_transformer/MIMO-vit-release/video_train/checkpoints/last.ckpt.
To train from scratch, we use the pretrained ViT backbone weights from 4DHumans (download the model from the Training section of the official repo). Place them at ckpt/backbones/vit_pose_hmr2.pth.
Structure
The data should be organized as follows:
MoRo
├── body_models
│ ├── smpl
│ ├── smplh
│ ├── smplx
│ ├── smplx_ConnectionMatrices
│ ├── J_regressor_h36m.npy
│ ├── smpl_neutral_J_regressor.pt
│ ├── smplx2smpl_sparse.pt
├── ckpt
│ ├── backbones
│ │ ├── vit_pose_hmr2.pth
│ ├── tokenizer
│ │ ├── tokenizer.ckpt
├── exp
│ ├── mask_transformer
│ │ ├── MIMO-vit-release
│ │ │ ├── video_train
│ │ │ │ ├── checkpoints
│ │ │ │ │ ├── last.ckpt
├── datasets
│ ├── mesh_vq_vae
│ │ ├── bedlam_animations
│ │ ├── AMASS_smplx
│ │ ├── MOYO
│ ├── mask_transformer
│ │ ├── AMASS
│ │ ├── BEDLAM
│ │ ├── coco
│ │ ├── h36m_train
│ │ ├── mpi-inf-3dhp
│ │ ├── mpii
│ │ ├── EgoBody
│ │ ├── PROX
│ │ ├── rich
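Before running the demo or training, it can help to confirm the layout above is in place. The following is a small hypothetical helper (not part of the repo) that checks a few key paths; extend the list with the datasets you actually use:

```python
from pathlib import Path

# A few key paths taken from the layout above.
REQUIRED = [
    "body_models/smplx",
    "ckpt/tokenizer/tokenizer.ckpt",
    "exp/mask_transformer/MIMO-vit-release/video_train/checkpoints/last.ckpt",
]

def missing_paths(root, required=REQUIRED):
    """Return the subset of required paths that do not exist under root."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]

if __name__ == "__main__":
    for p in missing_paths("."):
        print(f"missing: {p}")
```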
Demo
For a quick demo on a custom video taken with a static camera, run the following command on a 30 fps video:
python demo.py option=demo demo.video_path=/path/to/demo.mp4 demo.name=<video_name> demo.focal_length=<focal_length>
or an image directory with sorted frames:
python demo.py demo.video_path=/path/to/image_dir demo.name=<video_name> demo.focal_length=<focal_length>
By default, the rendering results are saved to ./exp/mask_transformer/MIMO-vit-release/video_train, the same directory as the released model checkpoint.
The focal length is optional; if not provided, it is estimated via HumanFOV from CameraHMR.
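If the camera's horizontal field of view is known, the focal length in pixels follows from the standard pinhole relation f = (W / 2) / tan(FOV / 2); this is generic camera geometry, not this repo's estimation code:

```python
import math

def focal_from_fov(fov_deg, image_width_px):
    """Pinhole model: focal length in pixels from horizontal FOV
    (degrees) and image width (pixels)."""
    return (image_width_px / 2) / math.tan(math.radians(fov_deg) / 2)

# e.g. a 90-degree horizontal FOV on a 1920 px wide frame gives f = 960 px
```

The resulting value can be passed to the demo via demo.focal_length.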
Training
Training consists of (1) training the mesh tokenizer and (2) multi-stage training for MoRo. Configuration locations are listed in each subsection.
Tokenization
You can train the mesh tokenizer by running:
python train_mesh_vqvae.py
The configuration file is at configs/mesh_vq_vae/config.yaml.
MoRo
We adopt a multi-stage training strategy for MoRo:
# Stage 1: Pose pretraining
python train_mask_transformer.py option=pose_pretrain tag=default
# Stage 2: Motion pretraining
python train_mask_transformer.py option=motion_pretrain tag=default
# Stage 3: Image pretraining on image datasets
python train_mask_transformer.py option=image_pretrain tag=default
# Stage 4: Video pretraining on video datasets
python train_mask_transformer.py option=video_pretrain tag=default
# Stage 5: Finetuning on video datasets
python train_mask_transformer.py option=video_train tag=default
The base configuration file is at configs/mask_transformer/config.yaml; the specific options for each training stage can be found in configs/option.
The training logs and checkpoints will be saved under exp/mask_transformer/MIMO-vit-<tag>/<stage>.
Testing and Evaluation
Inference writes results to exp/mask_transformer/MIMO-vit-<tag>/video_train, and the evaluation scripts then read from the corresponding result directories.
We set tag=release here to reproduce the results reported in the paper.
EgoBody
python train_mask_transformer.py option=[inference,video_train] tag=release data=egobody
python eval_egobody.py --saved_data_dir=./exp/mask_transformer/MIMO-vit-release/video_train/result_egobody/inference_5_1 --recording_name=all --render
RICH
python train_mask_transformer.py option=[inference,video_train] tag=release data=rich
python eval_rich.py --saved_data_dir=./exp/mask_transformer/MIMO-vit-release/video_train/result_rich/inference_5_1 --seq_name=all --render
PROX
python train_mask_transformer.py option=[inference,video_train] tag=release data=prox
python eval_egobody.py --saved_data_dir=./exp/mask_transformer/MIMO-vit-release/video_train/result_prox/inference_5_1 --recording_name=all --render
Acknowledgements
This work was supported as part of the Swiss AI initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID #36 on Alps, enabling large-scale training.
Some code in this repository is adapted from the following repositories:
Citation
If you find this code useful for your research, please use the following BibTeX entry.
@inproceedings{qian2026moro,
  title={Masked Modeling for Human Motion Recovery Under Occlusions},
  author={Qian, Zhiyin and Zhang, Siwei and Bhatnagar, Bharat Lal and Bogo, Federica and Tang, Siyu},
  booktitle={3DV},
  year={2026}
}
