shihao1895/MemoryVLA
[ICLR 2026] Code of "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation"
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun
ICLR 2026
This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".
🏠Project Page | 📑Paper | 🤗Models & Logs
🌟 News
- 🔥 [2026-1-27] Our paper MemoryVLA is accepted by ICLR 2026!
- 🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
- 🔥 [2025-10-20] Our VLA codebase Dexbotic is released and now fully integrates MemoryVLA!
- 🔥 [2025-8-26] Our paper MemoryVLA is now on arXiv!
Overview
MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.
We release two versions of the code in separate branches:
- MemoryVLA: built upon the OpenVLA codebase.
- MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.
TODO
All components are now available, and we will continue to refine and improve the code.
- Code Release
  - MemoryVLA (OpenVLA codebase)
  - MemoryVLA+ (Dexbotic codebase)
- Model Weights Release
- Dataset Upload to HuggingFace
Contents
This branch contains MemoryVLA built on the OpenVLA codebase. If you need the Dexbotic codebase, please use the MemoryVLA+ branch.
- Model Zoo & Benchmark Results
- Install
- Training
- Evaluation in SimplerEnv
- Evaluation in LIBERO
- Deployment in the Real World
- FAQ
- Citation
Model Zoo & Benchmark Results
All benchmarks use only third-person RGB images and language instructions; no wrist-view images or robot state are used.
MemoryVLA denotes the OpenVLA-codebase version; MemoryVLA+ denotes the Dexbotic-codebase version.
Bridge
| Model | Spoon | Carrot | Cube | Eggplant | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|
| MemoryVLA | 75.0 | 75.0 | 37.5 | 100.0 | 71.9 | 🤗 HF |
| MemoryVLA+ | 100.0 | 66.7 | 70.8 | 100.0 | 84.4 | 🤗 HF |
LIBERO
| Model | Spatial | Object | Goal | Long-10 | Long-90 | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|---|
| MemoryVLA | 98.4 | 98.4 | 96.4 | 93.4 | 95.6 | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ | 98.2 | 97.8 | 96.4 | 93.6 | 96.2 | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ (mix) | 97.2 | 99.2 | 98.4 | 93.2 | 97.2 | 97.1 | 🤗 HF |
Fractal-VM
| Model | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|
| MemoryVLA | 90.7 | 88.0 | 84.7 | 47.2 | 77.7 | 🤗 HF |
| MemoryVLA+ | 92.0 | 91.7 | 71.8 | - | - | 🤗 HF |
Fractal-VA
| Model | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|
| MemoryVLA | 80.5 | 78.8 | 53.2 | 58.3 | 67.7 | 🤗 HF |
| MemoryVLA+ | 83.5 | 81.8 | 63.2 | - | - | 🤗 HF |
ManiSkill2
| Model | Pick Cube | Stack Cube | Pick Single YCB | Pick Single EGAD | Pick Clutter YCB | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|---|
| MemoryVLA+ | 85 | 75 | 60 | 85 | 45 | 70 | 🤗 HF |
Install
The code is built with Python 3.10, PyTorch 2.2.0, and CUDA 12.1. It may run with lower versions, but we have not tested them.
We recommend using Miniconda and setting up an environment:
conda create --name memvla python=3.10
conda activate memvla
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y
If you need the training code, please also install Flash Attention. We use flash-attn==2.5.5:
# Install Flash Attention 2.5.5, this is an example for pytorch2.2-cuda12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Next, clone our repo and install the required packages:
git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .
If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error "Floating point exception (core dumped)", try reinstalling the specific cuBLAS version below:
# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8
Training
- Prepare the training datasets in RLDS format:
  - LIBERO (including the Spatial, Object, Goal, Long-10, and Long-90 suites)
  - Bridge from Open X-Embodiment (OXE)
  - Fractal from Open X-Embodiment (OXE)
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
# Download the LIBERO dataset (processed, ~22 GB)
git clone https://huggingface.co/datasets/shihao1895/libero-rlds
# Download the Bridge dataset (processed, ~157 GB)
git clone https://huggingface.co/datasets/shihao1895/bridge-rlds
# Download the Fractal dataset (processed)
git clone https://huggingface.co/datasets/shihao1895/fractal-rlds
- Download the pretrained models. We use the OpenVLA pretrained model for LIBERO training and the CogACT pretrained model for Bridge and Fractal training.
# Download OpenVLA pretrained checkpoint (~30 GB)
git clone https://huggingface.co/openvla/openvla-7b-prismatic
# Download CogACT pretrained checkpoint (~31 GB)
git clone https://huggingface.co/CogACT/CogACT-Large
- Train the model on different datasets.
Before training, modify several parameters in the corresponding scripts, such as hf_token, wandb_entity, checkpoint paths, dataset paths, and log directories.
We train on a single node with 8× NVIDIA A100 GPUs.
# Train on the Bridge dataset
bash script/train/bridge/train_bridge.sh
# Train on the LIBERO-Spatial dataset
bash script/train/libero/train_libero_spatial.sh
# Train on the LIBERO-Object dataset
bash script/train/libero/train_libero_object.sh
# Train on the LIBERO-Goal dataset
bash script/train/libero/train_libero_goal.sh
# Train on the LIBERO-100 dataset
bash script/train/libero/train_libero_100.sh
# Train on the Fractal dataset
bash script/train/fractal/train_fractal.sh
# Train on real-world data
bash script/train/real_world/train_real.sh
To fine-tune on your own data, follow the instructions in rlds_dataset_builder to convert it to RLDS format. Actions should be end-effector deltas with 7 dimensions: EEF delta XYZ (3) + roll-pitch-yaw (3) + gripper open/close (1). Once your data is ready, place it directly under the <data_root_dir>/custom_finetuning/1.0.0 directory, then set vla.data_mix="custom_finetuning".
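As a minimal sketch of the 7-dimensional action layout described above, the snippet below turns two consecutive absolute end-effector poses into one delta action. The function names, the gripper convention (1.0 = open), and the angle-wrapping step are assumptions for illustration and are not taken from the MemoryVLA code. Subtracting roll-pitch-yaw values directly is only an approximation for small rotations; your RLDS builder may compute rotation deltas differently.

```python
import math

def wrap_angle(angle):
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def eef_delta_action(pose_t, pose_next, gripper):
    """Build [dx, dy, dz, droll, dpitch, dyaw, gripper] from two absolute poses.

    pose_t, pose_next: sequences of 6 floats, XYZ followed by roll-pitch-yaw (rad).
    gripper: 1.0 for open, 0.0 for closed (hypothetical convention).
    """
    xyz = [nxt - cur for cur, nxt in zip(pose_t[:3], pose_next[:3])]
    # Wrap so that a step across +/-pi stays a small delta.
    rpy = [wrap_angle(nxt - cur) for cur, nxt in zip(pose_t[3:6], pose_next[3:6])]
    return xyz + rpy + [float(gripper)]

action = eef_delta_action(
    pose_t=[0.30, 0.00, 0.20, 0.0, 0.0, 3.10],
    pose_next=[0.31, 0.00, 0.18, 0.0, 0.0, -3.10],
    gripper=1.0,
)
print(len(action), action)
```

The yaw values in the example sit on either side of the ±π boundary, so the wrapped delta is a small positive rotation rather than roughly -6.2 rad.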
Evaluation in SimplerEnv
We provide evaluation interfaces and scripts based on SimplerEnv.
- Please follow the installation guide in the SimplerEnv repo to set up the simulation environment, and make sure to place the repo under ./third_libs/SimplerEnv.
- Evaluation example:
# Run evaluation
bash script/eval/bridge/eval_bridge.sh
# Summarize results
python script/eval/bridge/extract_bridge_results.py
NOTE: Due to the instability of the SimplerEnv benchmark and diffusion process, the performance scores across different iterations can vary significantly. Please evaluate multiple checkpoints and report the best result.
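One way to apply the advice in the note is sketched below: given per-task success rates for several checkpoints, pick the checkpoint with the highest task average. The helper name `best_checkpoint` and the dictionary layout are assumptions; the summary script may report results in a different form, and the numbers are placeholders, not results from the paper.

```python
def best_checkpoint(scores_by_step):
    """Return (step, average) for the checkpoint with the highest task average."""
    averages = {
        step: sum(task_scores.values()) / len(task_scores)
        for step, task_scores in scores_by_step.items()
    }
    best_step = max(averages, key=averages.get)
    return best_step, averages[best_step]

# Placeholder numbers for illustration only; they are not measured results.
example_scores = {
    10000: {"spoon": 50.0, "carrot": 50.0, "cube": 25.0, "eggplant": 75.0},
    20000: {"spoon": 75.0, "carrot": 50.0, "cube": 25.0, "eggplant": 100.0},
    30000: {"spoon": 50.0, "carrot": 50.0, "cube": 25.0, "eggplant": 100.0},
}
step, avg = best_checkpoint(example_scores)
print(f"best checkpoint: step {step}, average success {avg:.1f}")
```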
Evaluation in LIBERO
We also provide evaluation interfaces and scripts based on LIBERO.
- Please follow the installation guide in the LIBERO repo to set up the simulation environment, and make sure to place the repo under ./third_libs/LIBERO.
- Evaluation example:
# Run evaluation
bash script/eval/libero/eval_libero.sh
# Summarize results
python script/eval/libero/extract_libero_results.py
NOTE: The evaluation mechanism here differs from SimplerEnv. The process first loads the model using develop.py, then waits for a period before running evaluation/libero/eval_libero.py for testing. In addition, since performance may vary across iterations, please evaluate multiple checkpoints and report the best result.
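If you adapt this load-then-wait flow to your own scripts, one alternative to a fixed delay is to poll until the model process accepts connections. The sketch below assumes the model is exposed on a TCP port, which this README does not state; the function name `wait_for_port` and the host and port in the usage comment are hypothetical.

```python
import socket
import time

def wait_for_port(host, port, timeout_s=600.0, poll_s=2.0):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            time.sleep(poll_s)
    return False

# Hypothetical usage before launching the evaluation client:
# if not wait_for_port("127.0.0.1", 5000):
#     raise RuntimeError("model server did not come up in time")
```

A successful connection only shows that the port is open, not that the model has finished loading, so a short extra delay may still be needed.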
Deployment in the Real World
To deploy the model on your own robot, first collect corresponding real-world manipulation data (e.g., via teleoperation), and use it to fine-tune the pretrained model.
Next, set up the server and client as shown in deploy.py, and deploy the system on your real robot.
The following command launches the server:
bash script/eval/real_world/deploy.sh
The robot acts as the client. For each request, it must send the following three items to obtain the action-chunk result. The field episode_first_frame is a string ('True' or 'False') indicating whether the current frame is the first frame of the episode.
image = request.files['image']
query = request.form['text']
episode_first_frame = request.form['episode_first_frame']
This deployment process follows a similar design to OpenVLA and CogACT.
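A client matching the three fields above might look like the following sketch. It assumes the server accepts an HTTP multipart POST and replies with JSON; the server URL, the helper names, and the reply format are assumptions, so check deploy.py for the actual endpoint and response. Only the field names image, text, and episode_first_frame come from the server snippet.

```python
def build_form_fields(instruction, is_first_frame):
    """Form fields the server reads via request.form; values must be strings."""
    return {
        "text": instruction,
        # The server expects the literal strings 'True' or 'False'.
        "episode_first_frame": "True" if is_first_frame else "False",
    }

def query_action(server_url, image_path, instruction, is_first_frame):
    """POST one frame to the server and return the parsed reply (assumed to be JSON)."""
    import requests  # third-party; imported lazily so the helper above has no dependencies

    with open(image_path, "rb") as image_file:
        response = requests.post(
            server_url,  # hypothetical, e.g. the address configured in deploy.sh
            files={"image": image_file},
            data=build_form_fields(instruction, is_first_frame),
            timeout=30,
        )
    response.raise_for_status()
    return response.json()

print(build_form_fields("pick up the cube", True))
```

Set is_first_frame to True only on the first request of each episode and to False on every later request in that episode.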
FAQ
SimplerEnv and ManiSkill may involve several dependency issues during installation. Below are some common troubleshooting tips based on our experience.
(1) Vulkan / SAPIEN issues
Example errors:
ImportError: libvulkan.so.1: cannot open shared object file: No such file or directory
Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.
Fix:
sudo apt install -y libegl1-mesa libgl1-mesa-dev libgles2-mesa-dev
See also:
https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting
Note: Check that the .json files correctly link to the .so file corresponding to your current NVIDIA driver version. Use nvidia-smi to check your driver version and locate the correct .so under /usr/lib/x86_64-linux-gnu/.
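To see which library a Vulkan ICD manifest points to, a small script such as the sketch below can help. The manifest path is a typical location on Ubuntu with the NVIDIA driver and is an assumption; your system may keep it elsewhere, and the other .json files mentioned in the ManiSkill troubleshooting guide have different layouts.

```python
import json
from pathlib import Path

def icd_library_path(icd_json_text):
    """Return the shared-library path named in a Vulkan ICD manifest."""
    manifest = json.loads(icd_json_text)
    return manifest["ICD"]["library_path"]

# Typical manifest location (an assumption; adjust for your system).
icd_file = Path("/usr/share/vulkan/icd.d/nvidia_icd.json")
if icd_file.exists():
    print(icd_file, "->", icd_library_path(icd_file.read_text()))
else:
    print(f"{icd_file} not found; Vulkan may not see the NVIDIA driver")
```

Compare the printed library name with the files under /usr/lib/x86_64-linux-gnu/ for the driver version reported by nvidia-smi.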
(2) OpenGL issues
Example errors:
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Fix:
sudo apt install -y libgl1 libglib2.0-0 libglx-mesa0 libopengl0 libglu1-mesa mesa-utils
(3) Video recording in SimplerEnv
sudo apt install -y ffmpeg
(4) Benchmark Score Fluctuations
Benchmark scores tend to fluctuate, so we recommend evaluating checkpoints at regular iteration intervals and reporting the best result. Moreover, we have observed that even slight differences in Conda package versions may lead to small variations in the scores.
Citation
If you find our work helpful in your research, please consider citing our paper.
@article{shi2025memoryvla,
title={MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation},
author={Shi, Hao and Xie, Bin and Liu, Yingfei and Sun, Lin and Liu, Fengrong and Wang, Tiancai and Zhou, Erjin and Fan, Haoqiang and Zhang, Xiangyu and Huang, Gao},
journal={arXiv preprint arXiv:2508.19236},
year={2025}
}
@article{dexbotic,
title={Dexbotic: Open-Source Vision-Language-Action Toolbox},
author={Dexbotic Contributors},
journal={arXiv preprint arXiv:2510.23511},
year={2025}
}