MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun

ICLR 2026

This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".

🏠Project Page | 📑Paper | 🤗Models & Logs

🌟 News

🔥 [2026-1-27] Our paper MemoryVLA is accepted by ICLR 2026!
🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
🔥 [2025-10-20] Our VLA codebase Dexbotic is released, it now fully integrates MemoryVLA !
🔥 [2025-8-26] Our paper MemoryVLA is now on arxiv!

Overview

MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.

We release two versions of the code in separate branches:

MemoryVLA: built upon the OpenVLA codebase.
MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.

TODO

All components are now available, and we will continue to refine and improve the code.

This is MemoryVLA based on OpenVLA codebase, if you need use dexbotic codebase, please use MemoryVLA+.

Model Zoo & Benchmark Results
Install
Training
Evaluation in SimplerEnv
Evaluation in LIBERO
Deployment in The Real World
FAQ
Citation

Model Zoo & Benchmark Results

All datasets use only third-person RGB and language, without using wrist-view images or state.
MemoryVLA means openvla-codebase version, MemoryVLA+ means dexbotic-codebase version.

Bridge

Model	Spoon	Carrot	Cube	Eggplant	Avg.	CKPT & Logs
MemoryVLA	75.0	75.0	37.5	100.0	71.9	🤗 HF
MemoryVLA+	100.0	66.7	70.8	100.0	84.4	🤗 HF

LIBERO

Model	Spatial	Object	Goal	Long-10	Long-90	Avg.	CKPT & Logs
MemoryVLA	98.4	98.4	96.4	93.4	95.6	96.5	🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100
MemoryVLA+	98.2	97.8	96.4	93.6	96.2	96.5	🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100
MemoryVLA+ (mix)	97.2	99.2	98.4	93.2	97.2	97.1	🤗 HF

Fractal-VM

Model	Coke Can	Move Near	Open/Close Drawer	Put In Drawer	Avg.	CKPT & Logs
MemoryVLA	90.7	88.0	84.7	47.2	77.7	🤗 HF
MemoryVLA+	92.0	91.7	71.8	-	-	🤗 HF

Fractal-VA

Model	Coke Can	Move Near	Open/Close Drawer	Put In Drawer	Avg.	CKPT & Logs
MemoryVLA	80.5	78.8	53.2	58.3	67.7	🤗 HF
MemoryVLA+	83.5	81.8	63.2	-	-	🤗 HF

Maniskill2

Model	Pick Cube	Stack Cube	Pick Single YCB	Pick Single EGAD	Pick Clutter YCB	Avg.	CKPT & Logs
MemoryVLA+	85	75	60	85	45	70	🤗 HF

Install

The code is built using Python 3.10, and we use PyTorch == 2.2.0 and CUDA == 12.1 (It may run with lower versions, but we have not tested it).

We recommend using Miniconda and setting up an environment:

conda create --name memvla python=3.10
conda activate memvla

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y

If you need to use the traning code, please also install the Flash Attention, we use flash-attn==2.5.5:

# Install Flash Attention 2.5.5, this is an example for pytorch2.2-cuda12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Next, clone our repo and install the required packages:

git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .

If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error
“Floating point exception (core dumped)”, try reinstalling the specific cuBLAS version below:

# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8

Training

Prepare training dataset with RLDS format:

LIBERO (including Spatial, Object, Goal, Long-10, Long-90 suites)
Bridge from Open X-Embodiment (OXE)
Fractal from Open X-Embodiment (OXE)

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
# Download the LIBERO dataset (processed, ~22 GB)
git clone https://huggingface.co/datasets/shihao1895/libero-rlds
# Download the Bridge dataset (processed, ~157 GB)
git clone https://huggingface.co/datasets/shihao1895/bridge-rlds
# Download the Fractal dataset (processed)
git clone https://huggingface.co/datasets/shihao1895/fractal-rlds

Download pretrained model, we use OpenVLA Pretrained Model for LIBERO training, and CogACT Pretrained Model for Bridge and Fractal training.

# Download OpenVLA pretrained checkpoint (~30 GB)
git clone https://huggingface.co/openvla/openvla-7b-prismatic

# Download CogACT pretrained checkpoint (~31 GB)
git clone https://huggingface.co/CogACT/CogACT-Large

Train the model on different datasets

Before training, modify several parameters in the corresponding scripts, such as hf_token, wandb_entity, checkpoint paths, dataset paths, and log directories.

We train on a single node with 8× NVIDIA A100 GPUs.
```
# Train on the Bridge dataset
bash script/train/bridge/train_bridge.sh
# Train on the LIBERO-Spatial dataset
bash script/train/libero/train_libero_spatial.sh
# Train on the LIBERO-Object dataset
bash script/train/libero/train_libero_object.sh
# Train on the LIBERO-Goal dataset
bash script/train/libero/train_libero_goal.sh
# Train on the LIBERO-100 dataset
bash script/train/libero/train_libero_100.sh
# Train on the Fractal dataset
bash script/train/fractal/train_fractal.sh
# Train on real-world data
bash script/train/real_world/train_real.sh
```
To finetune on your own customized data, please follow the instruction (rlds_dataset_builder) for converting your data to RLDS format. The actions should be the deltas of end effector EEF Delta XYZ (3) + Roll-Pitch-Yaw (3) + Gripper Open/Close (1). Once your customized data is ready, place the customized data directly under the <data_root_dir>/custom_finetuning/1.0.0 directory. Then set vla.data_mix="custom_finetuning".

Evaluation in SimplerEnv

We provide evaluation interfaces and scripts based on SimplerEnv.

Please follow the installation guide in the SimplerEnv Repo to set up the simulation environment, and make sure to place the repo under: ./third_libs/SimplerEnv
Evaluation Example.
```
# Run evaluation
bash script/eval/bridge/eval_bridge.sh
# Summarize results
python script/eval/bridge/extract_bridge_results.py
```
NOTE: Due to the instability of the SimplerEnv benchmark and diffusion process, the performance scores across different iterations can vary significantly. Please evaluate multiple checkpoints and report the best result.

Evaluation in LIBERO

We also provide evaluation interfaces and scripts based on LIBERO.

Please follow the installation guide in the LIBERO Repo to set up the simulation environment, and make sure to place the repo under: ./third_libs/LIBERO
Evaluation Example.
```
# Run evaluation
bash script/eval/libero/eval_libero.sh
# Summarize results
python script/eval/libero/extract_libero_results.py
```
NOTE: The evaluation mechanism here is different from SimplerEnv. The process first loads the model using develop.py, then waits for a period before running evaluation/libero/eval_libero.py for testing. In addition, since performance may vary across iterations, please evaluate multiple checkpoints and report the best result.

Deployment in the Real World

To deploy the model on your own robot, first collect corresponding real-world manipulation data (e.g., via teleoperation), and use it to fine-tune the pretrained model.

Next, set up the server and client as shown in deploy.py, and deploy the system on your real robot.

The following command launches the server:

bash script/eval/real_world/deploy.sh

The robot acts as the client, and for each request it must send the following three items to obtain the action chunking result. The field episode_first_frame is a string ('True' or 'False') indicating whether the current frame is the first frame of the episode.

image = request.files['image']
query = request.form['text']
episode_first_frame = request.form['episode_first_frame']

This deployment process follows a similar design to OpenVLA and CogACT.

FAQ

SimplerEnv and ManiSkill may involve several dependency issues during installation. Below are some common troubleshooting tips based on our experience.

(1) Vulkan / SAPIEN issues
Example errors:
ImportError: libvulkan.so.1: cannot open shared object file: No such file or directory
Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.

Fix:

sudo apt install -y libegl1-mesa libgl1-mesa-dev libgles2-mesa-dev

and reference:
https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting

Note: Check that the .json files correctly link to the .so file corresponding to your current NVIDIA driver version. Use nvidia-smi to check your driver version and locate the correct .so under /usr/lib/x86_64-linux-gnu/.

(2) OpenGL issues
Example errors:
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Fix:

sudo apt install -y libgl1 libglib2.0-0 libglx-mesa0 libopengl0 libglu1-mesa mesa-utils

(3) Video recording in SimplerEnv

sudo apt install -y ffmpeg

(4) Benchmark Score Fluctuations

Benchmark scores tend to fluctuate, so we recommend evaluating checkpoints at regular iteration intervals and reporting the best result. Moreover, we have observed that even slight differences in Conda package versions may lead to small variations in the scores.

Citation

If you find our work helpful in your research, please consider citing our paper.

@article{shi2025memoryvla,
  title={MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation},
  author={Shi, Hao and Xie, Bin and Liu, Yingfei and Sun, Lin and Liu, Fengrong and Wang, Tiancai and Zhou, Erjin and Fan, Haoqiang and Zhang, Xiangyu and Huang, Gao},
  journal={arXiv preprint arXiv:2508.19236},
  year={2025}
}

@article{dexbotic,
  title={Dexbotic: Open-Source Vision-Language-Action Toolbox},
  author={Dexbotic Contributors},
  journal={arXiv preprint arXiv:2510.23511},
  year={2025}
}

shihao1895/MemoryVLA