GitHunt
SH

shihao1895/MemoryVLA

[ICLR 2026] Code of "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation"

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun

ICLR 2026

This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".

🏠Project Page | 📑Paper | 🤗Models & Logs

🌟 News

  • 🔥 [2026-1-27] Our paper MemoryVLA is accepted by ICLR 2026!
  • 🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
  • 🔥 [2025-10-20] Our VLA codebase Dexbotic is released, it now fully integrates MemoryVLA !
  • 🔥 [2025-8-26] Our paper MemoryVLA is now on arxiv!

Overview

MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.

MemoryVLA Overview

We release two versions of the code in separate branches:

  • MemoryVLA: built upon the OpenVLA codebase.
  • MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.

TODO

All components are now available, and we will continue to refine and improve the code.

  • Code Release

    • MemoryVLA (OpenVLA codebase)
    • MemoryVLA+ (Dexbotic codebase)
  • Model Weights Release

  • Dataset Upload to HuggingFace

Contents

This is MemoryVLA based on OpenVLA codebase, if you need use dexbotic codebase, please use MemoryVLA+.

Model Zoo & Benchmark Results

All datasets use only third-person RGB and language, without using wrist-view images or state.
MemoryVLA means openvla-codebase version, MemoryVLA+ means dexbotic-codebase version.

Bridge

Model Spoon Carrot Cube Eggplant Avg. CKPT & Logs
MemoryVLA 75.0 75.0 37.5 100.0 71.9 🤗 HF
MemoryVLA+ 100.0 66.7 70.8 100.0 84.4 🤗 HF

LIBERO

Model Spatial Object Goal Long-10 Long-90 Avg. CKPT & Logs
MemoryVLA 98.4 98.4 96.4 93.4 95.6 96.5 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100
MemoryVLA+ 98.2 97.8 96.4 93.6 96.2 96.5 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100
MemoryVLA+ (mix) 97.2 99.2 98.4 93.2 97.2 97.1 🤗 HF

Fractal-VM

Model Coke Can Move Near Open/Close Drawer Put In Drawer Avg. CKPT & Logs
MemoryVLA 90.7 88.0 84.7 47.2 77.7 🤗 HF
MemoryVLA+ 92.0 91.7 71.8 - - 🤗 HF

Fractal-VA

Model Coke Can Move Near Open/Close Drawer Put In Drawer Avg. CKPT & Logs
MemoryVLA 80.5 78.8 53.2 58.3 67.7 🤗 HF
MemoryVLA+ 83.5 81.8 63.2 - - 🤗 HF

Maniskill2

Model Pick Cube Stack Cube Pick Single YCB Pick Single EGAD Pick Clutter YCB Avg. CKPT & Logs
MemoryVLA+ 85 75 60 85 45 70 🤗 HF

Install

The code is built using Python 3.10, and we use PyTorch == 2.2.0 and CUDA == 12.1 (It may run with lower versions, but we have not tested it).

We recommend using Miniconda and setting up an environment:

conda create --name memvla python=3.10
conda activate memvla

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y

If you need to use the traning code, please also install the Flash Attention, we use flash-attn==2.5.5:

# Install Flash Attention 2.5.5, this is an example for pytorch2.2-cuda12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Next, clone our repo and install the required packages:

git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .

If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error
“Floating point exception (core dumped)”, try reinstalling the specific cuBLAS version below:

# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8

Training

  1. Prepare training dataset with RLDS format:

    # Make sure you have git-lfs installed (https://git-lfs.com)
    git lfs install
    # Download the LIBERO dataset (processed, ~22 GB)
    git clone https://huggingface.co/datasets/shihao1895/libero-rlds
    # Download the Bridge dataset (processed, ~157 GB)
    git clone https://huggingface.co/datasets/shihao1895/bridge-rlds
    # Download the Fractal dataset (processed)
    git clone https://huggingface.co/datasets/shihao1895/fractal-rlds
  2. Download pretrained model, we use OpenVLA Pretrained Model for LIBERO training, and CogACT Pretrained Model for Bridge and Fractal training.

    # Download OpenVLA pretrained checkpoint (~30 GB)
    git clone https://huggingface.co/openvla/openvla-7b-prismatic
    
    # Download CogACT pretrained checkpoint (~31 GB)
    git clone https://huggingface.co/CogACT/CogACT-Large
  3. Train the model on different datasets

    Before training, modify several parameters in the corresponding scripts, such as hf_token, wandb_entity, checkpoint paths, dataset paths, and log directories.

    We train on a single node with 8× NVIDIA A100 GPUs.

    # Train on the Bridge dataset
    bash script/train/bridge/train_bridge.sh
    # Train on the LIBERO-Spatial dataset
    bash script/train/libero/train_libero_spatial.sh
    # Train on the LIBERO-Object dataset
    bash script/train/libero/train_libero_object.sh
    # Train on the LIBERO-Goal dataset
    bash script/train/libero/train_libero_goal.sh
    # Train on the LIBERO-100 dataset
    bash script/train/libero/train_libero_100.sh
    # Train on the Fractal dataset
    bash script/train/fractal/train_fractal.sh
    # Train on real-world data
    bash script/train/real_world/train_real.sh

    To finetune on your own customized data, please follow the instruction (rlds_dataset_builder) for converting your data to RLDS format. The actions should be the deltas of end effector EEF Delta XYZ (3) + Roll-Pitch-Yaw (3) + Gripper Open/Close (1). Once your customized data is ready, place the customized data directly under the <data_root_dir>/custom_finetuning/1.0.0 directory. Then set vla.data_mix="custom_finetuning".

Evaluation in SimplerEnv

We provide evaluation interfaces and scripts based on SimplerEnv.

  1. Please follow the installation guide in the SimplerEnv Repo to set up the simulation environment, and make sure to place the repo under: ./third_libs/SimplerEnv

  2. Evaluation Example.

    # Run evaluation
    bash script/eval/bridge/eval_bridge.sh
    # Summarize results
    python script/eval/bridge/extract_bridge_results.py

    NOTE: Due to the instability of the SimplerEnv benchmark and diffusion process, the performance scores across different iterations can vary significantly. Please evaluate multiple checkpoints and report the best result.

Evaluation in LIBERO

We also provide evaluation interfaces and scripts based on LIBERO.

  1. Please follow the installation guide in the LIBERO Repo to set up the simulation environment, and make sure to place the repo under: ./third_libs/LIBERO

  2. Evaluation Example.

    # Run evaluation
    bash script/eval/libero/eval_libero.sh
    # Summarize results
    python script/eval/libero/extract_libero_results.py

    NOTE: The evaluation mechanism here is different from SimplerEnv. The process first loads the model using develop.py, then waits for a period before running evaluation/libero/eval_libero.py for testing. In addition, since performance may vary across iterations, please evaluate multiple checkpoints and report the best result.

Deployment in the Real World

To deploy the model on your own robot, first collect corresponding real-world manipulation data (e.g., via teleoperation), and use it to fine-tune the pretrained model.

Next, set up the server and client as shown in deploy.py, and deploy the system on your real robot.

The following command launches the server:

bash script/eval/real_world/deploy.sh

The robot acts as the client, and for each request it must send the following three items to obtain the action chunking result. The field episode_first_frame is a string ('True' or 'False') indicating whether the current frame is the first frame of the episode.

image = request.files['image']
query = request.form['text']
episode_first_frame = request.form['episode_first_frame']

This deployment process follows a similar design to OpenVLA and CogACT.

FAQ

SimplerEnv and ManiSkill may involve several dependency issues during installation. Below are some common troubleshooting tips based on our experience.

(1) Vulkan / SAPIEN issues
Example errors:
ImportError: libvulkan.so.1: cannot open shared object file: No such file or directory
Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.

Fix:

sudo apt install -y libegl1-mesa libgl1-mesa-dev libgles2-mesa-dev

and reference:
https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting

Note: Check that the .json files correctly link to the .so file corresponding to your current NVIDIA driver version. Use nvidia-smi to check your driver version and locate the correct .so under /usr/lib/x86_64-linux-gnu/.

(2) OpenGL issues
Example errors:
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Fix:

sudo apt install -y libgl1 libglib2.0-0 libglx-mesa0 libopengl0 libglu1-mesa mesa-utils

(3) Video recording in SimplerEnv

sudo apt install -y ffmpeg

(4) Benchmark Score Fluctuations

Benchmark scores tend to fluctuate, so we recommend evaluating checkpoints at regular iteration intervals and reporting the best result. Moreover, we have observed that even slight differences in Conda package versions may lead to small variations in the scores.

Citation

If you find our work helpful in your research, please consider citing our paper.

@article{shi2025memoryvla,
  title={MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation},
  author={Shi, Hao and Xie, Bin and Liu, Yingfei and Sun, Lin and Liu, Fengrong and Wang, Tiancai and Zhou, Erjin and Fan, Haoqiang and Zhang, Xiangyu and Huang, Gao},
  journal={arXiv preprint arXiv:2508.19236},
  year={2025}
}

@article{dexbotic,
  title={Dexbotic: Open-Source Vision-Language-Action Toolbox},
  author={Dexbotic Contributors},
  journal={arXiv preprint arXiv:2510.23511},
  year={2025}
}