PersonaVLM: Long-Term Personalized Multimodal LLMs (CVPR 2026)
Official implementation of PersonaVLM, a multimodal agent framework designed for long-term personalization.
Table of Contents
- News
- Overview
- Persona-MME Benchmark
- Performance
- Quick Start
- Interactive Gradio Demo
- Citation
News
- [2026.03] PersonaVLM is officially open-sourced! Model weights, training data, and the Persona-MME benchmark are now released.
Overview
PersonaVLM transforms general-purpose MLLMs (e.g., Qwen2.5-VL) into personalized assistants capable of long-term interaction. It achieves this through a collaborative two-stage process featuring three core capabilities:
- Remembering: Proactively extracts and summarizes multimodal conversational histories into a structured, multi-type database (Core, Semantic, Episodic, and Procedural memories).
- Reasoning: Conducts multi-turn reasoning by dynamically retrieving and integrating relevant long-term memories based on the conversation context.
- Response Alignment: Infers the user's latest latent traits using a Momentum-based Personality Evolving Mechanism (PEM), ensuring generated responses stay deeply aligned with the user's evolving characteristics.
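The momentum-based update behind PEM can be pictured as an exponential moving average over inferred trait scores. The exact formulation is not given in this README; the helper below is only a minimal sketch of that idea, with hypothetical trait names:

```python
# Minimal sketch of a momentum-style trait update (hypothetical
# formulation and trait names; the paper's actual PEM update may differ).

def update_traits(prev: dict, observed: dict, momentum: float = 0.9) -> dict:
    """Blend previously inferred trait scores with newly observed evidence.

    A high momentum keeps the user profile stable across sessions, while
    the observed signal from the latest session nudges each trait toward
    its new estimate.
    """
    traits = dict(prev)
    for name, score in observed.items():
        old = traits.get(name, score)  # unseen traits start at the observed value
        traits[name] = momentum * old + (1.0 - momentum) * score
    return traits

profile = {"formality": 0.8, "verbosity": 0.3}
profile = update_traits(profile, {"formality": 0.2, "humor": 0.9})
# "formality" moves slightly toward 0.2 (0.74); "humor" is initialized at 0.9
```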
Persona-MME Benchmark
To rigorously evaluate long-term personalization in multimodal settings, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases designed to assess MLLMs across 14 fine-grained personalization tasks.
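Since the benchmark reports scores across 14 fine-grained tasks, evaluation boils down to per-task aggregation. The snippet below is purely illustrative of that bookkeeping; the actual Persona-MME schema and metrics are defined in eval.py, and the task names here are made up:

```python
# Hypothetical per-task aggregation; the real schema/metrics live in eval.py.
from collections import defaultdict

def per_task_accuracy(results):
    """results: iterable of (task_name, is_correct) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, ok in results:
        total[task] += 1
        correct[task] += int(ok)
    return {task: correct[task] / total[task] for task in total}

scores = per_task_accuracy([
    ("preference_recall", True),   # fictional task names for illustration
    ("preference_recall", False),
    ("episodic_qa", True),
])
# → {"preference_recall": 0.5, "episodic_qa": 1.0}
```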
Performance
Extensive experiments demonstrate that PersonaVLM significantly enhances a model's personalization capabilities and consistently outperforms strong counterparts, including proprietary models like GPT-4o and leading open-source alternatives. Under a 128k context setting, PersonaVLM achieves substantial improvements of 22.4% on Persona-MME and 9.8% on PERSONAMEM, surpassing GPT-4o by 5.2% and 2.0%, respectively.
Quick Start
1. Repository Structure
Before starting, ensure your project directory is organized as follows. The core reasoning and memory management logic is encapsulated within the PersonaVLM/ module.
PersonaVLM/
├── checkpoints/
│   ├── rl/                                     # Download PersonaVLM weights here
│   ├── openai/clip-vit-base-patch32            # Local CLIP model
│   └── sentence-transformers/all-MiniLM-L6-v2
├── data/
│   ├── Persona-MME/                            # Benchmark dataset for evaluation
│   └── training/                               # SFT and RL synthesized datasets
├── PersonaVLM/                                 # Core PersonaVLM agent logic
│   ├── model.py
│   ├── PersonaVLMAgent.py
│   ├── prompts.py
│   ├── retriever.py
│   ├── tools.py
│   └── utils.py
├── train/                                      # SFT and RL training scripts
├── eval.py                                     # Evaluation script for Persona-MME
├── inference.py                                # CLI inference script
└── gradio_demo.py                              # Interactive web UI
2. Environment Setup
Create a conda environment and install the required dependencies.
conda create -n PersonaVLM python=3.10 -y
conda activate PersonaVLM
pip install -r requirements.txt

Warning: Since PersonaVLM is built upon Qwen2.5-VL, you must install transformers==4.51.3 to prevent compatibility issues.
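A quick way to confirm the pin before running anything (only the required version string comes from this README; the helper itself is generic):

```python
# Sanity check for the transformers pin required above.
import importlib.metadata

REQUIRED = "4.51.3"

def transformers_pin_ok(required: str = REQUIRED) -> bool:
    """True iff transformers is installed at exactly the pinned version."""
    try:
        return importlib.metadata.version("transformers") == required
    except importlib.metadata.PackageNotFoundError:
        return False

if not transformers_pin_ok():
    print(f"Warning: transformers=={REQUIRED} is required for Qwen2.5-VL compatibility.")
```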
3. Prepare Models and Data
- Model Weights: Download the official PersonaVLM model and place it under ./checkpoints/rl.
- Retrieval Models: Download the CLIP and all-MiniLM-L6-v2 models into the ./checkpoints/ directory.
- Datasets: Download Persona-MME (for evaluation) and PersonaVLM-Dataset (for training) into ./data/.
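To verify everything landed in the right place, a small check against the layout above can help (the path list is taken from the repository structure shown earlier):

```python
# Convenience check that the expected checkpoints and datasets are in place.
from pathlib import Path

REQUIRED_PATHS = [
    "checkpoints/rl",
    "checkpoints/openai/clip-vit-base-patch32",
    "checkpoints/sentence-transformers/all-MiniLM-L6-v2",
    "data/Persona-MME",
    "data/training",
]

def missing_assets(root: str = ".") -> list[str]:
    """Return the required paths that do not exist under `root`."""
    return [p for p in REQUIRED_PATHS if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All model weights and datasets are in place.")
```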
4. Inference
We provide a CLI script to chat with the PersonaVLM agent. It supports both standard reasoning and forced retrieval configurations:
python inference.py
# Optional arguments:
# --force-retrieve : Force the agent to execute memory retrieval for every message.
# --reasoning-mode : Flag to toggle reasoning (default is True; passing this sets it to False).

5. Evaluation (Persona-MME)
To evaluate the model's long-term personalization capabilities on the Persona-MME benchmark:
python eval.py \
--model_path ./checkpoints/rl \
--bench_context 32k \
    --save_dir ./output/32k_Persona_MME_results

6. Training Pipeline
PersonaVLM is trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
Stage 1: Supervised Fine-Tuning (SFT)
Note on Qwen2.5-VL: The Qwen2.5-VL repository used for training does not support mixing pure-text and multimodal samples in a single batch. To resolve this, we append a dummy conversation to each text-only sample and mask the final interaction step during loss computation (Implementation details in ./train/sft/Qwen2.5-VL/qwen-vl-finetune/qwenvl/data/data_qwen.py).
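The dummy-conversation trick can be sketched as follows. The message structures and helper below are illustrative only; the project's actual logic lives in ./train/sft/Qwen2.5-VL/qwen-vl-finetune/qwenvl/data/data_qwen.py:

```python
# Sketch of padding a text-only sample with a dummy multimodal turn
# (illustrative structures, not the real data_qwen.py implementation).
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

DUMMY_TURN = [
    {"role": "user", "content": [{"type": "image"},
                                 {"type": "text", "text": "Describe the image."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "A placeholder image."}]},
]

def pad_text_only_sample(conversation: list) -> tuple:
    """Append a dummy multimodal turn so every batch item carries an image.

    Returns the padded conversation and the index of the first dummy
    message; labels for messages from that index onward are set to
    IGNORE_INDEX so the dummy turn never contributes to the loss.
    """
    mask_from = len(conversation)
    return conversation + DUMMY_TURN, mask_from

convo = [
    {"role": "user", "content": [{"type": "text", "text": "Hi!"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Hello!"}]},
]
padded, mask_from = pad_text_only_sample(convo)
# len(padded) == 4; messages at index 2 and 3 are the masked dummy turn
```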
# 1. Regenerate data with dummy convs
python ./data/training/sft/regenerate.py
# 2. Launch SFT training
cd ./train/sft/Qwen2.5-VL/qwen-vl-finetune
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 sh scripts/sft_PersonaVLM.sh

Stage 2: Reinforcement Learning (GRPO)
Our RL stage is implemented based on the ms-swift framework. Key modifications for our custom reward design are located in scripts/qwen_server.py and examples/train/grpo/plugin/plugin.py.
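At its core, a GRPO reward maps each rollout to a scalar. The stand-in below only illustrates that contract; it is not ms-swift's plugin interface, and a real reward here would query the deployed judge model rather than do keyword matching (see scripts/qwen_server.py and examples/train/grpo/plugin/plugin.py for the project's actual reward code):

```python
# Simplified (completion -> scalar) reward, purely for illustration.
def personalization_reward(completion: str, reference_traits: list) -> float:
    """Score a rollout by how many expected user traits it reflects.

    Keyword overlap stands in for the judge-model call used in the
    real pipeline; the shape of the function is what matters here.
    """
    if not reference_traits:
        return 0.0
    hits = sum(t.lower() in completion.lower() for t in reference_traits)
    return hits / len(reference_traits)

score = personalization_reward(
    "Since you prefer concise, vegetarian recipes...",
    ["concise", "vegetarian"],
)
# → 1.0
```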
Before starting RL, you must deploy the reward model (we use Qwen3-30B-A3B-Instruct-2507) via vLLM:
# 1. Deploy the Reward Model Server
MASTER_PORT=29501 CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server \
--model /path/to/Qwen3-30B-A3B-Instruct-2507 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8080 \
--trust-remote-code \
--gpu-memory-utilization 0.9
# (Optional) Test the reward server connection
python ./train/ms-swift-main/scripts/qwen_server.py
# 2. Launch RL Training
cd ./train/ms-swift-main
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 sh scripts/rl.sh

Interactive Gradio Demo
We provide an interactive Gradio-based playground to explore the "R3" process (Proactive Remembering, Multi-step Reasoning, and Personality-based Response Alignment). The UI allows you to visualize the agent's internal cognitive steps.
Launch Locally
To start the demo, ensure your model weights are in ./checkpoints/rl and run:
python gradio_demo.py

Citation
If you find our work helpful, please cite:
@inproceedings{nie2026personavlm,
title={PersonaVLM: Long-Term Personalized Multimodal LLMs},
author={Nie, Chang and Fu, Chaoyou and Zhang, Yifan and Yang, Haihua and Shan, Caifeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}




