💫 Awesome-Token-Merge-for-MLLMs
Welcome to Awesome-Token-Merge-for-MLLMs.
If you know of any related papers that are not included in this list, please let us know via Issues!
If this repository has been helpful to you, please consider giving it a ⭐️. Your support helps us reach more researchers and contributes to the growth of this resource. Thank you!
📜 Introduction
We summarize awesome token merge / reduce / resample / drop methods for the vision side of multi-modal large language models (MLLMs).
We no longer update this repo; you can find more recent related works in this awesome repo.
The token merge, reduce, drop, and resample methods below are listed in chronological order.
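To make the scope concrete, here is a minimal sketch of similarity-based token merging, in the spirit of ToMe-style approaches. It is a generic toy, not the algorithm of any specific paper below: the function name, the pairwise cosine-similarity pairing rule, and the plain averaging step are all our own simplifying assumptions.

```python
# Toy similarity-based token merging (ToMe-style spirit).
# The pairing rule and averaging below are simplifying assumptions,
# not the exact algorithm of any paper in this list.
import torch


def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most redundant tokens into their nearest neighbors.

    x: (num_tokens, dim) visual token features from a ViT layer.
    r: number of tokens to remove by merging.
    """
    # Pairwise cosine similarity between tokens.
    feats = x / x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    sim = feats @ feats.T
    sim.fill_diagonal_(float("-inf"))  # a token cannot merge with itself

    # Each token's most similar partner; highly similar pairs are redundant.
    best_sim, best_idx = sim.max(dim=-1)
    merge_src = best_sim.topk(r).indices  # candidates to merge away

    x = x.clone()
    keep = torch.ones(x.size(0), dtype=torch.bool)
    for src in merge_src.tolist():
        dst = best_idx[src].item()
        if keep[src] and keep[dst]:
            x[dst] = (x[dst] + x[src]) / 2  # average the merged pair
            keep[src] = False
    return x[keep]  # up to r fewer tokens than the input
```

The papers below differ mainly in where such a step runs (inside the vision encoder, in the projector, or inside the LLM) and in how token similarity or importance is scored.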
📖 Related Papers
Baseline
- Visual Instruction Tuning
  Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
  NeurIPS'2023 (oral) [Paper] [Code]
- Honeybee: Locality-enhanced Projector for Multimodal LLM
  Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
  CVPR'2024 [Paper] [Code]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
  ICML'2023 [Paper] [Code]
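These baselines span the two dominant projector designs: LLaVA forwards every patch token through a simple projection, while BLIP-2's Q-Former resamples the patch grid into a small, fixed set of learnable queries. Below is a single-layer sketch of that query-resampling idea; real Q-Formers stack many transformer blocks and add text conditioning, and the dimensions here are illustrative assumptions.

```python
# Single-layer sketch of learnable-query resampling (Q-Former spirit).
# Dimensions and the one-block design are illustrative assumptions.
import torch
import torch.nn as nn


class QueryResampler(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 12):
        super().__init__()
        # A small, fixed set of learnable query tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, dim), e.g. 576 patch tokens.
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to the full patch grid and summarize it.
        out, _ = self.attn(q, visual_feats, visual_feats)
        return self.norm(out)  # (batch, num_queries, dim)
```

With `num_queries=32`, the LLM sees 32 visual tokens regardless of input resolution, whereas a LLaVA-style linear projector forwards all patch tokens; trimming that full token stream is what most of the papers below target.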
2024.3
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
  Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
  CVPR'2024 [Paper] [Code]
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
  Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
  arXiv'2024 [Paper]
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
  Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
  arXiv'2024 [Paper] [Code]
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
  Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
  ECCV'2024 (oral) [Paper] [Code]
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
  Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
  arXiv'2024 [Paper] [Code]
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
  Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
  arXiv'2024 [Paper] [Code]
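Several entries above prune rather than merge: for example, FastV ("An Image is Worth 1/2 Tokens After Layer 2") ranks visual tokens by the attention they receive in an early LLM layer and drops the rest at inference time, training-free. The sketch below illustrates that general recipe; the head/query averaging rule and the keep-ratio interface are simplified assumptions, not any paper's exact criterion.

```python
# Rough training-free, attention-guided token dropping.
# The scoring rule here is a simplified assumption, not the exact
# criterion of FastV or any other listed method.
import torch


def select_visual_tokens(attn: torch.Tensor, visual_slice: slice,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """Return indices (within the visual span) of tokens to retain.

    attn: (num_heads, seq_len, seq_len) attention weights of one layer.
    visual_slice: positions of the visual tokens in the full sequence.
    """
    # Attention each token receives, averaged over heads and queries.
    received = attn.mean(dim=0).mean(dim=0)
    received = received[visual_slice]

    k = max(1, int(keep_ratio * received.numel()))
    keep = received.topk(k).indices
    return keep.sort().values  # preserve the original token order
```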
2024.5
- DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
  Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
  arXiv'2024 [Paper] [Code]
2024.6
- Efficient Large Multi-modal Models via Visual Context Compression
  Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
  NeurIPS'2024 [Paper] [Code]
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
  Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
  arXiv'2024 [Paper] [Code]
2024.7
- TokenPacker: Efficient Visual Projector for Multimodal LLM
  Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang
  arXiv'2024 [Paper] [Code]
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
  Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie
  arXiv'2024 [Paper] [Code]
2024.8
- HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
  Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
  arXiv'2024 [Paper]
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
  Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
  arXiv'2024 [Paper]
2024.9
- Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
  Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu
  arXiv'2024 [Paper]
- TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
  Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
  arXiv'2024 [Paper]
- Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
  Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
  arXiv'2024 [Paper] [Code]
2024.10
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
  Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
  arXiv'2024 [Paper] [Code]
- Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
  Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
  arXiv'2024 [Paper]
- Retrieval Replace Reduction: An effective visual token reduction method via semantic match
  Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li
  arXiv'2024 [Paper]
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
  Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi
  arXiv'2024 [Paper]
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
  Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
  arXiv'2024 [Paper] [Code]
2024.11
- Inference Optimal VLMs Need Only One Visual Token but Larger Models
  Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
  arXiv'2024 [Paper] [Code]
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
  Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni
  NeurIPS'2024 (Spotlight) [Paper] [Code]
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
  Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang
  arXiv'2024 [Paper] [Code]
- FoPru: Focal Pruning for Efficient Large Vision-Language Models
  Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu
  arXiv'2024 [Paper]
- FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
  Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
  arXiv'2024 [Paper]
- LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
  Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia
  arXiv'2024 [Paper]
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
  Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
  arXiv'2024 [Paper] [Code]
- freePruner: A Training-free Approach for Large Multimodal Model Acceleration
  Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan
  arXiv'2024 [Paper]
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
  Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
  arXiv'2024 [Paper]
2024.12
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
  Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang
  arXiv'2024 [Paper] [Code]
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
  Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
  arXiv'2024 [Paper]
- Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
  Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
  arXiv'2024 [Paper] [Code]
- Negative Token Merging: Image-based Adversarial Feature Guidance
  Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
  arXiv'2024 [Paper] [Code]
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
  Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
  arXiv'2024 [Paper] [Code]
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
  Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
  arXiv'2024 [Paper] [Code]
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
  arXiv'2024 [Paper] [Code]
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
  Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
  arXiv'2024 [Paper] [Code]
- iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
  Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng
  arXiv'2024 [Paper] [Code]
- Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
  Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
  arXiv'2024 [Paper]
- DocVLM: Make Your VLM an Efficient Reader
  Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman
  arXiv'2024 [Paper]
- LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
  Ke Wang, Hong Xuan
  arXiv'2024 [Paper]
- Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
  Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
  arXiv'2024 [Paper]
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
  Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
  arXiv'2024 [Paper] [Code]
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
  Mark Endo, Xiaohan Wang, Serena Yeung-Levy
  arXiv'2024 [Paper] [Code] (FEATHER)
- FastVLM: Efficient Vision Encoding for Vision Language Models
  Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
  arXiv'2024 [Paper]
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models
  Xiaohu Huang, Hao Zhou, Kai Han
  arXiv'2024 [Paper] [Code]
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
  Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
  arXiv'2024 [Paper] [Code]
2025.1
- FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
  Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
  arXiv'2024 [Paper] [Code]
- What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
  Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
  arXiv'2024 [Paper] [Code]
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
  arXiv'2024 [Paper] [Code]
- Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
  Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
  arXiv'2024 [Paper] [Code]