FengheTan9/Mobile-U-ViT
[ACM MM 2025] Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
Chengqi Dong1,2, Jie Yang3, Wei Liu3, S. Kevin Zhou1,2
2 Suzhou Institute for Advanced Research, University of Science and Technology of China
3 School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
4 School of Computer Science and Technology, Harbin Institute of Technology
5 State Grid Hunan Electric Power Corporation Limited Research Institute
News
- Mobile U-ViT has been accepted by ACM MM'25
- Paper and code released!
Abstract
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient, powerful, and generalizable solution for mobile medical image analysis.
Results:
Quick Start
1. Environment
- GPU: NVIDIA GeForce RTX 4090
- PyTorch: 1.13.0 (CUDA 11.7)
- cudatoolkit: 11.7.1
- scikit-learn: 1.0.2
- albumentations: 1.2.0
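The versions above can be installed with pip (a sketch; the package pins match the list above, but the PyTorch extra-index URL for CUDA 11.7 wheels is an assumption worth checking against the official install matrix):

```shell
# PyTorch 1.13.0 built against CUDA 11.7
pip install torch==1.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
# Remaining dependencies from the environment list
pip install scikit-learn==1.0.2 albumentations==1.2.0
```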
2. Datasets
Place the BUSI dataset (or your own dataset) following the directory structure below.
```
├── Mobile-U-ViT
│   ├── data
│   │   ├── busi
│   │   │   ├── images
│   │   │   │   ├── benign (10).png
│   │   │   │   ├── malignant (17).png
│   │   │   │   └── ...
│   │   │   └── masks
│   │   │       └── 0
│   │   │           ├── benign (10).png
│   │   │           ├── malignant (17).png
│   │   │           └── ...
│   │   └── your dataset
│   │       ├── images
│   │       │   ├── 0a7e06.png
│   │       │   └── ...
│   │       └── masks
│   │           └── 0
│   │               ├── 0a7e06.png
│   │               └── ...
│   ├── dataloader
│   ├── network
│   ├── utils
│   ├── main.py
│   └── split.py
```
3. 2D Training & Validation
You can first split your dataset:
```shell
python split.py --dataset_name busi --dataset_root ./data
```

Then, train and validate:

```shell
python main.py --model ["mobileuvit", "mobileuvit_l"] --base_dir ./data/busi --train_file_dir busi_train.txt --val_file_dir busi_val.txt
```

4. 3D Training & Validation
The downstream 3D pipeline follows UNETR.
```python
# An example of training on BTCV (num_classes=14)
from network.MobileUViT_3D import mobileuvit_l

model = mobileuvit_l(inch=1, out_channel=14).cuda()
```

Acknowledgements
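For multi-organ targets like BTCV, the integer label volume is typically expanded to one-hot form before computing losses such as Dice. A minimal NumPy sketch (the `to_one_hot` helper and the shapes are illustrative, not part of this repo):

```python
import numpy as np


def to_one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Expand an integer label volume (D, H, W) into one-hot channels (C, D, H, W)."""
    out = np.zeros((num_classes,) + labels.shape, dtype=np.float32)
    for c in range(num_classes):
        out[c] = (labels == c)
    return out


# BTCV segments 14 classes (background + 13 abdominal organs)
seg = np.random.randint(0, 14, size=(32, 64, 64))
print(to_one_hot(seg, 14).shape)  # (14, 32, 64, 64)
```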
This code uses helper functions from CMUNeXt.
Citation
If the code, paper, or weights help your research, please cite:
@inproceedings{tang2025mobile,
title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
pages={3408--3417},
year={2025}
}
@article{tang2025mobile,
title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
journal={arXiv preprint arXiv:2508.01064},
year={2025}
}
License
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.




