FengheTan9/Mobile-U-ViT
[ACM MM 2025] Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
Chengqi Dong1,2, Jie Yang3, Wei Liu3, S. Kevin Zhou1,2
2 Suzhou Institute for Advanced Research, University of Science and Technology of China
3 School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
4 School of Computer Science and Technology, Harbin Institute of Technology
5 State Grid Hunan Electric Power Corporation Limited Research Institute
News
- Mobile U-ViT has been accepted by ACM MM'25
- Paper and code released!
Abstract
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient, powerful, and generalizable solution for mobile medical image analysis.
Results:
Quick Start
1. Environment
- GPU: NVIDIA GeForce RTX 4090
- PyTorch: 1.13.0 (CUDA 11.7)
- cudatoolkit: 11.7.1
- scikit-learn: 1.0.2
- albumentations: 1.2.0
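The versions above can be installed with pip (a sketch; the package pins match the list above, but the PyTorch extra-index URL for CUDA 11.7 wheels is an assumption worth checking against the official install matrix):

```shell
# PyTorch 1.13.0 built against CUDA 11.7
pip install torch==1.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
# Remaining dependencies from the environment list
pip install scikit-learn==1.0.2 albumentations==1.2.0
```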
2. Datasets
Place the BUSI dataset (or your own dataset) following the directory structure below.
```
├── Mobile-U-ViT
│   ├── data
│   │   ├── busi
│   │   │   ├── images
│   │   │   │   ├── benign (10).png
│   │   │   │   ├── malignant (17).png
│   │   │   │   └── ...
│   │   │   └── masks
│   │   │       └── 0
│   │   │           ├── benign (10).png
│   │   │           ├── malignant (17).png
│   │   │           └── ...
│   │   └── your dataset
│   │       ├── images
│   │       │   ├── 0a7e06.png
│   │       │   └── ...
│   │       └── masks
│   │           └── 0
│   │               ├── 0a7e06.png
│   │               └── ...
│   ├── dataloader
│   ├── network
│   ├── utils
│   ├── main.py
│   └── split.py
```
3. 2D Training & Validation
You can first split your dataset:
```shell
python split.py --dataset_name busi --dataset_root ./data
```

Then, train and validate:

```shell
python main.py --model ["mobileuvit", "mobileuvit_l"] --base_dir ./data/busi --train_file_dir busi_train.txt --val_file_dir busi_val.txt
```

4. 3D Training & Validation
The downstream 3D pipeline follows UNETR.
```python
# An example of training on BTCV (num_classes=14)
from network.MobileUViT_3D import mobileuvit_l

model = mobileuvit_l(inch=1, out_channel=14).cuda()
```

Acknowledgements
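For multi-organ targets like BTCV, the integer label volume is typically expanded to one-hot form before computing losses such as Dice. A minimal NumPy sketch (the `to_one_hot` helper and the shapes are illustrative, not part of this repo):

```python
import numpy as np


def to_one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Expand an integer label volume (D, H, W) into one-hot channels (C, D, H, W)."""
    out = np.zeros((num_classes,) + labels.shape, dtype=np.float32)
    for c in range(num_classes):
        out[c] = (labels == c)
    return out


# BTCV segments 14 classes (background + 13 abdominal organs)
seg = np.random.randint(0, 14, size=(32, 64, 64))
print(to_one_hot(seg, 14).shape)  # (14, 32, 64, 64)
```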
This code uses helper functions from CMUNeXt.
Citation
If the code, paper, or weights help your research, please cite:
@inproceedings{tang2025mobile,
title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
pages={3408--3417},
year={2025}
}
@article{tang2025mobile,
title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
journal={arXiv preprint arXiv:2508.01064},
year={2025}
}
License
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.




