

Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation


Fenghe Tang, Bingkun Nian, Jianrui Ding, Wenxin Ma, Quan Quan,
Chengqi Dong, Jie Yang, Wei Liu, S. Kevin Zhou



News

  • Mobile U-ViT accepted by ACM MM'25 πŸ₯°
  • Paper and code released! 😎

Abstract

In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient, powerful, and generalizable solution for mobile medical image analysis.


Results:

[Result figures]

Quick Start

1. Environment

  • GPU: NVIDIA GeForce RTX 4090
  • PyTorch: 1.13.0 (CUDA 11.7)
  • cudatoolkit: 11.7.1
  • scikit-learn: 1.0.2
  • albumentations: 1.2.0

2. Datasets

Please organize the BUSI dataset, or your own dataset, according to the following directory structure.

└── Mobile-U-ViT
    β”œβ”€β”€ data
        β”œβ”€β”€ busi
            β”œβ”€β”€ images
            β”‚   β”œβ”€β”€ benign (10).png
            β”‚   β”œβ”€β”€ malignant (17).png
            β”‚   β”œβ”€β”€ ...
            β”‚
            └── masks
                β”œβ”€β”€ 0
                β”‚   β”œβ”€β”€ benign (10).png
                β”‚   β”œβ”€β”€ malignant (17).png
                β”‚   β”œβ”€β”€ ...
        β”œβ”€β”€ your dataset
            β”œβ”€β”€ images
            β”‚   β”œβ”€β”€ 0a7e06.png
            β”‚   β”œβ”€β”€ ...
            β”‚
            └── masks
                β”œβ”€β”€ 0
                β”‚   β”œβ”€β”€ 0a7e06.png
                β”‚   β”œβ”€β”€ ...
    β”œβ”€β”€ dataloader
    β”œβ”€β”€ network
    β”œβ”€β”€ utils
    β”œβ”€β”€ main.py
    └── split.py
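Given this layout, each image is matched to the same-named file under masks/0. A small sketch of that pairing logic (our own illustration, not part of the repository; the helper name and the .png assumption are ours):

```python
from pathlib import Path

def pair_images_and_masks(dataset_root):
    """Pair each file in images/ with the same-named file in masks/0/."""
    root = Path(dataset_root)
    pairs = []
    for img in sorted((root / "images").glob("*.png")):
        mask = root / "masks" / "0" / img.name
        if mask.exists():  # skip images with no corresponding mask
            pairs.append((img, mask))
    return pairs
```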

3. 2D Training & Validation

You can first split your dataset:

python split.py --dataset_name busi --dataset_root ./data
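The actual splitting logic lives in split.py; the core idea can be sketched as follows (our own sketch, with an assumed validation ratio and seed, not the script's exact implementation):

```python
import random

def split_names(names, val_ratio=0.2, seed=41):
    """Deterministically shuffle sample names and split into (train, val)."""
    names = sorted(names)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_ratio)
    return names[n_val:], names[:n_val]
```

The script then writes the two name lists to files such as busi_train.txt and busi_val.txt, which the training command below consumes.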

Then, train and validate (choose either mobileuvit or mobileuvit_l as the model):

python main.py --model mobileuvit --base_dir ./data/busi --train_file_dir busi_train.txt --val_file_dir busi_val.txt
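Validation quality for binary segmentation is commonly reported as a Dice score. The repository's exact metric code may differ, but the underlying computation is:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |pred ∩ target| / (|pred| + |target|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```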

4. 3D Training & Validation

The 3D downstream pipeline follows UNETR.

# An example of Training on BTCV (num_classes=14)
from network.MobileUViT_3D import mobileuvit_l

model = mobileuvit_l(inch=1, out_channel=14).cuda()


Acknowledgements

This code uses helper functions from CMUNeXt.

Citation

If our code, paper, or weights help your research, please cite:

@inproceedings{tang2025mobile,
  title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
  author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={3408--3417},
  year={2025}
}

@article{tang2025mobile,
  title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
  author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
  journal={arXiv preprint arXiv:2508.01064},
  year={2025}
}

License

This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.
