CGD-MAE
CLIP Distillation-Driven Pre-training Framework for Vehicle Re-Identification
by Eurico Almeida, Bruno Silva, Alexandre Marques, Pedro Ferreira and Jorge Batista at the Institute of Systems and Robotics and the Dept. of Electrical and Computer Engineering, University of Coimbra, Portugal.
This work introduces a CLIP-guided Masked Autoencoder (CGD-MAE) pre-training strategy designed to enhance the performance of existing ViT-based architectures for vehicle re-identification (V-ReID). We propose a simple yet effective ViT backbone pre-training approach that prioritizes data quality, quantity, and diversity, leveraging Automobile1M—a novel large-scale curated vehicle dataset derived from publicly available sources. Using automatic data curation, we select one million diverse samples from a large pool of vehicle images, addressing long-tailed distributions and improving backbone performance. Moreover, a modified global-context semantic distillation from large CLIP models further emphasizes the impact of dataset curation. Pre-training CGD-MAE on Automobile1M has proven beneficial in enhancing the performance of state-of-the-art (SoTA) ViT-based model architectures for V-ReID and various downstream vehicle-centric applications. These results highlight its strong potential as a universal vehicle-specific pre-training strategy, enhancing feature learning and adaptability across a wide range of vehicle-related tasks.
This is a PyTorch implementation of the CGD-MAE paper. The code is based on the original MAE repository.
Models
The encoder of CGD-MAE exactly matches that of MAE, so CGD-MAE checkpoints can be used as drop-in replacements. We also encourage you to try CGD-MAE checkpoints in your vehicle-centric downstream tasks. These models are trained on Automobile1M for 400 epochs with a ViT-B/16 backbone at a 224×224 image size.
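Because the encoder matches MAE, a pre-trained checkpoint can be loaded into any MAE-compatible ViT by discarding the decoder weights before fine-tuning. A minimal sketch of that filtering step (the `"model"` entry and key names follow the original MAE repo's checkpoint layout; treat those details as assumptions):

```python
def extract_encoder_state(checkpoint):
    """Keep only encoder weights from an MAE-style checkpoint.

    MAE checkpoints store the full autoencoder; for downstream
    fine-tuning only the encoder is needed, so decoder weights
    and the mask token are dropped.
    """
    state = checkpoint.get("model", checkpoint)
    return {
        k: v for k, v in state.items()
        if not k.startswith("decoder") and k != "mask_token"
    }

# Illustrative checkpoint layout (tensor values omitted):
ckpt = {"model": {
    "patch_embed.proj.weight": "...",
    "blocks.0.attn.qkv.weight": "...",
    "mask_token": "...",
    "decoder_embed.weight": "...",
    "decoder_blocks.0.attn.qkv.weight": "...",
}}
enc = extract_encoder_state(ckpt)
print(sorted(enc))  # encoder keys only
```

The resulting dict can then be passed to a ViT-B/16 model's `load_state_dict(..., strict=False)`.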
| Method | Checkpoint |
|---|---|
| MAE | |
| CGD-MAE | |
Pre-training CGD-MAE
```
torchrun --nproc_per_node=2 main_pretrain.py \
    --batch_size 256 \
    --model mae_vit_base_patch16 \
    --mask_ratio 0.75 \
    --epochs 400 \
    --warmup_epochs 40 \
    --blr 1.5e-4 \
    --weight_decay 0.05 \
    --accum_iter 8 \
    --norm_pix_loss
```

Fine-tuning CGD-MAE
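For reference, here is how the effective batch size and absolute learning rate follow from the pre-training flags, assuming `main_pretrain.py` keeps the original MAE repo's linear scaling rule (`lr = blr × effective_batch / 256`); if the script changes this rule, the numbers below do not apply:

```python
# Effective batch size = GPUs × per-GPU batch × gradient accumulation steps.
nproc, batch_size, accum_iter, blr = 2, 256, 8, 1.5e-4
eff_batch = nproc * batch_size * accum_iter
# MAE-style linear LR scaling from the base learning rate (--blr).
lr = blr * eff_batch / 256
print(eff_batch, lr)  # → 4096 0.0024
```

Adjust `--accum_iter` when changing the GPU count to keep the effective batch size (and hence the learning rate) unchanged.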
We validate our pre-training using the ViT-based model on three V-ReID datasets.

Best performance is shown in bold. Results marked with (*) are taken from the corresponding method's original publication.
We benchmark the following vehicle-centric downstream tasks: attribute recognition (VAR), fine-grained classification (VFR), and part segmentation (VPS).
Citation
```
@INPROCEEDINGS{11084342,
  author={Almeida, Eurico and Silva, Bruno and Marques, Alexandre and Ferreira, Pedro and Batista, Jorge},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  title={CGD-MAE: Clip Distillation-Driven Pre-Training Framework for Vehicle Re-Identification},
  year={2025},
  pages={2784-2789},
  keywords={Representation learning;Adaptation models;Heavily-tailed distribution;Codes;Image processing;Data integrity;Semantics;Autoencoders;Vehicle Re-Identification;Masked autoencoders;Knowledge Distillation;CLIP},
  doi={10.1109/ICIP55913.2025.11084342}}
```

