CGD-MAE
CLIP Distillation-Driven Pre-training Framework for Vehicle Re-Identification
by Eurico Almeida, Bruno Silva, Alexandre Marques, Pedro Ferreira and Jorge Batista at the Institute of Systems and Robotics and the Dept. of Electrical and Computer Engineering, University of Coimbra, Portugal.
This work introduces a CLIP-guided Masked Autoencoder (CGD-MAE) pre-training strategy designed to enhance the performance of existing ViT-based architectures for vehicle re-identification (V-ReID). We propose a simple yet effective ViT backbone pre-training approach that prioritizes data quality, quantity, and diversity, leveraging Automobile1M—a novel large-scale curated vehicle dataset derived from publicly available sources. Using automatic data curation, we select one million diverse samples from a large pool of vehicle images, addressing long-tailed distributions and improving backbone performance. Moreover, a modified global-context semantic distillation from large CLIP models further emphasizes the impact of dataset curation. Pre-training CGD-MAE on Automobile1M has proven beneficial in enhancing the performance of state-of-the-art (SoTA) ViT-based model architectures for V-ReID and various downstream vehicle-centric applications. These results highlight its strong potential as a universal vehicle-specific pre-training strategy, enhancing feature learning and adaptability across a wide range of vehicle-related tasks.
This is a PyTorch implementation of the CGD-MAE paper. The code is based on the original MAE repository.
Models
The encoder of CGD-MAE exactly matches that of MAE, so CGD-MAE checkpoints can be used as drop-in replacements. We also encourage you to try CGD-MAE checkpoints in your vehicle-centric downstream tasks. These models are trained on Automobile1M for 400 epochs with a ViT-B/16 backbone at a 224×224 image size.
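Because the encoder matches MAE, a pre-trained checkpoint can be loaded into any MAE-compatible ViT by discarding the decoder weights before fine-tuning. A minimal sketch of that filtering step (the `"model"` entry and key names follow the original MAE repo's checkpoint layout; treat those details as assumptions):

```python
def extract_encoder_state(checkpoint):
    """Keep only encoder weights from an MAE-style checkpoint.

    MAE checkpoints store the full autoencoder; for downstream
    fine-tuning only the encoder is needed, so decoder weights
    and the mask token are dropped.
    """
    state = checkpoint.get("model", checkpoint)
    return {
        k: v for k, v in state.items()
        if not k.startswith("decoder") and k != "mask_token"
    }

# Illustrative checkpoint layout (tensor values omitted):
ckpt = {"model": {
    "patch_embed.proj.weight": "...",
    "blocks.0.attn.qkv.weight": "...",
    "mask_token": "...",
    "decoder_embed.weight": "...",
    "decoder_blocks.0.attn.qkv.weight": "...",
}}
enc = extract_encoder_state(ckpt)
print(sorted(enc))  # encoder keys only
```

The resulting dict can then be passed to a ViT-B/16 model's `load_state_dict(..., strict=False)`.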
| Method | Checkpoint |
|---|---|
| MAE | |
| CGD-MAE | |
Pre-training CGD-MAE
```
torchrun --nproc_per_node=2 main_pretrain.py \
    --batch_size 256 \
    --model mae_vit_base_patch16 \
    --mask_ratio 0.75 \
    --epochs 400 \
    --warmup_epochs 40 \
    --blr 1.5e-4 \
    --weight_decay 0.05 \
    --accum_iter 8 \
    --norm_pix_loss
```

Fine-tuning CGD-MAE
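For reference, here is how the effective batch size and absolute learning rate follow from the pre-training flags, assuming `main_pretrain.py` keeps the original MAE repo's linear scaling rule (`lr = blr × effective_batch / 256`); if the script changes this rule, the numbers below do not apply:

```python
# Effective batch size = GPUs × per-GPU batch × gradient accumulation steps.
nproc, batch_size, accum_iter, blr = 2, 256, 8, 1.5e-4
eff_batch = nproc * batch_size * accum_iter
# MAE-style linear LR scaling from the base learning rate (--blr).
lr = blr * eff_batch / 256
print(eff_batch, lr)  # → 4096 0.0024
```

Adjust `--accum_iter` when changing the GPU count to keep the effective batch size (and hence the learning rate) unchanged.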
We validate our pre-training using the ViT-based model on three V-ReID datasets.

Best performance is shown in bold. Results marked with (*) are taken from the corresponding method's original publication.
We benchmark the following vehicle-centric downstream tasks: attribute recognition (VAR), fine-grained classification (VFR), and part segmentation (VPS).
Citation
```
@INPROCEEDINGS{11084342,
  author={Almeida, Eurico and Silva, Bruno and Marques, Alexandre and Ferreira, Pedro and Batista, Jorge},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  title={CGD-MAE: Clip Distillation-Driven Pre-Training Framework for Vehicle Re-Identification},
  year={2025},
  pages={2784-2789},
  keywords={Representation learning;Adaptation models;Heavily-tailed distribution;Codes;Image processing;Data integrity;Semantics;Autoencoders;Vehicle Re-Identification;Masked autoencoders;Knowledge Distillation;CLIP},
  doi={10.1109/ICIP55913.2025.11084342}}
```

