AlexMaks02/CGD-MAE

CGD-MAE

CLIP Distillation-Driven Pre-Training Framework for Vehicle Re-Identification
by Eurico Almeida, Bruno Silva, Alexandre Marques, Pedro Ferreira, and Jorge Batista, Institute of Systems and Robotics and Department of Electrical and Computer Engineering, University of Coimbra, Portugal.

PAPER

This work introduces a CLIP-guided Masked Autoencoder (CGD-MAE) pre-training strategy designed to enhance the performance of existing ViT-based architectures for vehicle re-identification (V-ReID). We propose a simple yet effective ViT backbone pre-training approach that prioritizes data quality, quantity, and diversity, leveraging Automobile1M—a novel large-scale curated vehicle dataset derived from publicly available sources. Using hierarchical k-means clustering for automatic data curation, we select one million diverse samples from a large pool of vehicle images, addressing long-tailed distributions and improving backbone performance. Moreover, a modified global-context semantic distillation from large CLIP models further emphasizes the impact of dataset curation. Pre-training CGD-MAE on Automobile1M has proven to be beneficial in enhancing performance in state-of-the-art (SoTA) ViT-based model architectures for V-ReID and various downstream vehicle-centric applications. These results highlight its strong potential as a universal vehicle-specific pre-training strategy, enhancing feature learning and adaptability across a wide range of vehicle-related tasks.
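The global-context semantic distillation can be pictured as aligning the MAE encoder's global feature with a frozen CLIP image embedding. A minimal sketch, assuming a cosine-similarity objective and matching embedding widths (the paper's exact loss, token choice, and any projection head are not specified here):

```python
import torch
import torch.nn.functional as F

def global_distill_loss(student_tokens: torch.Tensor,
                        clip_feat: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss between the student's global (CLS) feature and a
    frozen CLIP image embedding. Hypothetical formulation for illustration;
    in practice a linear projection may be needed when widths differ."""
    student_global = F.normalize(student_tokens[:, 0], dim=-1)  # (B, D)
    clip_target = F.normalize(clip_feat, dim=-1)                # (B, D)
    return (1.0 - (student_global * clip_target).sum(dim=-1)).mean()

# Toy shapes: batch 4, 196 patch tokens + CLS, width 512.
tokens = torch.randn(4, 197, 512)   # stands in for MAE encoder output
target = torch.randn(4, 512)        # stands in for frozen CLIP features
loss = global_distill_loss(tokens, target)
```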

Vehicle Dataset data curation and text description pipeline
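The hierarchical k-means curation can be sketched as two-level clustering followed by balanced per-leaf sampling, which is how it counters long-tailed distributions: head clusters cannot dominate the selection. Everything below (cluster counts, samples per leaf, the plain Lloyd's-iteration k-means) is an illustrative placeholder, not the paper's actual pipeline:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's-iteration k-means on row vectors (toy implementation)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels, centers

def curate(features, k1=10, k2=5, per_cluster=2, seed=0):
    """Two-level (hierarchical) clustering, then an equal sample budget per
    leaf, so that rare vehicle appearances survive the down-selection."""
    rng = np.random.default_rng(seed)
    top, _ = kmeans(features, k1, seed=seed)
    selected = []
    for c in range(k1):
        idx = np.where(top == c)[0]
        if len(idx) == 0:
            continue
        k = min(k2, len(idx))
        sub, _ = kmeans(features[idx], k, seed=seed)
        for s in range(k):
            leaf = idx[sub == s]
            if len(leaf):
                take = min(per_cluster, len(leaf))
                selected.extend(rng.choice(leaf, take, replace=False))
    return sorted(set(int(i) for i in selected))
```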

This is a PyTorch implementation of the CGD-MAE paper. The code is based on the original MAE repo.

Models

The encoder of CGD-MAE matches the original MAE exactly, so the checkpoints below are drop-in replacements for MAE weights. We encourage you to try the CGD-MAE checkpoints in your vehicle-centric downstream tasks. Both models are ViT-B/16, pre-trained on Automobile1M for 400 epochs at a 224×224 image size.

| Method  | Checkpoint |
|---------|------------|
| MAE     | checkpoint |
| CGD-MAE | checkpoint |

Pre-training CGD-MAE

torchrun --nproc_per_node=2 main_pretrain.py \
    --batch_size 256 --model mae_vit_base_patch16 --mask_ratio 0.75 \
    --epochs 400 --warmup_epochs 40 --blr 1.5e-4 --weight_decay 0.05 \
    --accum_iter 8 --norm_pix_loss
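With these flags, two GPUs, a per-GPU batch of 256, and 8 accumulation steps give an effective batch of 4096. Assuming the original MAE repo's linear scaling convention carries over (absolute lr = blr × effective batch / 256), the run's learning rate works out as:

```python
def mae_effective_lr(blr: float, batch_size: int, accum_iter: int, n_gpus: int):
    """MAE-style linear lr scaling (convention from the original MAE repo,
    assumed to apply here): lr = blr * effective_batch / 256."""
    eff_batch = batch_size * accum_iter * n_gpus
    return eff_batch, blr * eff_batch / 256

eff_batch, lr = mae_effective_lr(1.5e-4, 256, 8, 2)
# eff_batch -> 4096, lr -> ~2.4e-3
```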

Fine-tuning CGD-MAE

We validate our pre-training using the TransReID ViT-based model on three datasets: VeRi-776, VehicleID, and VeRi-Wild. Best performance is shown in bold; results marked with (*) are taken from the corresponding methods' references.

Table 2

We benchmark the following vehicle tasks: vehicle attribute recognition (VAR) on VeRi-776 with VTB; vehicle fine-grained classification (VFR) on Stanford Cars with TransFG; and vehicle part segmentation (VPS) on the vehicle-parts subset of PartImageNet with SETR.

Table 3

Citation

@INPROCEEDINGS{11084342,
  author={Almeida, Eurico and Silva, Bruno and Marques, Alexandre and Ferreira, Pedro and Batista, Jorge},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)}, 
  title={CGD-MAE: Clip Distillation-Driven Pre-Training Framework for Vehicle Re-Identification}, 
  year={2025},
  volume={},
  number={},
  pages={2784-2789},
  keywords={Representation learning;Adaptation models;Heavily-tailed distribution;Codes;Image processing;Data integrity;Semantics;Autoencoders;Vehicle Re-Identification;Masked autoencoders;Knowledge Distillation;CLIP},
  doi={10.1109/ICIP55913.2025.11084342}}

Created May 21, 2025
Updated September 23, 2025