SynthCLIP: Are We Ready For a Fully Synthetic CLIP Training?

Hasan Abed Al Kader Hammoud¹* Hani Itani¹* Fabio Pizzati² Philip Torr² Adel Bibi² Bernard Ghanem¹
¹ KAUST, ² University of Oxford

🔥 Stay tuned for updates, and don't forget to star this repo for the latest on SynthCLIP! 🔥

📜 Abstract

We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images.

🚀 Getting Started

Environment Setup

First, let's set up the Conda environment to get you up and running:

Synthesis (originally also "training", struck out because of NumPy 1.x vs 2.x issues for Victor 😖; see the training-only environment below):

conda deactivate && conda env remove -n synthclip -y &&
module load deeplearning/cuda/11.8-cudnn8.8.1 &&
conda create -n synthclip2 python=3.10 -y &&
conda activate synthclip2 &&
pip install https://github.com/vllm-project/vllm/releases/download/v0.3.0/vllm-0.3.0+cu118-cp310-cp310-manylinux1_x86_64.whl &&
# pip uninstall torch -y &&
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 &&
# Perhaps try instead: conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip uninstall xformers -y &&
pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118 &&
pip install -r requirements.txt

Only training:

conda create -n synthclip-train python=3.10 -y &&
conda activate synthclip-train &&
# pip install https://github.com/vllm-project/vllm/releases/download/v0.3.0/vllm-0.3.0+cu118-cp310-cp310-manylinux1_x86_64.whl &&
# pip uninstall torch -y &&
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 &&
# Perhaps try instead: conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip uninstall xformers -y &&
pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118 &&
pip install -r requirements.txt

Debugging webdataset (wds) version issues:

# import webdataset as wds; print(wds.__version__)
# '0.2.100'
pip install 'webdataset>=0.2.5,<=0.2.86'
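
Version strings do not sort numerically when compared as text: as strings, '0.2.100' looks smaller than '0.2.86'. As a minimal sketch, the installed version can be checked against the pinned range as shown below. The helper names are hypothetical and not part of this repo, and the parser assumes plain dotted-integer versions.

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted version such as '0.2.86' into a tuple of ints.

    Assumption: every component is numeric (no 'rc1' or 'post1' suffixes).
    """
    return tuple(int(part) for part in text.split("."))


def version_in_range(version, low="0.2.5", high="0.2.86"):
    """Return True if low <= version <= high, compared numerically."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


def installed_webdataset_ok():
    """Check the installed webdataset; return None if it is not installed."""
    try:
        return version_in_range(metadata.version("webdataset"))
    except metadata.PackageNotFoundError:
        return None
```

With the version printed above, version_in_range("0.2.100") is False, so that install falls outside the pinned range.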


📁 Project Structure and Execution Order

Our project is organized into three main folders, each dedicated to a specific stage in the SynthCLIP pipeline. Inside each folder, you'll find a detailed README.md file that provides instructions on how to run the code for that stage.

Folders and Their Functions:

TextGen: This folder contains all the necessary code to generate synthetic text data. Begin here to start the pipeline process.
ImageGen: After generating the text, move on to this folder. It uses the synthetic text data to generate corresponding synthetic images.
Training: The final stage of the pipeline. Once you have your synthetic text-image pairs, this folder contains the code to train the SynthCLIP model.

Pipeline Overview:

To successfully use SynthCLIP, follow the pipeline in the order mentioned:

Generate Text ➡️ Start with the TextGen folder.
Generate Images ➡️ Proceed to ImageGen with your synthetic text.
Train the Model ➡️ Finally, use the Training folder to train SynthCLIP with your synthetic text-image pairs.
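
Before starting, it can help to confirm that all three stage folders and their README.md files are present in your checkout. The following is only a sketch: the folder names come from the list above, while the helper names missing_stage_readmes and print_pipeline are hypothetical and not part of this repo.

```python
from pathlib import Path

# Stage folders in execution order, as listed in this README.
STAGES = ["TextGen", "ImageGen", "Training"]


def missing_stage_readmes(repo_root):
    """Return the stages (in pipeline order) whose README.md is not found."""
    root = Path(repo_root)
    return [s for s in STAGES if not (root / s / "README.md").is_file()]


def print_pipeline(repo_root="."):
    """Print each stage in order, with the README to follow and its status."""
    for step, stage in enumerate(STAGES, start=1):
        readme = Path(repo_root) / stage / "README.md"
        status = "found" if readme.is_file() else "missing"
        print(f"{step}. {stage}: {readme} ({status})")
```

Calling print_pipeline() from the repository root lists the stages in the order they should be run.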

🤗 SynthCI-30M Dataset Download

Our dataset, SynthCI-30M, contains 30M image-caption pairs and is hosted on Hugging Face. To download it with the Hugging Face CLI, first make sure huggingface-cli is installed:

pip install -U "huggingface_hub[cli]"

The dataset can then be downloaded with huggingface-cli download hammh0a/SynthCLIP --repo-type dataset.

Alternatively, load the dataset with the Hugging Face datasets library in Python:

from datasets import load_dataset
dataset = load_dataset('hammh0a/SynthCLIP')
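
For a dataset of this size, you may want to inspect a few records before downloading everything. The sketch below is not the authors' workflow. It relies on the streaming=True option of load_dataset, and it has not been verified here whether the hosted files can be streamed this way or which splits and fields they expose. The helper names peek and stream_synthci are hypothetical.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


def stream_synthci(repo_id="hammh0a/SynthCLIP"):
    """Open the dataset lazily instead of downloading it up front.

    Assumption: the repo is compatible with streaming mode.
    """
    # Imported here so that peek() stays usable without the datasets package.
    from datasets import load_dataset

    # With streaming=True and no split argument, this returns a dict-like
    # object mapping split names to iterable datasets.
    return load_dataset(repo_id, streaming=True)
```

If streaming works for this repo, iterating over stream_synthci().items() and calling peek(split, 1) on each split would show one record per split.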

📦 Trained Models

Jumpstart your experiments with our pre-trained models:

ViT-B/16 Trained on SynthCI-10M ➡️ Download
ViT-B/16 Trained on SynthCI-20M ➡️ Download
ViT-B/16 Trained on SynthCI-30M ➡️ Download
ViT-B/16 Trained on CC12M ➡️ Download

You can load a pre-trained checkpoint with the code below:

from models import CLIP_VITB16
import torch

# load model instance
model = torch.nn.DataParallel(CLIP_VITB16())

# load checkpoint
checkpoint_path = "./checkpoint_best.pt"
checkpoint = torch.load(checkpoint_path, map_location="cpu")
load_status = model.load_state_dict(checkpoint["state_dict"])

print(load_status)
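
The snippet above wraps the model in torch.nn.DataParallel before loading, which suggests that the checkpoint keys carry the "module." prefix that DataParallel adds. This is an inference from the loading code and has not been checked against the checkpoint files. Under that assumption, one way to load the weights into an unwrapped model is to strip the prefix first. strip_module_prefix is a hypothetical helper, not part of this repo.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from keys.

    Keys that do not start with the prefix are kept unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Hypothetical usage, assuming the checkpoint layout described above:
#   model = CLIP_VITB16()
#   checkpoint = torch.load("./checkpoint_best.pt", map_location="cpu")
#   model.load_state_dict(strip_module_prefix(checkpoint["state_dict"]))
```

The helper only touches dictionary keys, so the tensors themselves are passed through unchanged.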

📖 Citation

If you find SynthCLIP useful in your research, please consider citing:

@misc{hammoud2024synthclip,
      title={SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?},
      author={Hasan Abed Al Kader Hammoud and Hani Itani and Fabio Pizzati and Philip Torr and Adel Bibi and Bernard Ghanem},
      year={2024},
      eprint={2402.01832},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
