This work presents SuperClass, a super simple classification method for vision-language pre-training on image-text data. SuperClass does not require a text encoder; instead, it uses the tokenized raw text directly as supervised classification labels, with no additional text filtering or selection.
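The core recipe is easy to sketch. The snippet below is a minimal illustration of the idea, not the released training code: the `vision_encoder`, dimensions, and tokenized captions are placeholder assumptions, and it uses a plain multi-label binary cross-entropy over the token vocabulary; the exact loss and any per-token weighting used in the paper should be taken from the configs and code under `opencls`.

```python
# Minimal sketch of the SuperClass idea (illustrative, not the released code):
# the tokenized caption itself serves as the classification target for the image.
import torch
import torch.nn as nn

vocab_size = 32000  # size of the subword tokenizer vocabulary (assumption)
embed_dim = 768     # vision feature dimension (assumption)

# Hypothetical stand-ins for the real vision backbone and classification head.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, embed_dim))
classifier = nn.Linear(embed_dim, vocab_size)  # one "class" per subword token

def caption_to_target(token_ids: torch.Tensor) -> torch.Tensor:
    """Turn a caption's token ids into a multi-hot vector over the vocabulary."""
    target = torch.zeros(vocab_size)
    target[token_ids] = 1.0
    return target

# One training step with a multi-label classification loss.
images = torch.randn(2, 3, 224, 224)           # dummy image batch
token_ids = [torch.tensor([101, 2054, 2003]),  # dummy tokenized captions
             torch.tensor([101, 7592, 999])]
targets = torch.stack([caption_to_target(t) for t in token_ids])

logits = classifier(vision_encoder(images))
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
```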
```bash
git clone https://github.com/x-cls/superclass
cd superclass
pip install -r requirements.txt
```
Download the DataComp-1B and ImageNet-1K datasets; ImageNet-1K is used for validation. You can also train on other image-text pair datasets.
Modify `DATA_PATH` and `VAL_DATA_PATH` in the training scripts `train.sh` and `train_combo.sh` so they point to your local copies of DataComp-1B and ImageNet-1K.
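After editing, the relevant lines in `train.sh` might look like the following (the paths below are placeholders, not defaults shipped with the repo):

```bash
# Example placeholder paths; replace with your local dataset locations.
DATA_PATH=/data/datacomp-1b      # DataComp-1B image-text shards
VAL_DATA_PATH=/data/imagenet-1k  # ImageNet-1K for validation
```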
To start CLIP or SuperClass training, use the following command:
```bash
bash train.sh <config_path> opencls
```
This script navigates to the `opencls` directory and launches training.
If you want to include the LiT training phase, use the following command:
```bash
bash train_combo.sh <cls_config_path> <lit_config_path> opencls
```
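For example, assuming a LiT config exists alongside the CLS configs (the `lit_schedule` path and filename below are hypothetical; substitute one from your checkout):

```bash
# The LiT config path is hypothetical; use an actual file from the repo.
bash train_combo.sh configs/cls_schedule/cls_vit_b16_s1.28B_bs16k.yaml configs/lit_schedule/lit_vit_b16.yaml opencls
```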
CLS training configs are located in `opencls/configs/cls_schedule`.
For example:

```bash
bash train.sh configs/cls_schedule/cls_vit_b16_s1.28B_bs16k.yaml opencls
```
Our codebase is built on OpenCLIP and ViTamin. We thank the OpenCLIP and ViTamin teams for contributing such impressive code and models to the community.
The models & code of SuperClass are released under the Apache-2.0 license.
If you find this project useful, please consider citing:
```bibtex
@inproceedings{superclass_huang,
  title={Classification Done Right for Vision-Language Pre-Training},
  author={Huang, Zilong and Ye, Qinghao and Kang, Bingyi and Feng, Jiashi and Fan, Haoqi},
  booktitle={NeurIPS},
  year={2024}
}
```