yiren-jian / BLIText

[NeurIPS 2023] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
BSD 3-Clause "New" or "Revised" License

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

This repo contains the implementation of BLIP2 with P-Former from the paper Bootstrapping Vision-Language Learning with Decoupled Language Pre-training, which was accepted to NeurIPS 2023. The code is built on the LAVIS project (cloned on Feb 23, 2023).

The main additions are new files in lavis/models/blip2_models.

We also edit lavis/models/base_model.py to allow training from scratch, and add a new dataset and dataloader in lavis/datasets for the sentence-only dataset used to train P-Former.

Installation

conda create -n lavis python=3.8
conda activate lavis
pip install -e .

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

Data Preparation

Please follow the instructions from LAVIS to download the pre-training datasets.

Training

Stage 0 (training P-Former)

bash run_scripts/blip-T/train/pretrain_stage0.sh

Stage 1 (training Q-Former with pre-trained P-Former)

bash run_scripts/blip-T/train/pretrain_stage1.sh

Stage 2 (end-to-end BLIP2 with pre-trained P-Former)

bash run_scripts/blip-T/train/pretrain_stage2.sh

Fine-tuning on MSCOCO

bash run_scripts/blip-T/train/train_caption_coco.sh
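
Since each stage builds on the one before it, it can be convenient to chain the scripts and stop at the first failure. The following is a minimal sketch, not part of the repository: the helper name run_stages is hypothetical, and it only sequences the scripts, assuming any checkpoint paths each stage needs are already set in its own config.

```shell
# Hypothetical helper (not part of this repo): run stage scripts in order
# and stop as soon as one of them exits with a non-zero status.
# Usage: run_stages <script_dir> <stage_name>...
run_stages() {
    local script_dir="$1"
    shift
    local stage
    for stage in "$@"; do
        echo "=== running ${stage} ==="
        # Explicit check so a failure stops the chain even without `set -e`.
        bash "${script_dir}/${stage}.sh" || {
            echo "stage ${stage} failed, stopping" >&2
            return 1
        }
    done
}

# Example, from the repository root:
#   run_stages run_scripts/blip-T/train pretrain_stage0 pretrain_stage1 pretrain_stage2
```

Stopping early avoids launching a later stage against a missing or incomplete checkpoint from an earlier one.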

Pretrained Models

models trained with 4M data:

Evaluation

bash run_scripts/blip-T/eval/eval_gqa_zeroshot_opt2.7b.sh
bash run_scripts/blip-T/eval/eval_okvqa_zeroshot_opt2.7b.sh
bash run_scripts/blip-T/eval/validate_vqa_zeroshot_opt2.7b.sh
bash run_scripts/blip-T/eval/eval_cap_coco_opt2.7b.sh
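
To run all four evaluations in one go and keep a log per benchmark, a loop like the one below may help. This is a sketch under assumptions: the helper name run_evals and the eval_logs output directory are hypothetical and not part of the repository. Unlike training, the evaluations are independent, so the loop continues after a failure and reports it at the end.

```shell
# Hypothetical helper (not part of this repo): run each evaluation script,
# write its output to <log_dir>/<name>.log, and keep going if one fails.
# Usage: run_evals <script_dir> <log_dir> <script_name>...
run_evals() {
    local script_dir="$1" log_dir="$2"
    shift 2
    mkdir -p "${log_dir}"
    local name failed=0
    for name in "$@"; do
        if bash "${script_dir}/${name}.sh" > "${log_dir}/${name}.log" 2>&1; then
            echo "ok   ${name}"
        else
            echo "FAIL ${name} (see ${log_dir}/${name}.log)"
            failed=1
        fi
    done
    # Non-zero status if any evaluation failed.
    return "${failed}"
}

# Example, from the repository root:
#   run_evals run_scripts/blip-T/eval eval_logs \
#       eval_gqa_zeroshot_opt2.7b eval_okvqa_zeroshot_opt2.7b \
#       validate_vqa_zeroshot_opt2.7b eval_cap_coco_opt2.7b
```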

Training and Evaluation Logs

You can find our training and evaluation logs here.

Acknowledgements

The code is built on the BLIP2 and LAVIS projects.

Citation

@inproceedings{
    jian2023bootstrapping,
    title={Bootstrapping Vision-Language Learning with Decoupled Language Pre-training},
    author = {Jian, Yiren and Gao, Chongyang and Vosoughi, Soroush},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=8Kch0ILfQH}
}