yiren-jian / BLIText-video

[NeurIPS 2023] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training: Video Captioning

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

This repo contains the implementation of VideoCaption + P-former from Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. The code is developed on top of the LAVIS project (cloned on Mar 9, 2023).

We mainly add new files under lavis/models/blip2_models (P-former was named darkformer during development).

Installation

# install lavis based on official LAVIS guideline
conda create -n lavis python=3.8
conda activate lavis
pip install -e .

# fix package version issues, use transformers==4.26.1
pip install -r pip_freeze.txt

The experiments were carried out on a single RTX A6000. We provide our exact environment in pip_freeze.txt to make it easier to closely reproduce our results.
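As a quick sanity check (not part of the official instructions), you can confirm from Python that the editable install and the pinned transformers version were picked up:

import lavis         # confirms the editable install of this repo succeeded
import transformers  # pinned by pip_freeze.txt

print(transformers.__version__)  # expected: 4.26.1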

Data Preparation

The I3D features for VATEX can be downloaded from the official VATEX site.
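The exact layout depends on the VATEX release you download, but the I3D features are typically shipped as one NumPy array per video clip. A minimal sketch for inspecting a downloaded feature file (the path below is a placeholder, not a layout the repo requires):

import numpy as np

# placeholder path -- point this at one of the unpacked VATEX I3D feature files
feat = np.load("data/vatex/i3d/example_video.npy")
print(feat.shape, feat.dtype)  # temporal I3D features for a single clip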

Pre-trained Models

For video captioning, we use a P-former pretrained with 40M data (>12M). The pretrained P-former and the captioner weights can be downloaded from here.
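Once downloaded, you can sanity-check a checkpoint before pointing the training configs at it. A minimal sketch, assuming the released weights are standard torch.save files (the filename below is a placeholder):

import torch

# placeholder filename -- use the checkpoint you actually downloaded
ckpt = torch.load("pformer_pretrained.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # LAVIS-style checkpoints usually nest the weights under "model"
print(len(state), "parameter tensors")
print(list(state)[:5])  # peek at a few parameter names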

Training

Stage 1

bash run_scripts/blip2/train/train_caption_vatex_stage1.sh

Stage 2

bash run_scripts/blip2/train/train_caption_vatex_stage2.sh

You can safely ignore the missing keys warning; see the related discussion here.
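In general, this warning comes from loading a checkpoint with load_state_dict(strict=False) into a model that also contains freshly initialized modules absent from the checkpoint. A generic PyTorch sketch (not this repo's actual modules) of why the reported keys are harmless:

import torch.nn as nn

class Stage1(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)

class Stage2(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)  # new module, absent from the earlier checkpoint

# strict=False loads the overlapping weights and simply reports what is missing
result = Stage2().load_state_dict(Stage1().state_dict(), strict=False)
print(result.missing_keys)  # ['decoder.weight', 'decoder.bias'] -- initialized fresh and trained later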

Evaluation

We use CLIPScore for evaluation. Put compute_score.py in the clipscore/ folder and run:

python compute_score.py
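For reference, CLIPScore (Hessel et al., 2021) is defined as 2.5 * max(cos(image embedding, caption embedding), 0). The sketch below only illustrates that formula for a single frame/caption pair with the openai CLIP package; it is not the batched pipeline that compute_score.py runs, and the path and caption are placeholders:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# placeholder inputs: one sampled video frame and one generated caption
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a man is playing basketball outdoors"]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)

# cosine similarity, clipped at 0 and rescaled by 2.5 as in the CLIPScore paper
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
clip_score = 2.5 * torch.clamp((img_emb * txt_emb).sum(dim=-1), min=0)
print(clip_score.item())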

Training and Evaluation Logs

You can find our training (stage-1 and stage-2) and evaluation (with scores) logs here.