This repo contains the implementations of VideoCaption + P-Former in Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. The code is developed on top of the LAVIS project (cloned on Mar 9, 2023).
We mainly add the following files in `lavis/models/blip2_models` (the P-Former was named `darkformer` during development); a loading sketch follows the list:
- `video_feature_opt.py`
- `video_feature_opt_new.py`
- `video_feature_opt_stage1.py`
- `video_feature_opt_stage2.py`
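If the added models follow the usual LAVIS pattern, they are registered with the LAVIS model registry and can be loaded through the standard API. A minimal sketch (the `name` and `model_type` strings below are placeholders, not names confirmed by this repo; check the `@registry.register_model(...)` decorator and the config dict in each file for the actual values):

```python
# Locate and load one of the added models through LAVIS (names are placeholders).
import torch
from lavis.models import model_zoo, load_model

print(model_zoo)  # lists all models registered in this LAVIS fork

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_model(
    name="video_feature_opt",    # placeholder; use the name from @registry.register_model(...)
    model_type="caption_vatex",  # placeholder; see the model's config dict
    is_eval=True,
    device=device,
)
```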
# install LAVIS following the official LAVIS guideline
conda create -n lavis python=3.8
conda activate lavis
pip install -e .
# fix package version issues; we use transformers==4.26.1
pip install -r pip_freeze.txt
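After installation, you can quickly verify the environment (a minimal sanity-check sketch; it only checks versions and imports, nothing repo-specific):

```python
# Sanity-check the installed environment.
import torch
import transformers
import lavis  # fails if the editable install above did not work

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expected: 4.26.1
```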
The experiments are carried out on a single RTX A6000. We provide our environment in `pip_freeze.txt` for closely reproducing our results.
The I3D features of VATEX can be downloaded from VATEX.
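To sanity-check a downloaded feature file, you can inspect its shape (a minimal sketch, assuming the features are distributed as NumPy `.npy` arrays; the path is a placeholder and the actual format may differ):

```python
# Inspect one downloaded I3D feature file (path and .npy format are assumptions).
import numpy as np

feat = np.load("vatex_i3d/video_id.npy")  # placeholder path
print("shape:", feat.shape, "dtype:", feat.dtype)  # e.g. (num_segments, feature_dim)
```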
For video captioning, we use a P-Former pretrained with 40M data (>12M). The pretrained P-Former and the captioner weights will be released soon. Please use the pretrained models from here (a checkpoint-inspection sketch follows the path list):
- `models/ours/pformer/checkpoint_60000.pth`
- `models/ours/Caption_vatex_stage1`
- `models/ours/Caption_vatex_stage2`
- `models/ours/Caption_vatex_stage2/20240102015/result`
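To verify a downloaded checkpoint before training or evaluation, you can inspect it with plain PyTorch (a minimal sketch; LAVIS-style checkpoints usually store the weights under a "model" key, adjust if this one is organized differently):

```python
# Inspect the P-Former checkpoint (the "model" key is the common LAVIS layout).
import torch

ckpt = torch.load("models/ours/pformer/checkpoint_60000.pth", map_location="cpu")
print("top-level keys:", list(ckpt.keys()))

state_dict = ckpt.get("model", ckpt)  # fall back to the raw dict if there is no "model" key
print("number of entries:", len(state_dict))
for k in list(state_dict)[:5]:  # preview a few parameter names and shapes
    print(k, tuple(state_dict[k].shape))
```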
Stage 1:
bash run_scripts/blip2/train/train_caption_vatex_stage1.sh

Stage 2:
bash run_scripts/blip2/train/train_caption_vatex_stage2.sh
You can safely ignore the missing keys warning; see the related discussion here.
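If you want to see exactly which keys are involved, PyTorch reports them when a checkpoint is loaded with `strict=False` (a generic sketch; the dummy module below is a placeholder so the snippet runs stand-alone, replace it with the captioner you actually instantiated):

```python
# List missing/unexpected keys when loading a checkpoint into a model.
import torch

model = torch.nn.Linear(4, 4)  # placeholder; use your instantiated captioner here
ckpt = torch.load("models/ours/pformer/checkpoint_60000.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)

result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```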
We use CLIPScore. Put `compute_score.py` in the `/clipscore` directory and run:
python compute_score.py
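For reference, CLIPScore (Hessel et al., 2021) is the rescaled cosine similarity between CLIP image and text embeddings, 2.5 * max(cos(image, text), 0). The sketch below only illustrates that computation with the Hugging Face CLIP implementation and placeholder inputs; it is not the repo's `compute_score.py`:

```python
# Illustrative CLIPScore for a single frame-caption pair (not the repo's compute_score.py).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg")          # placeholder: a sampled video frame
caption = "a man is playing the guitar"  # placeholder caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

cos = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds).item()
print("CLIPScore:", 2.5 * max(cos, 0.0))
```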
You can find our training (stage-1 and stage-2) and evaluation (with scores) logs here.