
CLIP-Captioner

PyTorch Implementation of the paper

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang Yang, Tong Zhang and Yuexian Zou*.

[Springer], [ArXiv]

Update Notes

[2023-10-22] Please refer to our latest repository that inherits this codebase!

[2023-04-11] Release all pre-extracted features used in the paper

[2023-03-16] Add guideline for audio feature extraction; Update links for downloading pre-extracted features

[2023-02-23] Release the "dust-laden" code

Environment

  1. Clone the repo:
git clone https://github.com/yangbang18/CLIP-Captioner.git --recurse-submodules

# Alternatively
git clone https://github.com/yangbang18/CLIP-Captioner.git
git submodule update --init
  2. Use Anaconda to create a new environment:
conda create -n vc python==3.7
conda activate vc
  3. Install necessary packages:
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install pytorch-lightning==1.5.1
pip install pandas h5py nltk pillow wget

Here we use torch 1.7.1 built for CUDA 10.1. Lower versions of torch are incompatible with the CLIP submodule.
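
As a quick sanity check after installation, a minimal Python snippet like the following (a sketch, assuming the CLIP submodule is on your import path) can confirm that torch and CLIP load correctly:

```python
import torch
import clip  # provided by the CLIP submodule

# Confirm the expected torch build and that CUDA is visible.
print(torch.__version__)          # e.g. 1.7.1+cu101
print(torch.cuda.is_available())  # True if the CUDA 10.1 build matches your driver

# Listing the available CLIP architectures confirms the submodule is importable.
print(clip.available_models())    # e.g. ['RN50', 'RN101', 'RN50x4', 'ViT-B/32', ...]
```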

Corpora/Feature Preparation

Please remember to modify base_data_path in config/Constants.py

Due to legal and privacy concerns, we cannot share the videos or clips downloaded from YouTube. Instead, we share the pre-processed files and pre-extracted feature files: MSRVTT, VATEX.
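
If you download the pre-extracted features, an h5py sketch like the one below can be used to inspect a feature file. The file name comes from the extraction commands further down; the per-video key layout is an assumption and may differ from the actual format.

```python
import h5py

# Hypothetical path; adjust base_data_path and the dataset folder to your setup.
path = "MSRVTT/image_R101_fixed60.hdf5"

with h5py.File(path, "r") as f:
    # Print the first few dataset keys and their shapes
    # (the exact key naming, e.g. video ids, is an assumption).
    for i, key in enumerate(f.keys()):
        print(key, f[key].shape)
        if i >= 4:
            break
```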

Preparing data on your own (click for details)

1. Preprocessing corpora:
```
python pretreatment/prepare_corpora.py --dataset MSRVTT --sort_vocab --attribute_first
python pretreatment/prepare_corpora.py --dataset VATEX --sort_vocab --attribute_first
```
2. Feature extraction:
* Downloading all video files of MSRVTT and VATEX
* Extracting frames
```
python pretreatment/extract_frames_from_videos.py \
    --video_path $path_to_video_files \
    --frame_path $path_to_save_frames \
    --video_suffix mp4 \
    --frame_suffix jpg \
    --strategy 0
```
* Extracting image features of INP models
```
python pretreatment/extract_image_feats_from_frames.py \
    --frame_path $path_to_load_frames \
    --feat_path $base_data_path/$dataset_name \
    --feat_name image_R101_fixed60.hdf5 \
    --model resnet101 \
    --frame_suffix jpg \
    --gpu 0
```
* Extracting image features of CLIP models
```
python pretreatment/clip_feats.py --dataset MSRVTT --arch RN50
python pretreatment/clip_feats.py --dataset MSRVTT --arch RN101
python pretreatment/clip_feats.py --dataset MSRVTT --arch RN50x4
python pretreatment/clip_feats.py --dataset MSRVTT --arch ViT-B/32
```
* Extracting motion features: refer to [yangbang18/video-classification-3d-cnn](https://github.com/yangbang18/video-classification-3d-cnn)
* Extracting audio features: refer to [yangbang18/vggish](https://github.com/yangbang18/vggish)
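
For reference, the core of CLIP image feature extraction looks roughly like the sketch below, using the public openai CLIP API. The frame paths are placeholders and the saving step is omitted; pretreatment/clip_feats.py remains the authoritative script.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical list of sampled frame paths for a single video.
frame_paths = ["frames/video0/0001.jpg", "frames/video0/0002.jpg"]

with torch.no_grad():
    # Preprocess frames and encode them with CLIP's image encoder.
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    feats = model.encode_image(images)  # (num_frames, feature_dim), e.g. 512 for ViT-B/32

print(feats.shape)
```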

Training

python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--modality $modality_combination \
--arch $arch_name \
--task $task_name
  1. Supported datasets: MSRVTT, VATEX.
  2. Supported methods, whose configurations can be found in config/methods.yaml.
  3. Supported feats, whose configurations can be found in config/feats.yaml.
  4. Supported modality combinations: any combination of a (audio), m (motion), and i (image).
  5. Supported archs, whose configurations can be found in config/archs.yaml.
  6. Supported tasks, whose configurations can be found in config/tasks.yaml.

Note: In the published version of our paper, attribute prediction (AP) is renamed to concept detection (CD). The task name is kept unchanged here.

Examples:

python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality mi --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality i --arch base --task diff_feats

python train.py --dataset MSRVTT --method Transformer --feats ViT --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats ViT --modality ami --arch base --task DAP

python train.py --dataset MSRVTT --method TopDown --feats ViT --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method TopDown --feats ViT --modality ami --arch base --task DAP_RNN

Testing

python translate.py --checkpoint_paths $path_to_the_checkpoint
python translate.py --checkpoint_paths $path_to_the_checkpoint1 $path_to_the_checkpoint2  # ensembling

Analysis/Examples

Please see the notebooks folder.

Citation

Please [★star] this repo and [cite] the following papers if you find our code or models useful for your research:

@inproceedings{yang2022clip,
  title={CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter},
  author={Yang, Bang and Zhang, Tong and Zou, Yuexian},
  booktitle={Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision},
  pages={368--381},
  year={2022},
  organization={Springer}
}

@inproceedings{yang2021NACF,
  title={Non-Autoregressive Coarse-to-Fine Video Captioning}, 
  author={Yang, Bang and Zou, Yuexian and Liu, Fenglin and Zhang, Can},   
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={35},
  number={4},
  pages={3119--3127},
  year={2021}
}

Acknowledgements