
CLIP-Captioner

PyTorch Implementation of the paper

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang Yang, Tong Zhang and Yuexian Zou*.

[Springer], [ArXiv]

Update Notes

[2023-10-22] Please refer to our latest repository that inherits this codebase!

[2023-04-11] Release all pre-extracted features used in the paper

[2023-03-16] Add guideline for audio feature extraction; Update links for downloading pre-extracted features

[2023-02-23] Release the "dust-laden" code

Environment

  1. Clone the repo:
git clone https://github.com/yangbang18/CLIP-Captioner.git --recurse-submodules

# Alternatively
git clone https://github.com/yangbang18/CLIP-Captioner.git
git submodule update --init
  2. Use Anaconda to create a new environment:
conda create -n vc python==3.7
conda activate vc
  3. Install necessary packages:
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install pytorch-lightning==1.5.1
pip install pandas h5py nltk pillow wget

Here we use torch 1.7.1 built for CUDA 10.1. Lower versions of torch are incompatible with the CLIP submodule.
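
As a quick sanity check after installation, a minimal Python snippet like the following (a sketch, assuming the CLIP submodule is on your import path) can confirm that torch and CLIP load correctly:

```python
import torch
import clip  # provided by the CLIP submodule

# Confirm the expected torch build and that CUDA is visible.
print(torch.__version__)          # e.g. 1.7.1+cu101
print(torch.cuda.is_available())  # True if the CUDA 10.1 build matches your driver

# Listing the available CLIP architectures confirms the submodule is importable.
print(clip.available_models())    # e.g. ['RN50', 'RN101', 'RN50x4', 'ViT-B/32', ...]
```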

Corpora/Feature Preparation

Please remember to modify base_data_path in config/Constants.py

Due to legal and privacy concerns, we cannot share the videos or clips downloaded from YouTube. Instead, we share the pre-processed files and pre-extracted feature files: MSRVTT, VATEX.
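
If you download the pre-extracted features, an h5py sketch like the one below can be used to inspect a feature file. The file name comes from the extraction commands further down; the per-video key layout is an assumption and may differ from the actual format.

```python
import h5py

# Hypothetical path; adjust base_data_path and the dataset folder to your setup.
path = "MSRVTT/image_R101_fixed60.hdf5"

with h5py.File(path, "r") as f:
    # Print the first few dataset keys and their shapes
    # (the exact key naming, e.g. video ids, is an assumption).
    for i, key in enumerate(f.keys()):
        print(key, f[key].shape)
        if i >= 4:
            break
```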

Preparing data on your own (click for details)

1. Preprocessing corpora:
```
python pretreatment/prepare_corpora.py --dataset MSRVTT --sort_vocab --attribute_first
python pretreatment/prepare_corpora.py --dataset VATEX --sort_vocab --attribute_first
```
2. Feature extraction:
* Downloading all video files of MSRVTT and VATEX
* Extracting frames
```
python pretreatment/extract_frames_from_videos.py \
    --video_path $path_to_video_files \
    --frame_path $path_to_save_frames \
    --video_suffix mp4 \
    --frame_suffix jpg \
    --strategy 0
```
* Extracting image features of INP models
```
python pretreatment/extract_image_feats_from_frames.py \
    --frame_path $path_to_load_frames \
    --feat_path $base_data_path/$dataset_name \
    --feat_name image_R101_fixed60.hdf5 \
    --model resnet101 \
    --frame_suffix jpg \
    --gpu 0
```
* Extracting image features of CLIP models
```
python pretreatment/clip_feats.py --dataset MSRVTT --arch RN50
python pretreatment/clip_feats.py --dataset MSRVTT --arch RN101
python pretreatment/clip_feats.py --dataset MSRVTT --arch RN50x4
python pretreatment/clip_feats.py --dataset MSRVTT --arch ViT-B/32
```
* Extracting motion features: refer to [yangbang18/video-classification-3d-cnn](https://github.com/yangbang18/video-classification-3d-cnn)
* Extracting audio features: refer to [yangbang18/vggish](https://github.com/yangbang18/vggish)
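
For reference, the core of CLIP image feature extraction looks roughly like the sketch below, using the public openai CLIP API. The frame paths are placeholders and the saving step is omitted; pretreatment/clip_feats.py remains the authoritative script.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical list of sampled frame paths for a single video.
frame_paths = ["frames/video0/0001.jpg", "frames/video0/0002.jpg"]

with torch.no_grad():
    # Preprocess frames and encode them with CLIP's image encoder.
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    feats = model.encode_image(images)  # (num_frames, feature_dim), e.g. 512 for ViT-B/32

print(feats.shape)
```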

Training

python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--modality $modality_combination \
--arch $arch_name \
--task $task_name
  1. Supported datasets: MSRVTT, VATEX.
  2. Supported methods, whose configurations can be found in config/methods.yaml.
  3. Supported feats, whose configurations can be found in config/feats.yaml.
  4. Supported modality combinations: any combination of a (audio), m (motion), and i (image).
  5. Supported archs, whose configurations can be found in config/archs.yaml.
  6. Supported tasks, whose configurations can be found in config/tasks.yaml.

Note: In the published version of our paper, attribute prediction (AP) is renamed to concept detection (CD). The task name is kept unchanged here.

Examples:

python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality mi --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality i --arch base --task diff_feats

python train.py --dataset MSRVTT --method Transformer --feats ViT --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats ViT --modality ami --arch base --task DAP

python train.py --dataset MSRVTT --method TopDown --feats ViT --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method TopDown --feats ViT --modality ami --arch base --task DAP_RNN

Testing

python translate.py --checkpoint_paths $path_to_the_checkpoint
python translate.py --checkpoint_paths $path_to_the_checkpoint1 $path_to_the_checkpoint2  # ensembling

Analysis/Examples

Please see the notebooks folder.

Citation

Please [★star] this repo and [cite] the following papers if you find our code or models useful for your research:

@inproceedings{yang2022clip,
  title={CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter},
  author={Yang, Bang and Zhang, Tong and Zou, Yuexian},
  booktitle={Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision},
  pages={368--381},
  year={2022},
  organization={Springer}
}

@inproceedings{yang2021NACF,
  title={Non-Autoregressive Coarse-to-Fine Video Captioning}, 
  author={Yang, Bang and Zou, Yuexian and Liu, Fenglin and Zhang, Can},   
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={35},
  number={4},
  pages={3119--3127},
  year={2021}
}

Acknowledgements