PyTorch Implementation of the paper
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Bang Yang, Tong Zhang and Yuexian Zou*.
[2023-10-22] Please refer to our latest repository that inherits this codebase!
[2023-04-11] Release all pre-extracted features used in the paper
[2023-03-16] Add guideline for audio feature extraction; Update links for downloading pre-extracted features
[2023-02-23] Release the "dust-laden" code
git clone https://github.com/yangbang18/CLIP-Captioner.git --recurse-submodules
# Alternatively
git clone https://github.com/yangbang18/CLIP-Captioner.git
git submodule update --init
conda create -n vc python==3.7
conda activate vc
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install pytorch-lightning==1.5.1
pip install pandas h5py nltk pillow wget
Here we use torch 1.7.1 built against CUDA 10.1. Lower versions of torch are incompatible with the CLIP submodule.
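After installing, a quick sanity check (a minimal snippet, not part of this repo) confirms the expected torch/CUDA combination is active:

# verify the installed versions match the ones pinned above
import torch
import torchvision

print(torch.__version__)          # expect 1.7.1+cu101
print(torchvision.__version__)    # expect 0.8.2+cu101
print(torch.version.cuda)         # expect 10.1
print(torch.cuda.is_available())  # expect True on a GPU machine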
└── base_data_path
├── MSRVTT
│ ├── feats
│ │ ├── image_R101_fixed60.hdf5
│ │ ├── CLIP_RN50.hdf5
│ │ ├── CLIP_RN101.hdf5
│ │ ├── CLIP_RN50x4.hdf5
│ │ ├── CLIP_ViT-B-32.hdf5
│ │ ├── motion_resnext101_kinetics_fixed60.hdf5
│ │ └── audio_vggish_audioset_fixed60.hdf5
│ ├── info_corpus.pkl
│ └── refs.pkl
└── VATEX
├── feats
│ ├── ...
│ └── audio_vggish_audioset_fixed60.hdf5
├── info_corpus.pkl
└── refs.pkl
Please remember to modify base_data_path in config/Constants.py so that it points to the root of the directory tree above.
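For example (a sketch; only the base_data_path assignment is shown, and the real Constants.py contains more than this):

# config/Constants.py
base_data_path = '/path/to/your/data/root'  # the parent directory of MSRVTT/ and VATEX/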
Due to legal and privacy concerns, we cannot share the videos or clips downloaded from YouTube in any way. Instead, we share the pre-processed files and pre-extracted feature files: MSRVTT, VATEX.
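After downloading, you can sanity-check a feature file with h5py (installed above). This is a minimal sketch that assumes each video ID maps to one dataset of features; adjust it if the actual layout differs:

import h5py

# open a pre-extracted feature file and peek at the first few entries
with h5py.File('MSRVTT/feats/CLIP_ViT-B-32.hdf5', 'r') as f:
    for key in list(f.keys())[:5]:
        print(key, f[key].shape, f[key].dtype)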
python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--modality $modality_combination \
--arch $arch_name \
--task $task_name
--dataset ($dataset_name):
- MSRVTT
- VATEX

--method ($method_name):
- Transformer: our baseline (autoregressive decoding)
- TopDown: a two-layer LSTM decoder (autoregressive decoding)
- ARB: a slightly different encoder (autoregressive decoding)
- NACF: a slightly different encoder (non-autoregressive decoding)

--feats ($feats_name):
- R101: ResNet-101 (INP)
- RN50: ResNet-50 (CLIP)
- RN101: ResNet-101 (CLIP)
- RN50x4: ResNet-50x4 (CLIP)
- ViT: ViT-B/32 (CLIP)
- I3D: used in VATEX

--modality ($modality_combination): any combination of a (audio), m (motion) and i (image), e.g., ami, mi, i

--arch ($arch_name):
- base: used in MSRVTT
- large: used in VATEX

--task ($task_name):
- diff_feats: to test different combinations of modalities
- VAP: video-based attribute prediction
- VAP_SS0: VAP w/o sparse sampling
- TAP: text-based attribute prediction (used in Transformer-based methods)
- TAP_RNN: text-based attribute prediction (used in RNN-based methods)
- DAP: dual attribute prediction (used in Transformer-based methods)
- DAP_RNN: dual attribute prediction (used in RNN-based methods)

Notes: In the publication version of our paper, attribute prediction (AP) is renamed to concept detection (CD). Here I keep the task names unchanged out of laziness.
Examples:
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality mi --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats RN101 --modality i --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats ViT --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method Transformer --feats ViT --modality ami --arch base --task DAP
python train.py --dataset MSRVTT --method TopDown --feats ViT --modality ami --arch base --task diff_feats
python train.py --dataset MSRVTT --method TopDown --feats ViT --modality ami --arch base --task DAP_RNN
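To run the modality ablation in one go, a small driver loop also works; this sketch is not part of the repo and simply replays the diff_feats commands above:

import subprocess

# sweep the modality combinations shown in the examples (ami, mi, i)
for modality in ['ami', 'mi', 'i']:
    subprocess.run(
        ['python', 'train.py', '--dataset', 'MSRVTT', '--method', 'Transformer',
         '--feats', 'RN101', '--modality', modality, '--arch', 'base',
         '--task', 'diff_feats'],
        check=True,
    )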
python translate.py --checkpoint_paths $path_to_the_checkpoint
python translate.py --checkpoint_paths $path_to_the_checkpoint1 $path_to_the_checkpoint2 # ensembling
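When ensembling many checkpoints, listing the paths by hand gets tedious. A short helper can collect them (the experiments/**/*.ckpt pattern is an assumption; point the glob at wherever your training runs save checkpoints):

import glob

# collect checkpoint paths to paste after --checkpoint_paths
ckpts = sorted(glob.glob('experiments/**/*.ckpt', recursive=True))
print(' '.join(ckpts))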
Please see the notebooks folder.
Please [★star] this repo and [cite] the following papers if you find our code or models useful for your research:
@inproceedings{yang2022clip,
title={CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter},
author={Yang, Bang and Zhang, Tong and Zou, Yuexian},
booktitle={Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision},
pages={368--381},
year={2022},
organization={Springer}
}
@inproceedings{yang2021NACF,
title={Non-Autoregressive Coarse-to-Fine Video Captioning},
author={Yang, Bang and Zou, Yuexian and Liu, Fenglin and Zhang, Can},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={35},
number={4},
pages={3119--3127},
year={2021}
}