PyTorch Implementation of our TIP paper:
Concept-Aware Video Captioning: Describing Videos With Effective Prior Information
Bang Yang, Meng Cao and Yuexian Zou.
[2023-10-22] We release the code and data.
Clone and enter the repo:
git clone https://github.com/yangbang18/CARE.git
cd CARE
We has refactored the code and tested it on:
Python
3.9torch
1.13.1cuda
11.7 Other versions are also work, e.g., Python
3.7, torch
1.7.1, and cuda
10.1.
Please change the version of torch and cuda according to your hardwares.
conda create -n CARE python==3.9
conda activate CARE
# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 -f https://download.pytorch.org/whl/cu117/torch_stable.html
# Note that `torch < 1.7.1` is imcompatible with the package `clip`
pip install -r requirement.txt
1. Supported datasets (please follow README_DATA.md to prepare data)
MSVD
MSRVTT
VATEX
2. Supported methods, whose configurations can be found in config/methods.yaml
Transformer
: our baseline (autoregressive)TopDown
: same encoder as Transformer
, a two layer LSTM decoder (autoregressive)ARB
: a slight different encoder compared with Transformer
(autoregressive)NACF
: same encoder as ARB
, different decoding algorithm (non-autoregressive)3. Supported feats, whose configurations can be found in config/feats.yaml
4. Supported modality combinations: any combination of a
(audio), m
(motion) and i
(image).
5. Supported architectures, whose configurations can be found in config/archs.yaml
base
: used in MSVD, MSRVTT (model dimension: 512
)median
: used in VATEX (model dimension: 768
)large
: used in VATEX (model dimension: 1,024
)6. Supported tasks, whose configurations can be found in config/tasks.yaml.
Base
: take different combination of features for decoding.CABase
: visual-driven concept detection and ``Cross -> Semantic'' local semantic guidance.CARE
: multimodal-driven concept detection (MCD) and global-local semantic guidance (G-LSG).Concept
: basic implementations of CABase
and CARE
, it supports the modification of hyper-parameters related to concept detection.Command Format:
For the Base
task:
python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--modality $modality_combination \
--arch $arch_name \
--task $task_name
For CAbase
, CARE
, and Concept
tasks:
python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--decoder_modality_flags $flags1 \
--predictor_modality_flags $flags2 \
--arch $arch_name \
--task $task_name
Note: the only different is that we need to specify decoder_modality_flags
and predictor_modality_flags
rather than modality
. This is because modalities used for concept detection and decoding can be different. Here are some mappings between flag
and modality
(refer to the flag2modality
of config/Constants.py):
I
(image): iV
(vision): miA
(audio): aVA
(vision + audio): amiVAT
(vision + audio + text): amirExample:
python train.py \
--dataset MSRVTT \
--method Transformer \
--arch base \
--task Base \
--feats ViT \
--modality ami
python train.py \
--dataset MSRVTT \
--method Transformer \
--arch base \
--task CARE \
--feats ViT \
--decoder_modality_flags VA \
--predictor_modality_flags VAT
ckpt=/path/to/checkpoint
python translate.py --checkpoint_paths $ckpt
# evaluate on the validation set
python translate.py --checkpoint_paths $ckpt --mode validate
# evaluate on the validation set & save results to a csv file (same directory as the checkpoint)
python translate.py --checkpoint_paths $ckpt --mode validate --save_csv --csv_name val_result.csv
# evaluate on the validation set & save results to a csv file
python translate.py --checkpoint_paths $ckpt --mode validate --save_csv --csv_path ./results/csv --csv_name dummy.csv
# save caption predictions
python translate.py --checkpoint_paths $ckpt --json_path ./results/predictions --json_name dummy.json
# save detailed per-sample scores
python translate.py --checkpoint_paths $ckpt --save_detailed_scores_path ./results/detailed_scores/dummy.json
You can run the following command to gather results, where mean metric scores with their standard deviation across a number of runs are reported.
python misc/merge_csv.py --dataset MSVD --average --output_path ./results --output_name MSVD.csv
bash scripts/exp_main_MSVD.sh
bash scripts/exp_main_MSRVTT.sh
bash scripts/exp_main_VATEX.sh
Note: each script is self-enclosed, i.e., a model variant may be included in N
differnt scripts, each of which will train the model in K
(K=5 by default) different seeds ([0, K)
), resulting in N x K
runs. Just be careful.
Main Ablation Study
bash scripts/exp_ablation_main.sh
Ablation on GSG and LSG variants
bash scripts/exp_ablation_GLSG.sh
Versatility of CARE on Various Encoder-Decoder Networks
bash scripts/exp_versatility_of_CARE.sh
Note: To reproduce SwinBERT, we do not train the model in an end-to-end manner. Instead, we conduct experiments on features extracted from fine-tuned SwinBERT checkpoints. Therefore, if you want to follow our implementations, please refer to yangbang18/Video-Swin-Transformer for feature extraction first (we do not share the feature files due to their large size). Otherwise, you should comment the command lines about SwinBERT in scripts/exp_versatility_of_CARE.sh
.
Measuring retrieval performance
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch RN50
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch RN50x16
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch ViT-B/32
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch ViT-B/32 --postfix ft
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch ViT-B/16 --postfix ft
Note:
Please refer to notebooks.
Please [★star] this repo and [cite] the following papers if you feel our code and data useful to your research:
@ARTICLE{yang2023CARE,
author={Yang, Bang and Cao, Meng and Zou, Yuexian},
journal={IEEE Transactions on Image Processing},
title={Concept-Aware Video Captioning: Describing Videos With Effective Prior Information},
year={2023},
volume={32},
number={},
pages={5366-5378},
doi={10.1109/TIP.2023.3307969}
}