CARE

PyTorch Implementation of our TIP paper:

Concept-Aware Video Captioning: Describing Videos With Effective Prior Information

Bang Yang, Meng Cao and Yuexian Zou.

[IEEE Xplore]


Update Notes

[2023-10-22] We release the code and data.

Environment

Clone and enter the repo:

git clone https://github.com/yangbang18/CARE.git
cd CARE

We have refactored the code and tested it on:

Other versions also work, e.g., Python 3.7, torch 1.7.1, and CUDA 10.1.

Please adjust the versions of torch and CUDA according to your hardware.

conda create -n CARE python==3.9
conda activate CARE

# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117  -f https://download.pytorch.org/whl/cu117/torch_stable.html

# Note that `torch < 1.7.1` is incompatible with the package `clip`
pip install -r requirement.txt
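
As an optional sanity check, you can confirm that the installed torch build can see your GPU before training:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"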

Running

Overview

1. Supported datasets (please follow README_DATA.md to prepare data)

2. Supported methods, whose configurations can be found in config/methods.yaml

3. Supported feats, whose configurations can be found in config/feats.yaml

4. Supported modality combinations: any combination of a (audio), m (motion) and i (image).

5. Supported architectures, whose configurations can be found in config/archs.yaml

6. Supported tasks, whose configurations can be found in config/tasks.yaml (see the sketch after this list for a quick way to inspect these YAML files).
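
The supported names are the top-level entries of the corresponding YAML files. A minimal way to peek at what is available, assuming PyYAML is installed and that each file maps names to their option dictionaries:

python -c "import yaml; print(sorted(yaml.safe_load(open('config/methods.yaml'))))"
python -c "import yaml; print(sorted(yaml.safe_load(open('config/feats.yaml'))))"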

Training

Command Format:

For the Base task:

python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--modality $modality_combination \
--arch $arch_name \
--task $task_name

For CAbase, CARE, and Concept tasks:

python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--decoder_modality_flags $flags1 \
--predictor_modality_flags $flags2 \
--arch $arch_name \
--task $task_name

Note: the only difference is that we specify `decoder_modality_flags` and `predictor_modality_flags` instead of `modality`, because the modalities used for concept detection and for caption decoding can differ. Here are some mappings between flags and modalities (refer to `flag2modality` in `config/Constants.py`):
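
You can also print the full flag-to-modality mapping directly, assuming the repo root is on the Python path as implied by the reference to `config/Constants.py`:

python -c "from config.Constants import flag2modality; print(flag2modality)"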

Example:

python train.py \
--dataset MSRVTT \
--method Transformer \
--arch base \
--task Base \
--feats ViT \
--modality ami  

python train.py \
--dataset MSRVTT \
--method Transformer \
--arch base \
--task CARE \
--feats ViT \
--decoder_modality_flags VA \
--predictor_modality_flags VAT

Testing

ckpt=/path/to/checkpoint

python translate.py --checkpoint_paths $ckpt

# evaluate on the validation set
python translate.py --checkpoint_paths $ckpt --mode validate

# evaluate on the validation set & save results to a csv file (same directory as the checkpoint)
python translate.py --checkpoint_paths $ckpt --mode validate --save_csv --csv_name val_result.csv

# evaluate on the validation set & save results to a csv file
python translate.py --checkpoint_paths $ckpt --mode validate --save_csv --csv_path ./results/csv --csv_name dummy.csv

# save caption predictions
python translate.py --checkpoint_paths $ckpt --json_path ./results/predictions --json_name dummy.json

# save detailed per-sample scores
python translate.py --checkpoint_paths $ckpt --save_detailed_scores_path ./results/detailed_scores/dummy.json
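
To evaluate several runs in one pass, you can loop over their checkpoints with the options shown above; the experiment directory layout and checkpoint file name below (./exps/MSRVTT/*/best.ckpt) are assumptions, so adapt them to wherever your checkpoints are saved:

EXP_DIR=./exps/MSRVTT   # hypothetical experiment root
for ckpt in "$EXP_DIR"/*/best.ckpt; do
    python translate.py \
        --checkpoint_paths "$ckpt" \
        --mode validate \
        --save_csv \
        --csv_name val_result.csv
done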

Show Results

You can run the following command to gather results; it reports the mean metric scores and their standard deviations across a number of runs.

python misc/merge_csv.py --dataset MSVD --average --output_path ./results --output_name MSVD.csv

Reproducibility

Main Experiments (Compared with SOTA)

bash scripts/exp_main_MSVD.sh
bash scripts/exp_main_MSRVTT.sh
bash scripts/exp_main_VATEX.sh

Ablation Study

Note: each script is self-contained, i.e., a model variant may be included in N different scripts, and each script trains the model with K different seeds ([0, K), K=5 by default), resulting in N x K runs in total. Just be careful.
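
Roughly speaking, each ablation script amounts to a loop over seeds like the sketch below; this is illustrative only, the `--seed` argument name is an assumption, and the scripts under scripts/ remain the authoritative reference:

K=5
for seed in $(seq 0 $((K - 1))); do
    python train.py \
        --dataset MSRVTT \
        --method Transformer \
        --arch base \
        --task CARE \
        --feats ViT \
        --decoder_modality_flags VA \
        --predictor_modality_flags VAT \
        --seed "$seed"
done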

Analysis

Please refer to notebooks.

Citation

Please [★star] this repo and [cite] the following paper if you find our code and data useful for your research:

@ARTICLE{yang2023CARE,
  author={Yang, Bang and Cao, Meng and Zou, Yuexian},
  journal={IEEE Transactions on Image Processing}, 
  title={Concept-Aware Video Captioning: Describing Videos With Effective Prior Information}, 
  year={2023},
  volume={32},
  number={},
  pages={5366-5378},
  doi={10.1109/TIP.2023.3307969}
}

Acknowledgement