The code for the IJCAI 2022 paper "GL-RG: Global-Local Representation Granularity for Video Captioning".
GL-RG exploits extensive visual representations from different video ranges to improve linguistic expression. We devise a novel global-local encoder to produce a rich semantic vocabulary. With our incremental training strategy, GL-RG successfully leverages the global-local visual representation to achieve fine-grained captioning of video contents.
This repo was tested with Python 2.7 and PyTorch 1.0.1 (CUDA 10.0, cuDNN 7) or PyTorch 0.2.0 (CUDA 8.0, cuDNN 6.0), but it should also run with more recent PyTorch versions >= 1.0 (or >= 0.2 and <= 1.0).
You can use Anaconda or Miniconda to install the dependencies:

```bash
conda create -n GL-RG-pytorch python=2.7 pytorch=1.0 scikit-image h5py requests
conda activate GL-RG-pytorch
```
Alternatively, you can install the dependencies from the provided environment file:

```bash
conda env create -f environment.yaml
conda activate GL-RG-pytorch
```
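To sanity-check the environment, you can confirm that PyTorch imports and reports the expected version (a quick convenience check, not part of the official setup):

```bash
python -c "import torch; print(torch.__version__)"
```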
First, clone this repository to any location using `--recursive`:

```bash
git clone --recursive https://github.com/ylqi/GL-RG.git
```
Check that the `coco-caption/`, `cider/`, `data/` and `model/` projects are present in your working directory. If not, please follow the detailed steps in INSTALL.md for installation and dataset preparation.
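A quick way to confirm the four directories are in place (a convenience check, not part of the official instructions):

```bash
cd GL-RG
ls coco-caption cider data model   # each should list its contents without an error
```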
Then run the following script to download the Stanford CoreNLP 3.6.0 models into `coco-caption/`:

```bash
cd coco-caption
./get_stanford_models.sh
```
| Model | Dataset | Exp. | BLEU@4 | METEOR | ROUGE-L | CIDEr | Link |
|---|---|---|---|---|---|---|---|
| GL-RG | MSR-VTT | XE | 45.5 | 30.1 | 62.6 | 51.2 | GL-RG_XE_msrvtt |
| GL-RG | MSR-VTT | DXE | 46.9 | 30.4 | 63.9 | 55.0 | GL-RG_DXE_msrvtt |
| GL-RG + IT | MSR-VTT | DR | 46.9 | 31.2 | 65.7 | 60.6 | GL-RG_DR_msrvtt |
| GL-RG | MSVD | XE | 55.5 | 37.8 | 74.7 | 94.3 | GL-RG_XE_msvd |
| GL-RG | MSVD | DXE | 57.7 | 38.6 | 74.9 | 95.9 | GL-RG_DXE_msvd |
| GL-RG + IT | MSVD | DR | 60.5 | 38.9 | 76.4 | 101.0 | GL-RG_DR_msvd |
Check out the trained model weights under the `model/` directory (following Installation) and run:

```bash
./test.sh
```
Note: Please modify `MODEL_NAME`, `EXP_NAME` and `DATASET` in `test.sh` if the experiment setting changes. For more details, please refer to TEST.md. An example configuration is sketched below.
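For instance, to evaluate the incremental-training checkpoint on MSR-VTT from the table above, the variables near the top of `test.sh` might be set as follows (the variable names come from the note above; the exact values are an assumption chosen to match the released checkpoint names):

```bash
MODEL_NAME=GL-RG_DR_msrvtt   # assumed to match the released checkpoint under model/
EXP_NAME=DR                  # experiment setting: XE | DXE | DR
DATASET=msrvtt               # choices: msrvtt | msvd
```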
For the Seeding Phase (e.g., using DXE):

```bash
./train.sh 1   # | 0 - using XE | 1 - using DXE |
```
For the Boosting Phase (e.g., using DR with the b1 baseline):

```bash
./train.sh 3   # | 2 - with SCST baseline | 3 - with b1 baseline | 4 - with b2 baseline |
```
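Putting the two phases together, a full incremental training run first seeds the model and then boosts it (the boosting phase picks up the seeded checkpoint via `--start_from`, as described in the note below):

```bash
./train.sh 1   # Seeding Phase: train with DXE
./train.sh 3   # Boosting Phase: continue with DR (b1 baseline)
```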
Note: For higher performance, please increase the batch size using `--batch_size` in `train.sh`. For more variants, set `--start_from` in `train.sh` to determine the Incremental Training entrance model, and set `--use_long_range`, `--use_short_range` and `--use_local` to enable different global-local features (see the sketch after this list):

- `--use_long_range`: enable long-range features.
- `--use_short_range`: enable short-range features.
- `--use_local`: enable local-keyframe features.

Modify `DATASET` (choices: `msrvtt`, `msvd`) in `train.sh` when switching to the MSR-VTT or MSVD benchmark.
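As a rough sketch, the training command inside `train.sh` could be extended with these options as follows (the `train.py` entry point, the `--dataset` flag, and the checkpoint path are assumptions; only the flags named above are documented here):

```bash
# --start_from: Incremental Training entrance model (assumed checkpoint path)
# --use_long_range / --use_short_range / --use_local: enable global-local features
python train.py --dataset msrvtt --batch_size 64 \
    --start_from model/GL-RG_DXE_msrvtt \
    --use_long_range --use_short_range --use_local
```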
GL-RG is released under the MIT license.
We are truly thankful for the following prior efforts, both for their knowledge contributions and for their open-source repos.
If you find our work useful in your research, please consider citing:
```bibtex
@InProceedings{yan2022glrg,
  title={GL-RG: Global-Local Representation Granularity for Video Captioning},
  author={Yan, Liqi and Wang, Qifan and Cui, Yiming and Feng, Fuli and Quan, Xiaojun and Zhang, Xiangyu and Liu, Dongfang},
  booktitle={IJCAI},
  year={2022}
}
```