wjun0830 / QD-DETR

Official pytorch repository for "QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 Paper)
https://arxiv.org/abs/2303.13874

QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection (CVPR 2023 Paper)

by WonJun Moon¹, SangEek Hyun¹, SangUk Park², Dongchan Park², Jae-Pil Heo¹

¹ Sungkyunkwan University, ² Pyler, * Equal Contribution


[Arxiv] [Paper] [Project Page] [Video]


Updates & News

Prerequisites

0. Clone this repo

1. Prepare datasets

(2023/11/21) For more recent dataset preparation instructions, please refer to CG-DETR.

QVHighlights : Download the official feature files for the QVHighlights dataset from Moment-DETR.

Download moment_detr_features.tar.gz (8GB) and extract it under the '../features' directory. You can change the data directory by modifying 'feat_root' in the shell scripts under the 'qd_detr/scripts/' directory.

tar -xf path/to/moment_detr_features.tar.gz
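
The same extraction can also be scripted from Python. The following is a minimal sketch using only the standard library; the helper name extract_features and its arguments are illustrative and not part of this repository. Whether the archive already contains a top-level folder is not stated here, so check the resulting layout against 'feat_root' after extracting.

```python
import tarfile
from pathlib import Path


def extract_features(archive_path, dest_dir="../features"):
    """Extract a feature archive into dest_dir and return its member names.

    Hypothetical helper, not part of QD-DETR. Only use it on archives you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip for a .tar.gz file).
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling extract_features("path/to/moment_detr_features.tar.gz") would unpack into '../features' relative to the current working directory; pass a different dest_dir if 'feat_root' has been changed.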

TVSum : Download the feature files for the TVSum dataset from UMT.

Download TVSum (69.1MB), and either extract it under the '../features/tvsum/' directory or change 'feat_root' in the TVSum shell scripts under 'qd_detr/scripts/tvsum/'.

2. Install dependencies. Python version 3.7 is required.

pip install -r requirements.txt

For Anaconda setup, please refer to the official Moment-DETR GitHub repository.
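
Since Python 3.7 is stated as a requirement, a quick interpreter check before installing can save a failed setup. The sketch below is an illustration only: python_version_ok is a hypothetical helper, and it assumes "required" means an exact major.minor match, which the instructions above do not spell out.

```python
import sys

REQUIRED = (3, 7)  # from the instructions above: Python version 3.7 is required


def python_version_ok(version_info=None, required=REQUIRED):
    """Return True when the interpreter's major.minor equals `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    if not python_version_ok():
        # Assumption: a mismatch is only reported, not treated as fatal.
        found = tuple(sys.version_info[:2])
        print("Warning: expected Python %d.%d, found %d.%d" % (REQUIRED + found))
```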

QVHighlights

Training

Training with video-only and with video + audio features can be run with the shell scripts below:

bash qd_detr/scripts/train.sh --seed 2018
bash qd_detr/scripts/train_audio.sh --seed 2018

To calculate the standard deviation reported in the paper, we ran training with 5 different seeds: 0, 1, 2, 3, and 2018 (2018 is the seed used in Moment-DETR). The best validation accuracy is obtained at the last epoch.
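
The multi-seed protocol above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' tooling: build_train_commands and summarize are hypothetical helpers, and the use of the sample standard deviation (rather than the population one) is an assumption, since the estimator is not stated here.

```python
import statistics

SEEDS = [0, 1, 2, 3, 2018]  # the five seeds listed above


def build_train_commands(script="qd_detr/scripts/train.sh", seeds=SEEDS):
    """Return one argv list per seed, suitable for subprocess.run(cmd, check=True)."""
    return [["bash", script, "--seed", str(seed)] for seed in seeds]


def summarize(scores):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    values = list(scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(values), statistics.stdev(values)
```

Each command list can be passed to subprocess.run one after another, and summarize can then be applied to the per-seed value of a metric once the runs finish.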

Inference, Evaluation, and CodaLab Submission for QVHighlights

Once the model is trained, hl_val_submission.jsonl and hl_test_submission.jsonl can be generated by running inference.sh.

bash qd_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash qd_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'

where {direc} is the directory under results/ that contains the saved checkpoint. For more details on submission, see standalone_eval/README.md.
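
Before uploading, it can help to confirm that a generated submission file parses as JSON Lines. The sketch below is a generic check and makes no assumption about the field names inside each record; load_jsonl and summarize_submission are hypothetical helpers, not part of this repository.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_submission(path):
    """Return (number of records, sorted field names of the first record)."""
    records = load_jsonl(path)
    keys = sorted(records[0].keys()) if records else []
    return len(records), keys
```

Running summarize_submission on hl_val_submission.jsonl gives the record count and the fields present, which can be compared against the format described in standalone_eval/README.md.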

Pretraining and Finetuning

Pretraining with ASR captions is also available. To launch pretraining, run:

bash qd_detr/scripts/pretrain.sh 

This will pretrain the QD-DETR model on the ASR captions for 100 epochs. The pretrained checkpoints and other experiment log files will be written into the results directory. Finetuning can then be launched from a pretrained checkpoint PRETRAIN_CHECKPOINT_PATH as:

bash qd_detr/scripts/train.sh --resume ${PRETRAIN_CHECKPOINT_PATH}

Note that this finetuning process is the same as standard training except that it initializes weights from a pretrained checkpoint.

TVSum

Training with video-only and with video + audio features can be run with the shell scripts below:

bash qd_detr/scripts/tvsum/train_tvsum.sh 
bash qd_detr/scripts/tvsum/train_tvsum_audio.sh 

Best results are stored in 'results_[domain_name]/best_metric.jsonl'.
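
To gather the per-domain results in one place, the best_metric.jsonl files can be collected with a short script. This is a minimal sketch: collect_best_metrics is a hypothetical helper, it assumes the 'results_[domain_name]' directories sit directly under the given root, and it makes no assumption about the fields stored in each file.

```python
import json
from pathlib import Path


def collect_best_metrics(root="."):
    """Map each 'results_<domain>' directory name to its parsed best_metric.jsonl records."""
    found = {}
    for path in sorted(Path(root).glob("results_*/best_metric.jsonl")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # One JSON object per non-empty line.
        found[path.parent.name] = [json.loads(line) for line in lines if line.strip()]
    return found
```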

Others

Other features of Moment-DETR are also available, as we use the official Moment-DETR implementation as the basis. For instructions, check their GitHub repository.

QVHighlights pretrained checkpoints

Method (Modality)        Model file
QD-DETR (Video+Audio)    Checkpoint link
QD-DETR (Video only)     Checkpoint link

Cite QD-DETR (Query-Dependent Video Representation for Moment Retrieval and Highlight Detection)

If you find this repository useful, please use the following entry for citation.

@inproceedings{moon2023query,
  title={Query-dependent video representation for moment retrieval and highlight detection},
  author={Moon, WonJun and Hyun, Sangeek and Park, SangUk and Park, Dongchan and Heo, Jae-Pil},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23023--23033},
  year={2023}
}

Contributors and Contact

If you have any questions, feel free to contact the authors: WonJun Moon (wjun0830@gmail.com), Sangeek Hyun (hse1032@gmail.com).

LICENSE

The annotation files and many parts of the implementation are borrowed from Moment-DETR. Following Moment-DETR, our code is also released under the MIT license.