
Official PyTorch repository for "QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023)
https://arxiv.org/abs/2303.13874

QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection (CVPR 2023 Paper)

by WonJun Moon1, SangEek Hyun1, SangUk Park2, Dongchan Park2, Jae-Pil Heo1

1 Sungkyunkwan University, 2 Pyler, * Equal Contribution


[Arxiv] [Paper] [Project Page] [Video]



Prerequisites

0. Clone this repo

git clone https://github.com/wjun0830/QD-DETR.git
cd QD-DETR

1. Prepare datasets

(2023/11/21) For a newer version of instructions for preparing datasets, please refer to CG-DETR.

QVHighlights : Download official feature files for QVHighlights dataset from Moment-DETR.

Download moment_detr_features.tar.gz (8GB) and extract it under the '../features' directory. You can change the data directory by modifying 'feat_root' in the shell scripts under the 'qd_detr/scripts/' directory.

tar -xf path/to/moment_detr_features.tar.gz

TVSum : Download feature files for TVSum dataset from UMT.

Download TVSum (69.1MB), and either extract it under the '../features/tvsum/' directory or change 'feat_root' in the TVSum shell scripts under 'qd_detr/scripts/tvsum/'.

2. Install dependencies. Python version 3.7 is required.

pip install -r requirements.txt

For an Anaconda setup, please refer to the official Moment-DETR GitHub.

QVHighlights

Training

Training with video only or with video + audio can be launched with the scripts below:

bash qd_detr/scripts/train.sh --seed 2018
bash qd_detr/scripts/train_audio.sh --seed 2018

To calculate the standard deviation reported in the paper, we ran with 5 different seeds: 0, 1, 2, 3, and 2018 (2018 is the seed used in Moment-DETR). The best validation accuracy is obtained at the last epoch.
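The per-seed results can then be aggregated into the mean and standard deviation reported in the paper. A minimal sketch (the scores below are placeholders, not actual results):

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation over per-seed scores,
    as reported in the paper (inputs here are placeholders)."""
    return statistics.mean(scores), statistics.stdev(scores)

# Example with hypothetical per-seed mAP values:
mean, std = summarize_runs([62.1, 62.5, 61.9, 62.3, 62.2])
```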

Inference, Evaluation, and CodaLab Submission for QVHighlights

Once the model is trained, hl_val_submission.jsonl and hl_test_submission.jsonl can be generated by running inference.sh:

bash qd_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash qd_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'

where {direc} is the directory of the saved checkpoint. For more details on submission, check standalone_eval/README.md.
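Before uploading to CodaLab, the submission files can be sanity-checked locally. A minimal sketch, assuming the QVHighlights JSONL format in which each line carries a "qid" and a "pred_relevant_windows" list of [start, end, score] triples:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line (the submission file format)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def check_submission(preds):
    """Lightly validate predictions: every entry should have a query id,
    and each predicted window should be a valid [start, end, score] triple.
    Field names are assumptions based on the QVHighlights format."""
    for p in preds:
        assert "qid" in p, "missing query id"
        for st, ed, score in p.get("pred_relevant_windows", []):
            assert st <= ed, f"invalid window for qid {p['qid']}"
    return len(preds)
```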

Pretraining and Finetuning

Pretraining with ASR captions is also available. To launch pretraining, run:

bash qd_detr/scripts/pretrain.sh 

This will pretrain the QD-DETR model on the ASR captions for 100 epochs; the pretrained checkpoints and other experiment logs will be written to results. With a pretrained checkpoint PRETRAIN_CHECKPOINT_PATH, finetuning can be launched as:

bash qd_detr/scripts/train.sh  --resume ${PRETRAIN_CHECKPOINT_PATH}

Note that this finetuning process is the same as standard training, except that it initializes weights from the pretrained checkpoint.

TVSum

Training with video only or with video + audio can be launched with the scripts below:

bash qd_detr/scripts/tvsum/train_tvsum.sh 
bash qd_detr/scripts/tvsum/train_tvsum_audio.sh 

Best results are stored in 'results_[domain_name]/best_metric.jsonl'.
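To compare domains at a glance, the per-domain result files can be collected into one dictionary. A minimal sketch, assuming the 'results_[domain_name]/best_metric.jsonl' layout noted above:

```python
import glob
import json
import os

def collect_best_metrics(root="."):
    """Gather the best_metric.jsonl file written for each TVSum domain
    (path pattern assumed from the layout above) and return a
    {domain_name: last_record} mapping."""
    results = {}
    for path in glob.glob(os.path.join(root, "results_*", "best_metric.jsonl")):
        domain = os.path.basename(os.path.dirname(path)).replace("results_", "")
        with open(path) as f:
            lines = [json.loads(line) for line in f if line.strip()]
        if lines:
            results[domain] = lines[-1]  # keep the most recent record
    return results
```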

Others

Other functionalities are also available, since we use the official Moment-DETR implementation as our basis. For the instructions, check their GitHub.

QVHighlights pretrained checkpoints

Method (Modality) Model file
QD-DETR (Video+Audio) Checkpoint link
QD-DETR (Video only) Checkpoint link

Cite QD-DETR (Query-Dependent Video Representation for Moment Retrieval and Highlight Detection)

If you find this repository useful, please use the following entry for citation.

@inproceedings{moon2023query,
  title={Query-dependent video representation for moment retrieval and highlight detection},
  author={Moon, WonJun and Hyun, Sangeek and Park, SangUk and Park, Dongchan and Heo, Jae-Pil},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23023--23033},
  year={2023}
}

Contributors and Contact

If there are any questions, feel free to contact the authors: WonJun Moon (wjun0830@gmail.com), Sangeek Hyun (hse1032@gmail.com).

LICENSE

The annotation files and many parts of the implementation are borrowed from Moment-DETR. Accordingly, our code is also released under the MIT license.