wjun0830 / QD-DETR

Official pytorch repository for "QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 Paper)
https://arxiv.org/abs/2303.13874

Training on Charades #1

Closed · Lonicer closed this issue 1 year ago

Lonicer commented 1 year ago

Excuse me, how does your code train and test on the charades dataset? There are no related commands and information on GitHub, thank you.

wjun0830 commented 1 year ago

Hello. Thanks for your interest in our work.

For features:

  - VGG features: available from UMT.
  - C3D features: can be extracted by following the instructions here.
  - SF+C features: we followed Moment-DETR and used Linjie Li's HERO. As with the C3D features, download the dataset and use the pretrained weights to extract features every 1 second.

Labels are available in the UMT GitHub repository and in a Moment-DETR issue.

The Charades-STA dataset code is not available in this repository. However, you should be able to implement it easily, since it follows the same format as QVHighlights. Check https://github.com/jayleicn/moment_detr/issues/11.
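For reference, each entry in the QVHighlights-style jsonl annotation file is one JSON object per line. A hypothetical Charades-STA entry could look like the sketch below; the field names follow QVHighlights, while the query, video id, and times are made up, and Charades-STA provides a single ground-truth window per query with no manual saliency annotations.

    # A hypothetical Charades-STA annotation in the QVHighlights jsonl format
    # (written as a Python dict here; in the jsonl file it is one JSON object per line).
    charades_entry = {
        "qid": 0,                             # made-up query id
        "query": "person opens the door",     # Charades-STA sentence query
        "vid": "0A8CF",                       # made-up Charades video id
        "duration": 30.96,                    # video length in seconds
        "relevant_windows": [[11.2, 19.4]],   # single GT [start_sec, end_sec] moment
    }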

Lonicer commented 1 year ago

Thanks

Lonicer commented 1 year ago

Excuse me again, where do the features of text queries in Charades come from? The link you provided seems to only feature videos, thanks!

wjun0830 commented 1 year ago

We used clip-pretrained models to extract text features for every word. CLIP link

Lonicer commented 1 year ago

Thanks!

Lonicer commented 1 year ago

Hello, I followed the link you gave (jayleicn/moment_detr#11) and tried to add the corresponding code in start_end_dataset.py:

    elif self.dset_name == 'charades':
        # span labels from the ground-truth window(s): (#windows, 2)
        model_inputs["span_labels"] = self.get_span_labels(meta["relevant_windows"], ctx_l)
        # Charades-STA has only one GT window per query, so use it as the saliency query
        model_inputs["saliency_pos_labels"], model_inputs["saliency_neg_labels"], model_inputs["saliency_all_labels"] = \
            self.get_saliency_labels_sub_as_query(meta["relevant_windows"][0], ctx_l)

The training itself ran without errors, but the following error was raised during evaluation. Do you know what causes it?

    Traceback (most recent call last):
      File "qd_detr/train.py", line 418, in <module>
        best_ckpt_path, eval_split_name, eval_path, debug, opt = start_training()
      File "qd_detr/train.py", line 412, in start_training
        train(model, criterion, optimizer, lr_scheduler, train_dataset, eval_dataset, opt)
      File "qd_detr/train.py", line 162, in train
        logger.info("metrics_no_nms {}".format(pprint.pformat(metrics_no_nms["brief"], indent=4)))
    TypeError: 'NoneType' object is not subscriptable

Thanks!

hse1032 commented 1 year ago

Hello,

Your error message says that the variable "metrics_no_nms" is a NoneType object.

I recommend checking the return values of the evaluation code. In particular, the Charades-STA dataset does not have highlight detection labels, so I suspect this is what causes errors like yours.

Lonicer commented 1 year ago

Thank you for your reply.

  1. After some investigation, it turned out that all videos in the validation set are shorter than 30 s, so there are no "long" videos and an error was raised when computing the corresponding mAP. I therefore changed the corresponding code in eval.py to the following:
    if not opt.dset_name == "charades":
        length_ranges = [[0, 10], [10, 30], [30, 150], [0, 150], ]
        range_names = ["short", "middle", "long", "full"]
    else:
        length_ranges = [[0, 10], [10, 30], [0, 150], ]
        range_names = ["short", "middle", "full"]

    Also, because I cannot obtain the corresponding relevant_clip_ids and other labels, I removed the following code from eval.py:

    highlight_det_scores = eval_highlight(submission, ground_truth, verbose=verbose)
    eval_metrics.update(highlight_det_scores)
    highlight_det_scores_brief = dict([
        (f"{k}-{sub_k.split('-')[1]}", v[sub_k])
        for k, v in highlight_det_scores.items() for sub_k in v])
    eval_metrics_brief.update(highlight_det_scores_brief)

    However, after doing this, the results I get are abnormally low. Do you know why? Also, does clip_length need to be adjusted for the Charades dataset?

  2. How did you finally set up the ground truth for the Charades dataset, in particular relevant_clip_ids and saliency_scores?

hse1032 commented 1 year ago

Hello,

Below are the answers to your questions.

  1. As noted in https://github.com/jayleicn/moment_detr/issues/11#issuecomment-1076994428, we use clip_length = 1 for the Charades-STA dataset. Importantly, the features from HERO should also be extracted with "clip_len=1".
  2. If by ground truth you mean the saliency score labels, we compute "relevant_clip_ids" from "relevant_windows" in the dataloader and set the saliency scores of the relevant clips to 1 (a sketch is given below).

I hope this resolves your questions. Thanks.
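For concreteness, below is a minimal sketch of that computation, assuming clip_length = 1 as in point 1; the function name and the exact rounding are illustrative and not necessarily identical to the repository's get_saliency_labels_sub_as_query.

    import math
    import numpy as np

    def saliency_from_window(relevant_window, ctx_l, clip_len=1.0):
        """Derive clip-level saliency labels from the single Charades-STA GT window.

        relevant_window: [start_sec, end_sec] of the ground-truth moment
        ctx_l: number of clips in the video
        """
        st_sec, ed_sec = relevant_window
        # clip i covers the time span [i * clip_len, (i + 1) * clip_len)
        st_clip = max(0, int(st_sec / clip_len))
        ed_clip = min(ctx_l - 1, max(st_clip, int(math.ceil(ed_sec / clip_len)) - 1))
        relevant_clip_ids = list(range(st_clip, ed_clip + 1))
        # clips inside the GT window get saliency score 1, everything else 0
        saliency_all_labels = np.zeros(ctx_l, dtype=np.float32)
        saliency_all_labels[st_clip:ed_clip + 1] = 1.0
        return relevant_clip_ids, saliency_all_labels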

Lonicer commented 1 year ago

Thank you for your reply, I will try again, thank you!

Lonicer commented 1 year ago

Hello, I successfully ran your code on the Charades dataset. However, since 1 second of video corresponds to 6 VGG feature frames, I sampled the features at fixed intervals to obtain the clip features.

  1. Is this the right way to do it? Moreover, my training result differs by about 1 point from the result in your paper (using the interval sampling above and CLIP text features I extracted myself). Is that normal?
  2. The VSLNet model at the C3D link you provided uses I3D features, doesn't it? If not, could you tell me how exactly the C3D features were extracted? Thank you!

hse1032 commented 1 year ago

Hello,

  1. We follow the experimental settings of UMT (https://github.com/TencentARC/UMT) when using VGG features for the Charades-STA dataset. As I remember, we set clip_len = 0.1666 instead of subsampling the features, as noted in https://github.com/TencentARC/UMT/issues/30#issuecomment-1302196153. Besides, in their configuration they use GloVe text embeddings instead of CLIP features, so we also use GloVe embeddings via the torchtext library (see the sketch after this list).

  2. You can try the C3D features at the following Google Drive link: https://drive.google.com/file/d/1CcMwae55Tuve_Ksrp5kONycyR1bVcX8D/view
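Regarding the GloVe text embeddings in point 1, below is a minimal sketch using torchtext; the choice of the 300-d "6B" vectors and the whitespace tokenization are assumptions, not necessarily the exact configuration used.

    import torch
    from torchtext.vocab import GloVe

    # Load pretrained GloVe vectors (the corpus/dimension choice here is an assumption).
    glove = GloVe(name="6B", dim=300)

    def encode_query(query: str) -> torch.Tensor:
        """Return one 300-d GloVe vector per word, shape (num_words, 300)."""
        tokens = query.lower().split()
        # out-of-vocabulary words are mapped to zero vectors
        return glove.get_vecs_by_tokens(tokens, lower_case_backup=True)

    word_feats = encode_query("person opens the door")  # torch.Size([4, 300])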

Lonicer commented 1 year ago

Sorry to bother you again. With your help, I have successfully reproduced the results with the VGG and C3D features on the Charades dataset. However, when I try to use HERO to extract video features, my GPU version is too new and I cannot run the Docker image, so I am missing the CLIP features for the experiments. Could you please provide the CLIP features for the Charades dataset? Thanks!

hpppppp8 commented 1 year ago

I wonder how to extract only the text features with CLIP. Did you do it as in this link?

wjun0830 commented 1 year ago

According to the README file in the CLIP repository, you can use the following call:

text_features = model.encode_text(text)

Meanwhile, you have to modify "def encode_text(self, text):" in model.py. Check line 352 here and modify the CLIP file accordingly (we need the last hidden state to extract word-wise text features).
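Below is a minimal sketch of such a modification, assuming the openai/CLIP repository; instead of pooling at the EOT token, the full per-token last hidden state is returned (whether to still apply text_projection is a design choice). This is an illustration, not necessarily the exact change used by the authors.

    import clip
    import torch

    def encode_text_wordwise(model, text_tokens):
        # Same forward pass as CLIP's encode_text (openai/CLIP, model.py), but the
        # per-token last hidden state is returned instead of the pooled EOT-token
        # feature, giving one feature vector per (sub)word token.
        x = model.token_embedding(text_tokens).type(model.dtype)  # (B, L, D)
        x = x + model.positional_embedding.type(model.dtype)
        x = x.permute(1, 0, 2)   # NLD -> LND
        x = model.transformer(x)
        x = x.permute(1, 0, 2)   # LND -> NLD
        x = model.ln_final(x).type(model.dtype)
        # The original encode_text pools here:
        #   x = x[torch.arange(x.shape[0]), text_tokens.argmax(dim=-1)] @ model.text_projection
        return x  # (B, L, D) word-wise text features

    model, _ = clip.load("ViT-B/32")
    device = next(model.parameters()).device
    tokens = clip.tokenize(["person opens the door"]).to(device)
    with torch.no_grad():
        word_feats = encode_text_wordwise(model, tokens)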

wjun0830 commented 1 year ago

We will look for the CLIP features, but I am not sure we still have them.

hpppppp8 commented 1 year ago

thx!!!

Lonicer commented 1 year ago

Sorry for the trouble, thank you!

wjun0830 commented 1 year ago

You can download the Charades CLIP features from here: https://drive.google.com/drive/folders/1ifon-YUajMKKVX-5mWcOHQD4v3h6yEdQ?usp=share_link

Lonicer commented 1 year ago

Thanks!!

EdenGabriel commented 1 year ago

Excuse me, can you reproduce results similar to the paper when using the C3D features? I could not reproduce the results reported in the paper on the Charades dataset, even after setting the parameters according to the discussion above.

wjun0830 commented 11 months ago

Sorry to everyone for the inconvenience. The Charades-STA experiments labeled as C3D were actually conducted with I3D features and compared against I3D benchmarking tables. The features are the ones provided here by VSLNet.

Sorry again for the confusion.

Lonicer commented 11 months ago

Thank you for the clarification. Do you mean that all of the C3D entries in this column are actually the corresponding I3D features and experimental results?

[screenshot of the results table]

wjun0830 commented 11 months ago

Yes, that's right. You can check the arXiv version of VSLNet!

Lonicer commented 11 months ago

Thanks!!!