Closed. RobertLuo1 closed this issue 1 year ago.
While we were working with VGG features, we never saw that kind of error. Since the VGG features we downloaded from UMT are temporally longer, we think the negative pool should not be None. You may want to check gt_st and gt_ed.
Thanks! Could it be caused by the max_v_l parameter? I used the original setting of 75 when training on Charades-STA with VGG features. Could you share the settings you used for the VGG features?
That may have caused the issue. We recommend setting max_v_l to a larger value!
I think we set it to -1.
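To illustrate why a small max_v_l could empty the negative pool, here is a minimal sketch. It is not the repository's code: the function `negative_pool`, its arguments, and the 30-second example video are assumptions made for illustration. With clip_len = 0.1666, 75 clips cover only about 12.5 seconds, so a ground-truth window longer than that fills the whole truncated feature sequence and leaves nothing to sample as a negative.

```python
def negative_pool(window, num_clips, clip_len, max_v_l):
    """Hypothetical sketch: clip indices that lie outside the ground-truth window.

    window    -- (start_sec, end_sec) of the ground-truth moment
    num_clips -- number of clip features available for the video
    clip_len  -- seconds covered by one clip feature
    max_v_l   -- maximum number of clips kept; -1 is treated here as "keep all"
    """
    ctx_l = num_clips if max_v_l == -1 else min(num_clips, max_v_l)
    gt_st = int(window[0] / clip_len)
    gt_ed = max(0, min(int(window[1] / clip_len), ctx_l) - 1)
    if gt_st > gt_ed:
        # The window starts beyond the truncated features; clamp it.
        gt_st = gt_ed
    return list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))


# Assumed example: a 30 s video (180 clips at ~1/6 s each), moment from 0 s to 14 s.
pool_truncated = negative_pool((0.0, 14.0), num_clips=180, clip_len=0.1666, max_v_l=75)
pool_full = negative_pool((0.0, 14.0), num_clips=180, clip_len=0.1666, max_v_l=-1)

print(len(pool_truncated))  # no negatives left after truncating to 75 clips
print(len(pool_full))       # negatives exist once all clips are kept
```

In this sketch, truncating to 75 clips leaves no candidate outside the window, while keeping every clip leaves the clips after the moment available as negatives.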
Thanks a lot! Does -1 mean that the full length of the video features is used?
Setting it to -1 as is will retrieve all clips except the last one. You have to add a line of code in the get video by vid function to handle that case!
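The behaviour described above follows from Python slicing: `feat[:-1]` drops the last element. A minimal sketch of the kind of one-line guard that could be added is shown below. The helper name `truncate_features` is hypothetical and is not the repository's actual function.

```python
import numpy as np


def truncate_features(feat, max_v_l):
    """Hypothetical helper: keep at most max_v_l clips; -1 means keep all clips."""
    if max_v_l == -1:
        # Guard: without this branch, feat[:-1] would silently drop the last clip.
        return feat
    return feat[:max_v_l]


feat = np.zeros((5, 2))  # 5 clips with 2-dimensional features (toy data)

naive = feat[:-1]                     # plain slicing with -1 keeps only 4 clips
guarded = truncate_features(feat, -1)  # guarded version keeps all 5 clips
capped = truncate_features(feat, 3)    # a positive limit keeps the first 3 clips

print(naive.shape, guarded.shape, capped.shape)
```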
Thanks a lot!
Sorry to bother you again. When I reproduce the QVHighlights results with the audio setting on the validation set, the results do not improve but decrease instead. The same happens when I train on Charades-STA with audio. Could you share the parameters for these settings?
Hmm... As far as I remember, there was no special tuning and no change to the experimental settings. For QVHighlights, we used one of the scripts in the script directory, i.e., train_audio or train_audio_umt. On the validation set, we got:
| Metric | seed 0 | seed 1 | seed 2 | seed 3 | seed 2018 | avg | std |
|---|---|---|---|---|---|---|---|
| MR-full-R1@0.5 | 63.1 | 61.94 | 62.58 | 61.81 | 63.48 | 62.582 | 0.7216093126 |
| MR-full-R1@0.7 | 47.68 | 48.13 | 47.16 | 46.52 | 47.42 | 47.382 | 0.600433177 |
| MR-full-mAP | 42.33 | 41.87 | 41.86 | 40.54 | 41.95 | 41.71 | 0.6817257513 |
| MR-full-mAP@0.5 | 63.06 | 62.09 | 63.46 | 62.16 | 62.57 | 62.668 | 0.5879370715 |
| MR-full-mAP@0.75 | 42.89 | 42.63 | 42.09 | 41.19 | 41.73 | 42.106 | 0.683725091 |
| MR-long-mAP | 47.66 | 49.91 | 47.55 | 47.82 | 49.07 | 48.402 | 1.041090774 |
| MR-middle-mAP | 44.8 | 43.2 | 43.92 | 42.13 | 43.25 | 43.46 | 0.9858752457 |
| MR-short-mAP | 9.14 | 8.73 | 8.85 | 8.77 | 8.8 | 8.858 | 0.1636154027 |
| HL-min-Fair-mAP | 75.7 | 74.41 | 75.01 | 74.19 | 74.39 | 74.74 | 0.6181423784 |
| HL-min-Fair-Hit1 | 77.74 | 74.84 | 76.58 | 74.71 | 75.48 | 75.87 | 1.280585803 |
| HL-min-Good-mAP | 64.54 | 63.54 | 63.94 | 63.3 | 63.42 | 63.748 | 0.5039047529 |
| HL-min-Good-Hit1 | 75.61 | 73.03 | 74.65 | 73.23 | 73.61 | 74.026 | 1.083642007 |
| HL-min-VeryGood-mAP | 39.92 | 38.75 | 39.22 | 38.96 | 38.78 | 39.126 | 0.4816430213 |
| HL-min-VeryGood-Hit1 | 64.97 | 61.29 | 63.94 | 62.32 | 62.13 | 62.93 | 1.490251657 |
These numbers are a little lower than the reported video-only results. However, when submitted to CodaLab, the results were slightly better than those of the video-only checkpoint.
For Charades-STA, there is only one record for VGG+audio in our saved sheet, so I think we didn't tune any parameters other than those reported.
Thanks a lot, I will check the code in detail.
Sorry to bother you. When I train the model on the Charades-STA dataset with the VGG backbone, I follow one of the other issues and set clip_len to 0.1666. However, with this setting neg_pool easily ends up being None, so the indices cannot be sampled.
So I wonder if there is any solution to that. Thanks!