Training on Charades-STA dataset with VGG backbone

RobertLuo1 commented 1 year ago

Sorry to bother you. When I train the model on the Charades-STA dataset with the VGG backbone, I follow one of the issues and set clip_len to 0.1666. However, I often run into the problem that neg_pool is None, so an index cannot be sampled from it.

So, I wonder if there is any solution to that. Thanks!

wjun0830 commented 1 year ago

While we were working with the VGG features, we never saw that kind of error. Since the VGG features we downloaded from UMT are temporally longer, we think the negative pool should not be None. You may want to check gt_st and gt_ed.

RobertLuo1 commented 1 year ago

Thanks! Could it be caused by the max_v_l parameter? I used the original setting of 75 while training on Charades-STA with VGG features. Could you share the VGG feature settings?

wjun0830 commented 1 year ago

That may have caused the issue. We recommend setting max_v_l to a larger value!
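
To illustrate why a small max_v_l could empty the negative pool, here is a minimal sketch. It is not taken from the repository: `build_neg_pool` is a hypothetical helper, and the way `gt_st` and `gt_ed` are derived from the ground-truth window is an assumption. With clip_len = 0.1666, 75 clips cover only about 12.5 seconds, so a ground-truth moment that starts at 0 and runs past that point leaves no clip outside the window.

```python
def build_neg_pool(gt_window, clip_len, ctx_l):
    """Hypothetical helper: clip indices that lie outside the ground-truth window.

    gt_window: (start_sec, end_sec) of the annotated moment
    clip_len:  seconds covered by one feature clip
    ctx_l:     number of clips kept after truncation (bounded by max_v_l)
    """
    # Assumed conversion from seconds to clip indices.
    gt_st = int(gt_window[0] / clip_len)
    gt_ed = max(0, min(int(gt_window[1] / clip_len), ctx_l) - 1)
    if gt_st > gt_ed:
        gt_st = gt_ed
    # Everything before gt_st and after gt_ed is a negative candidate.
    return list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))


# Moment from 0 s to 14 s, features truncated to 75 clips (about 12.5 s).
truncated_pool = build_neg_pool((0.0, 14.0), clip_len=0.1666, ctx_l=75)

# Same moment with 180 clips kept (about 30 s of video).
full_pool = build_neg_pool((0.0, 14.0), clip_len=0.1666, ctx_l=180)

print(len(truncated_pool), len(full_pool))
```

Under these assumptions the truncated case yields an empty pool, while keeping more clips leaves candidates after the moment ends.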

wjun0830 commented 1 year ago

I think we set it to -1.

RobertLuo1 commented 1 year ago

Thanks a lot! Does -1 mean using the full length of the video features?

wjun0830 commented 1 year ago

By setting it to -1, it will retrieve all clips except the last one. You have to add a line of code in the get video by vid function!
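
A small sketch of the behavior described here, assuming the features are truncated with a plain slice such as `feat[:max_v_l]`. With -1, Python slicing drops the last element instead of keeping everything. The thread does not show the actual line that was added, so `truncate_clips` below is a hypothetical guard, not the repository's code.

```python
def truncate_clips(clips, max_v_l):
    """Hypothetical guard: treat max_v_l == -1 as "no limit"."""
    if max_v_l == -1:
        return clips
    return clips[:max_v_l]


clips = list(range(180))  # stand-in for 180 per-clip feature vectors

plain_slice = clips[:-1]               # a bare slice with -1 drops the last clip
guarded = truncate_clips(clips, -1)    # the guard keeps every clip
capped = truncate_clips(clips, 75)     # a positive limit still truncates

print(len(plain_slice), len(guarded), len(capped))
```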

RobertLuo1 commented 1 year ago

Thanks a lot!

RobertLuo1 commented 1 year ago

Sorry to bother you again. When I reproduce the QVHighlights results with the audio setting on the validation set, the results get worse instead of better. The same happens when I train on Charades-STA with audio. Could you share the parameters for these settings?

wjun0830 commented 1 year ago

Hmm... I don't remember any special tuning or changes in the experimental settings. For QV, we used either of the scripts in the script directory, i.e., train_audio and train_audio_umt. For the validation set,

Metric	seed 0	seed 1	seed 2	seed 3	seed 2018	avg	std
MR-full-R1@0.5	63.1	61.94	62.58	61.81	63.48	62.582	0.7216093126
MR-full-R1@0.7	47.68	48.13	47.16	46.52	47.42	47.382	0.600433177
MR-full-mAP	42.33	41.87	41.86	40.54	41.95	41.71	0.6817257513
MR-full-mAP@0.5	63.06	62.09	63.46	62.16	62.57	62.668	0.5879370715
MR-full-mAP@0.75	42.89	42.63	42.09	41.19	41.73	42.106	0.683725091
MR-long-mAP	47.66	49.91	47.55	47.82	49.07	48.402	1.041090774
MR-middle-mAP	44.8	43.2	43.92	42.13	43.25	43.46	0.9858752457
MR-short-mAP	9.14	8.73	8.85	8.77	8.8	8.858	0.1636154027
HL-min-Fair-mAP	75.7	74.41	75.01	74.19	74.39	74.74	0.6181423784
HL-min-Fair-Hit1	77.74	74.84	76.58	74.71	75.48	75.87	1.280585803
HL-min-Good-mAP	64.54	63.54	63.94	63.3	63.42	63.748	0.5039047529
HL-min-Good-Hit1	75.61	73.03	74.65	73.23	73.61	74.026	1.083642007
HL-min-VeryGood-mAP	39.92	38.75	39.22	38.96	38.78	39.126	0.4816430213
HL-min-VeryGood-Hit1	64.97	61.29	63.94	62.32	62.13	62.93	1.490251657

which is a little bit lower than the reported video-only results. However, when submitted to CodaLab, the results were slightly better than the video-only checkpoint.
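
For anyone comparing their own runs against the table, the avg and std columns can be checked per row. The sketch below uses the MR-full-R1@0.5 row and assumes the std column is the sample standard deviation (dividing by n-1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# MR-full-R1@0.5 over seeds 0, 1, 2, 3 and 2018, copied from the table.
r1_at_05 = [63.1, 61.94, 62.58, 61.81, 63.48]

avg = mean(r1_at_05)
std = stdev(r1_at_05)  # sample standard deviation (n-1 in the denominator)

print(round(avg, 3), round(std, 4))
```

A single-seed result that falls within roughly one standard deviation of the average is hard to distinguish from seed noise.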

For Charades-STA, as there is only one record for VGG+audio in our saved sheet, I think we didn't tune any parameters other than those reported.

RobertLuo1 commented 1 year ago

Thanks a lot, I will check the code in detail.

wjun0830 / QD-DETR

Training on Charades-STA dataset with VGG backbone #29