wjun0830 / CGDETR

Official PyTorch repository for CG-DETR "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"
https://arxiv.org/abs/2311.08835

The performance seems to be sensitive to the clip_len? #15

Closed XiaohuJoshua closed 1 week ago

XiaohuJoshua commented 1 month ago

Thanks for the outstanding work. I've noticed an issue with video feature extraction using CLIP. When I use the standard clip length of 2, the results are consistent with the paper's findings. But if I reduce the clip length to 1, I get more features, scaling from a 75x512 matrix to a 150x512 one. The problem is that the loss seems to have trouble converging.
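For context on the scaling, here is a minimal sketch (assuming fixed-length, non-overlapping clips; `num_clip_features` is a hypothetical helper, not a function from this repo):

```python
import math

# Hypothetical helper: number of CLIP feature rows for a video, assuming
# the video is split into fixed-length, non-overlapping clips.
def num_clip_features(duration_s: float, clip_len_s: float) -> int:
    return math.ceil(duration_s / clip_len_s)

# A 150-second QVHighlights video:
print(num_clip_features(150, 2))  # 75  -> 75x512 feature matrix
print(num_clip_features(150, 1))  # 150 -> 150x512 feature matrix
```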

XiaohuJoshua commented 1 month ago

For example, training log with a clip len value of 2:

2024_07_30_11_57_56 [Epoch] 001 [Loss] loss_span 0.8944 loss_giou 0.7647 loss_label 0.6384 class_error 0.9245 loss_saliency 11.9342 loss_ms_align 9.2533 loss_distill 0.0589 loss_orthogonal_dummy 0.0943 loss_span_0 0.8776 loss_giou_0 0.7436 loss_label_0 0.6362 class_error_0 0.8424 loss_span_1 0.8546 loss_giou_1 0.7480 loss_label_1 0.6372 class_error_1 1.0623 loss_overall 28.1354
2024_07_30_12_02_01 [Epoch] 002 [Loss] loss_span 0.8315 loss_giou 0.7414 loss_label 0.6369 class_error 0.0649 loss_saliency 11.3591 loss_ms_align 8.4113 loss_distill 0.0876 loss_orthogonal_dummy 0.0453 loss_span_0 0.8048 loss_giou_0 0.7327 loss_label_0 0.6339 class_error_0 -0.0000 loss_span_1 0.8158 loss_giou_1 0.7347 loss_label_1 0.6337 class_error_1 -0.0000 loss_overall 26.4688
2024_07_30_12_06_00 [Epoch] 003 [Loss] loss_span 0.8210 loss_giou 0.7313 loss_label 0.6349 class_error 0.0075 loss_saliency 11.0065 loss_ms_align 8.4994 loss_distill 0.0798 loss_orthogonal_dummy 0.0399 loss_span_0 0.8000 loss_giou_0 0.7258 loss_label_0 0.6338 class_error_0 -0.0000 loss_span_1 0.8100 loss_giou_1 0.7303 loss_label_1 0.6340 class_error_1 -0.0000 loss_overall 26.1468
2024_07_30_12_10_10 [Epoch] 004 [Loss] loss_span 0.7951 loss_giou 0.7166 loss_label 0.6338 class_error -0.0000 loss_saliency 10.5451 loss_ms_align 8.2365 loss_distill 0.0642 loss_orthogonal_dummy 0.0369 loss_span_0 0.7839 loss_giou_0 0.7188 loss_label_0 0.6336 class_error_0 -0.0000 loss_span_1 0.7880 loss_giou_1 0.7170 loss_label_1 0.6338 class_error_1 -0.0000 loss_overall 25.3032

When setting it to 1:

2024_07_30_16_30_40 [Epoch] 001 [Loss] loss_span 0.9020 loss_giou 0.7654 loss_label 0.6385 class_error 0.7384 loss_saliency 11.9768 loss_ms_align 9.4873 loss_distill 0.0578 loss_orthogonal_dummy 0.0980 loss_span_0 0.8818 loss_giou_0 0.7437 loss_label_0 0.6361 class_error_0 0.6505 loss_span_1 0.8526 loss_giou_1 0.7503 loss_label_1 0.6357 class_error_1 0.6945 loss_overall 28.4258
2024_07_30_16_39_39 [Epoch] 002 [Loss] loss_span 0.8396 loss_giou 0.7426 loss_label 0.6370 class_error 0.0692 loss_saliency 11.4533 loss_ms_align 8.4421 loss_distill 0.0921 loss_orthogonal_dummy 0.0435 loss_span_0 0.8089 loss_giou_0 0.7346 loss_label_0 0.6347 class_error_0 -0.0000 loss_span_1 0.8245 loss_giou_1 0.7411 loss_label_1 0.6343 class_error_1 -0.0000 loss_overall 26.6282
2024_07_30_16_48_03 [Epoch] 003 [Loss] loss_span 0.8181 loss_giou 0.7312 loss_label 0.6349 class_error 0.0670 loss_saliency 11.3878 loss_ms_align 8.6753 loss_distill 0.0951 loss_orthogonal_dummy 0.0331 loss_span_0 0.7946 loss_giou_0 0.7246 loss_label_0 0.6345 class_error_0 -0.0000 loss_span_1 0.8072 loss_giou_1 0.7289 loss_label_1 0.6340 class_error_1 -0.0000 loss_overall 26.6993
2024_07_30_16_56_43 [Epoch] 004 [Loss] loss_span 0.7951 loss_giou 0.7230 loss_label 0.6340 class_error -0.0000 loss_saliency 11.3285 loss_ms_align 8.5202 loss_distill 0.0923 loss_orthogonal_dummy 0.0260 loss_span_0 0.7828 loss_giou_0 0.7205 loss_label_0 0.6338 class_error_0 -0.0000 loss_span_1 0.7896 loss_giou_1 0.7213 loss_label_1 0.6337 class_error_1 -0.0000 loss_overall 26.4007
wjun0830 commented 1 month ago

Does an error message come out? If not, can you share the performance? We remember that training on other data with clip_len 1 worked fine.

XiaohuJoshua commented 1 month ago

> Does an error message come out? If not, can you share the performance? We remember that training on other data with clip_len 1 worked fine.

Thanks for your reply. No error message is reported, and I found that I forgot to modify clip_len in the training BaseOptions. However, even after correcting it to 1, the performance still lags behind experiments where video features are extracted with clip_len 2 (e.g., 15.61 vs. 56.39 R1@0.5 at the 14th epoch).

Could there be other configurations I've missed? Have you experimented with different clip_len settings on the QVHighlights dataset?

wjun0830 commented 1 month ago

For now, the only thing I remember is that you may have to change the input parameters to PostProcessorDETR in inference.py! I am not sure whether anything else needs to be modified.

XiaohuJoshua commented 1 month ago

> For now, the only thing I remember is that you may have to change the input parameters to PostProcessorDETR in inference.py! I am not sure whether anything else needs to be modified.

I appreciate your reply. I found another training parameter that must be modified: max_v_l, since the video feature length is scaled from 75 to 150. However, the performance still lags behind the clip_len 2 experiments (26.58% vs. 56.39% R1@0.5 at the 14th epoch), and the training loss starts lower than in the clip_len 2 experiments but becomes higher as training proceeds.
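To make the dependency explicit, here is a hedged sketch (the option names `clip_length` and `max_v_l` are taken from this thread; the scaling rule is my assumption): options counted in clips must scale with the change in clip length, while options measured in seconds stay fixed.

```python
def scale_clip_options(opts: dict, old_clip_len: float, new_clip_len: float) -> dict:
    """Rescale clip-count options when the feature clip length changes."""
    factor = old_clip_len / new_clip_len  # 2 -> 1 gives a factor of 2
    out = dict(opts)
    out["clip_length"] = new_clip_len
    out["max_v_l"] = int(opts["max_v_l"] * factor)  # 75 clips -> 150 clips
    return out

base = {"clip_length": 2, "max_v_l": 75}
print(scale_clip_options(base, 2, 1))  # {'clip_length': 1, 'max_v_l': 150}
```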

I checked the input parameters for PostProcessorDETR during validation:

```python
post_processor = PostProcessorDETR(
    clip_length=opt.clip_length, min_ts_val=0, max_ts_val=150,
    min_w_l=2, max_w_l=150, move_window_method="left",
    process_func_names=("clip_ts", "round_multiple")
)
```

Since clip_length is aligned with the training option (1) and the maximum video duration is 150 s, is there any other parameter that needs to be changed?
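One thing that does change implicitly is the rounding granularity: "round_multiple" snaps predicted spans to multiples of the clip length, so clip_len 1 rounds to 1 s instead of 2 s. A minimal illustration (a hypothetical helper, not the actual PostProcessorDETR logic):

```python
def round_span_to_clip(start_s: float, end_s: float, clip_len: float):
    """Snap a predicted (start, end) span to multiples of the clip length."""
    snap = lambda x: round(x / clip_len) * clip_len
    return snap(start_s), snap(end_s)

print(round_span_to_clip(10.7, 35.2, 2))  # (10, 36)
print(round_span_to_clip(10.7, 35.2, 1))  # (11, 35)
```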

XiaohuJoshua commented 1 month ago

In the end, I found that clip_len 1 training can reach reasonable performance, but with a notable drop compared to the clip_len 2 experiments (4.25% in R1@0.5). Interestingly, when testing a clip_len 2 trained model on clip_len 1 validation features, the drop is a bit smaller (0.96% in R1@0.5).

awkrail commented 1 month ago

@XiaohuJoshua @wjun0830 Hi, I also found that clip_len has an impact on performance across multiple models: Moment-DETR, QD-DETR, EaTR, UVCOM, and CG-DETR. When conducting experiments on Charades-STA, please be careful when changing clip_len from 2 to 1.

Besides, I have released Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). It integrates training/evaluation code into a single repository, supporting six methods, five datasets, and three feature types, and I have released all of the trained weights and feature files.

I would be glad if both of you, as MR-HD researchers, used my library in your research. For details, please read our paper. Thanks.

wjun0830 commented 3 weeks ago

@awkrail Thank you!

XiaohuJoshua commented 1 week ago

@awkrail Thank you!