whwu95 / Cap4Video

【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
https://arxiv.org/abs/2301.00184
MIT License

Low R1 performance in the 2nd stage #9

Open chenhao2345 opened 1 year ago

chenhao2345 commented 1 year ago

Thanks for sharing your code. Is it normal to get R1=30 with train_titles.py? After running the score fusion, the title matrix does not improve the video matrix.

BishmoyPaul commented 1 year ago

I am having the same issue. Incidentally, did you get the following error in the score-fusion step? I was running it on MSVD.

Text-to-Video:
>>>  R@1: 30.4 - R@5: 59.7 - R@10: 70.7 - Median R: 3.0 - Mean R: 19.8
Video-to-Text:
>>>  V2T$R@1: 33.1 - V2T$R@5: 60.1 - V2T$R@10: 72.6 - V2T$Median R: 3.0 - V2T$Mean R: 18.4
video_matrix sim matrix size: (27763, 670), (27763, 670)
titles_shot_matrix sim matrix size: (27763, 670), (27763, 670)
Traceback (most recent call last):
  File "/local/Cap4Video/train_titles.py", line 723, in <module>
    fusion_scores()
  File "/local/Cap4Video/sim_matrix/fusion_scores.py", line 13, in fusion_scores
    tv_video_metrics = compute_metrics(video_matrix)
  File "/local/Cap4Video/metrics.py", line 13, in compute_metrics
    ind = sx - d
ValueError: operands could not be broadcast together with shapes (27763,670) (670,1)

chenhao2345 commented 1 year ago

@BishmoyPaul I'm running it on MSRVTT. I have not seen any problems with the fusion score on MSRVTT.
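
One possible explanation, offered as a guess from the traceback rather than from reading metrics.py: `ind = sx - d` looks like the usual square-matrix recall computation, where `d` is the diagonal of the similarity matrix. `np.diag` on a (27763, 670) matrix returns only 670 values, which matches the (670,1) shape in the error. A square test set (one caption per video) would not hit this, which may be why MSRVTT runs cleanly. Below is a minimal sketch of a rank computation that accepts a rectangular text-by-video matrix. The names `text_to_video_ranks`, `recall_metrics`, and `gt_video_idx` are hypothetical and are not part of the Cap4Video code.

```python
import numpy as np


def text_to_video_ranks(sim, gt_video_idx):
    """Rank of the correct video for every text query.

    sim:          (num_texts, num_videos) similarity matrix.
    gt_video_idx: gt_video_idx[i] is the column of the video that text i
                  describes (hypothetical input, built from the dataset).
    Returns 0-based ranks; 0 means the correct video scored highest.
    """
    sim = np.asarray(sim, dtype=np.float64)
    gt = np.asarray(gt_video_idx)
    # Score of the ground-truth video for each text, as a column vector.
    gt_scores = sim[np.arange(sim.shape[0]), gt][:, np.newaxis]
    # Rank = number of videos that score strictly higher than the truth.
    return (sim > gt_scores).sum(axis=1)


def recall_metrics(ranks):
    """Recall@K, median rank and mean rank from 0-based ranks."""
    ranks = np.asarray(ranks)
    return {
        "R@1": 100.0 * float(np.mean(ranks == 0)),
        "R@5": 100.0 * float(np.mean(ranks < 5)),
        "R@10": 100.0 * float(np.mean(ranks < 10)),
        "MedianR": float(np.median(ranks)) + 1.0,
        "MeanR": float(np.mean(ranks)) + 1.0,
    }


if __name__ == "__main__":
    # Toy case: 5 captions, 3 videos, several captions per video.
    toy_sim = np.array([
        [0.9, 0.1, 0.2],
        [0.3, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.9],
        [0.5, 0.4, 0.3],
    ])
    toy_gt = [0, 0, 1, 2, 2]
    print(recall_metrics(text_to_video_ranks(toy_sim, toy_gt)))
```

This only covers the text-to-video direction. Video-to-text with several captions per video needs its own convention (for example, taking the best rank among a video's captions), so treat the sketch as a starting point for debugging, not as a drop-in replacement.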

JosephPai commented 1 year ago

Same problem here. @whwu95 I got Rank-1 47.7 in the first stage (train_video.py) and Rank-1 around 30 in the second stage (train_titles.py).

JosephPai commented 1 year ago

BTW, do you know the purpose of fusion_scores? @chenhao2345

chenhao2345 commented 1 year ago

@JosephPai I got similar performance. ~47.5 in stage 1 and 30 in stage 2.

It seems to me that the authors get two similarity scores, one from stage 1 and one from stage 2. Then they use fusion_scores to fuse the two similarity scores.
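
To make that concrete, here is a minimal sketch of score fusion as a weighted sum of two similarity matrices, followed by R@1 on the result. It assumes square matrices whose diagonal holds the correct pairs. The function names and the `weight` parameter are hypothetical, and the actual fusion_scores.py may combine the matrices differently.

```python
import numpy as np


def recall_at_1(sim):
    """R@1 (in percent) for a square matrix where sim[i, i] is the correct pair."""
    sim = np.asarray(sim, dtype=np.float64)
    top1 = np.argmax(sim, axis=1)
    return 100.0 * float(np.mean(top1 == np.arange(sim.shape[0])))


def fuse_scores(video_sim, title_sim, weight=0.5):
    """Weighted sum of two similarity matrices.

    `weight` is a hypothetical hyperparameter controlling how much the
    stage 2 (title) scores contribute relative to stage 1 (video).
    """
    video_sim = np.asarray(video_sim, dtype=np.float64)
    title_sim = np.asarray(title_sim, dtype=np.float64)
    if video_sim.shape != title_sim.shape:
        raise ValueError("similarity matrices must have the same shape")
    return video_sim + weight * title_sim


if __name__ == "__main__":
    # Toy 2x2 example: the video branch gets query 0 wrong,
    # the title branch gets it right.
    video_sim = np.array([[0.5, 0.6], [0.2, 0.9]])
    title_sim = np.array([[0.9, 0.1], [0.3, 0.4]])
    for w in (0.0, 0.25, 0.5, 1.0):
        fused = fuse_scores(video_sim, title_sim, weight=w)
        print(f"weight={w}: R@1={recall_at_1(fused):.1f}")
```

Sweeping `weight` on saved matrices like this is a cheap way to check whether the stage 2 scores can help at all before retraining anything.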

ASENNIU commented 10 months ago

I got R@1 45.3 in stage 1 and 29.6 in stage 2. It seems that the code only does global matching?

shams2023 commented 10 months ago

I got R@1 45.3 in stage 1 and 29.6 in stage 2. It seems that the code only does global matching?

I think that's true.

ASENNIU commented 10 months ago

Thanks for sharing your code. How can I get an R@1 of 49?

zef1611 commented 10 months ago

@chenhao2345 @JosephPai @ASENNIU @BishmoyPaul Hi, could you share your batch size and the number of GPUs you used for training stage 1 and stage 2?

fazlicodes commented 9 months ago

@zef1611 Did you find out the batch size, number of GPUs, and GPU type used in this project? Can anyone please answer this? @chenhao2345 @JosephPai @ASENNIU @BishmoyPaul

fazlicodes commented 8 months ago

@BishmoyPaul How did you train on the MSVD dataset? If you were using the co_train_msrvtt.sh script, what did you pass for --data_path? Could you share the training script?