whwu95 / Cap4Video

【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
https://arxiv.org/abs/2301.00184
MIT License

Question about implementation details. #15

Open sweet132 opened 1 year ago

sweet132 commented 1 year ago

Hello, I admit this is good work. However, in the code you set batch_size=256, but the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSR-VTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

shams2023 commented 1 year ago

> Hello, I admit this is good work. However, in the code you set batch_size=256, but the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSR-VTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

Hello, what do I need to prepare to run this project? I have been trying for several days but still can't get it running. How can I run this project?

Tiiivoo commented 1 year ago

Hello, may I ask what version of PyTorch you are using? Have you encountered any issues when using batch_first=True?

sweet132 commented 1 year ago

I just downloaded the code and data. It looks like 8 GPUs with a batch_size of 256 are essential for reproducing the results. @shams2023

sweet132 commented 1 year ago

The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
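
As a side note on the batch_first question: nn.MultiheadAttention gained the batch_first flag in PyTorch 1.9, so torch 1.11.0 supports it. A minimal sanity check (generic PyTorch, not the repo's code):

```python
import torch
import torch.nn as nn

# batch_first=True expects (batch, seq_len, embed_dim) inputs; available since torch 1.9
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(4, 12, 512)
out, weights = attn(x, x, x)
print(out.shape)  # torch.Size([4, 12, 512])
```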

shams2023 commented 1 year ago

> The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6.

Thank you for your answer! The author mentioned in the paper that the interaction module is a co-attention transformer; in which part of the code is it implemented?

shallowdream66 commented 1 year ago

@sweet132 Have you noticed how much GPU memory is used when batch_size=128? Even when I turn down batch_size_val, I still get the error "CUDA out of memory when evaluating. Testing model at the end!".

sweet132 commented 1 year ago

The modeling section is in modeling.py, where you can find what you're looking for. @shams2023
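
For readers hunting for the co-attention interaction: a minimal sketch of a co-attention block in PyTorch, with hypothetical names, not the repo's exact modeling.py code:

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Minimal co-attention sketch: video tokens attend to caption tokens and vice versa."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.c2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_c = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, caption: torch.Tensor):
        # video queries attend over caption keys/values, and the reverse
        v, _ = self.v2c(video, caption, caption)
        c, _ = self.c2v(caption, video, video)
        return self.norm_v(video + v), self.norm_c(caption + c)
```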

sweet132 commented 1 year ago

If you have 8 GPUs with batch_size=256, each GPU uses around 20GB of memory. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66

shallowdream66 commented 1 year ago

> If you have 8 GPUs with batch_size=256, each GPU uses around 20GB of memory. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66

I am also very confused. Compared to CLIP4Clip, it takes up much more memory and training time.
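
If the OOM happens only at evaluation, one thing worth checking (a generic PyTorch sketch, not the repo's code) is that the eval loop runs under torch.no_grad() and that peak memory is actually measured:

```python
import torch

@torch.no_grad()                      # no activations kept for backward => much less memory
def evaluate(model, loader, device="cuda"):
    model.eval()
    torch.cuda.reset_peak_memory_stats(device)
    for batch in loader:              # `loader` yields input tensors (stand-in)
        _ = model(batch.to(device))
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"peak eval memory: {peak_gb:.1f} GB")
```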

Tiiivoo commented 12 months ago

> Hello, I admit this is good work. However, in the code you set batch_size=256, but the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSR-VTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

> The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo

Hello, regarding the 'msrvtt_train_with_vitb32_max1_title_titles.json' file, I don't understand where the 'titles' data comes from; the MSR-VTT dataset doesn't seem to include it. If the 'titles' were obtained through web crawling, why are there 30 of them?

whwu95 commented 12 months ago

> Hello, I admit this is good work. However, in the code you set batch_size=256, but the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSR-VTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

I'm glad to hear that you've successfully reproduced our results. Regarding the batch size issue, we apologize for any confusion, and it may indeed be an oversight in the paper. Please consider our code as the practical reference.

sweet132 commented 12 months ago

Thank you for your reply. Although I achieved results similar to the paper on MSR-VTT, I got poor results on MSVD (46.1), where I trained directly on the raw data, while for the VATEX dataset (62.0) I used the extracted frames you uploaded. I'm not sure why that is. @whwu95

sweet132 commented 12 months ago

Hello, I suggest you refer to the paper; the titles are generated by a model (GPT-2 or CLIP). @Tiiivoo
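
This is not the authors' exact captioning pipeline, but as an illustration of the CLIP half of such a setup, here is a sketch that ranks candidate titles against a video keyframe with Hugging Face's CLIP (the file name and candidate titles are hypothetical):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame0.jpg")                 # hypothetical keyframe from a video
candidates = ["a man cooking pasta", "a soccer match", "a cat sleeping"]

inputs = processor(text=candidates, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
scores = out.logits_per_image.softmax(dim=-1)[0]  # similarity of each title to the frame
print(candidates[scores.argmax()], scores.tolist())
```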

shams2023 commented 11 months ago

> The modeling section is in modeling.py, where you can find what you're looking for. @shams2023

How can this be trained on a single 3090 card?
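
There is no official single-GPU recipe in this thread, but a common workaround for limited memory is gradient accumulation to emulate a large effective batch. A self-contained sketch with stand-in model and data:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                          # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [torch.randn(32, 512) for _ in range(16)]   # stand-in mini-batches of 32

accum_steps = 8                                      # 32 x 8 = effective batch of 256
optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = model(batch).pow(2).mean() / accum_steps  # scale so gradients match one big batch
    loss.backward()                                  # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that for the in-batch contrastive loss used in retrieval, accumulation does not enlarge the negative pool, so results may still fall short of true large-batch training.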

fazliimam commented 11 months ago

+1

shams2023 commented 11 months ago

> Hello, I admit this is good work. However, in the code you set batch_size=256, but the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSR-VTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

I want to ask for some help! In train_video.py, the first 5 epochs are used to train the video-query branch, so why is the caption value computed in the model's forward pass?

As shown in the following figure: [image]

Aren't the first 5 epochs supposed to train only the text encoder (i.e., the query encoder)? If the caption is included at this point, doesn't that mean the caption encoder has also been trained? I am confused about this part and hope you can help. Thanks again, and sorry for taking up your time!
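
One plausible resolution of this confusion (a hedged sketch with hypothetical attribute names, not necessarily the repo's actual mechanism): a branch can be run in the forward pass while its parameters are frozen, so caption features are computed but the caption encoder receives no gradient updates during the first epochs:

```python
import torch.nn as nn

def set_stage(model: nn.Module, epoch: int, warmup_epochs: int = 5):
    """Freeze the caption encoder during the first warmup epochs (hypothetical names)."""
    train_captions = epoch >= warmup_epochs
    for p in model.caption_encoder.parameters():
        p.requires_grad = train_captions  # forward still computes caption features either way
```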

shams2023 commented 11 months ago

> Hello, I admit this is good work. However, in the code you set batch_size=256, but the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSR-VTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

Can you send this out for me to refer to? I don't know how to define the storage locations of the variables here. As shown in the following figure (it could also be co_train_msrvtt.sh):

[image]