sweet132 opened this issue 1 year ago
Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I got it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA is comparable to the paper, but with batch_size=128 it is only about 47%.
Hello, what do I need to prepare to run this project? I've been trying for several days since downloading it, but I still can't get it to run successfully. How can I run this project?
Hello, may I ask what version of PyTorch you are using? Have you encountered any issues when using batch_first=True?
I just downloaded the code and data. It looks like 8 GPUs with a batch size of 256 are essential for reproducing the project. @shams2023
The code is based on CLIP4Clip. The torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
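For anyone who can't get the project running, a quick sanity check (plain PyTorch calls, nothing project-specific) to confirm the environment matches these versions before launching training:

```python
import torch

print(torch.__version__)          # expected: 1.11.0
print(torch.version.cuda)         # expected: 11.6
print(torch.cuda.is_available())  # should be True on a GPU machine
print(torch.cuda.device_count())  # the setting discussed in this thread uses 8 GPUs
```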
Thank you for your answer! The author mentioned in the paper that the interaction module used is a co-attention transformer; in which part of the code is it implemented?
@sweet132 Did you happen to notice how much GPU memory it uses when batch_size=128? Even when I turn down batch_size_val, I still get the "CUDA out of memory when evaluating. Testing model at the end!" error.
The modeling code is in modeling.py, where you can find what you want. @shams2023
If you use 8 GPUs with batch_size=256, each GPU uses around 20GB of memory. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66
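If lowering batch_size_val alone doesn't resolve the evaluation OOM, one common workaround is to score the similarity matrix in chunks under torch.no_grad(). A minimal sketch (the function and tensor names here are illustrative, not the repo's actual API):

```python
import torch

@torch.no_grad()  # no gradients are needed at evaluation time
def chunked_similarity(text_feats, video_feats, chunk_size=128):
    """Compute the text-video similarity matrix in chunks to limit peak GPU memory.

    text_feats:  (num_texts, dim) tensor, assumed L2-normalized
    video_feats: (num_videos, dim) tensor, assumed L2-normalized
    """
    sims = []
    for start in range(0, text_feats.size(0), chunk_size):
        chunk = text_feats[start:start + chunk_size]
        # one (chunk_size, num_videos) block of the full similarity matrix
        sims.append(chunk @ video_feats.t())
    return torch.cat(sims, dim=0)
```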
I am also very confused. Compared to CLIP4Clip, it takes up much more memory and training time.
Hello, regarding the 'msrvtt_train_with_vitb32_max1_title_titles.json' file, I don't understand where the 'titles' data comes from. It seems the MSR-VTT dataset doesn't have this part. If the 'titles' section was obtained through web crawling, why are there 30 of them?
I'm glad to hear that you've successfully reproduced our results. Regarding the batch size issue, we apologize for any confusion, and it may indeed be an oversight in the paper. Please consider our code as the practical reference.
Thank you for your reply. Although I achieved results similar to the paper on MSRVTT, I got poor results on MSVD (46.1), where I trained directly on the raw data, while for the VATEX (62.0) dataset I used the extracted frames you uploaded. I'm not sure why that is. @whwu95
Hello, I suggest you refer to the paper; the titles are generated by a model (GPT-2 or CLIP). @Tiiivoo
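If you want to see how those titles are stored on disk, you can inspect the annotation file directly. A minimal sketch (it only assumes the file is plain JSON in your data directory; the exact per-entry layout isn't documented in this thread):

```python
import json

# Adjust the path to wherever the annotation file lives in your setup.
with open("msrvtt_train_with_vitb32_max1_title_titles.json", "r") as f:
    data = json.load(f)

# Print one entry to see how the generated titles are attached to each video;
# handle both a top-level list and a top-level dict, since the layout may differ.
sample = data[0] if isinstance(data, list) else next(iter(data.items()))
print(sample)
```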
How can this be done on a single 3090?
+1
I'd like to ask for some help. In train_video.py, the first 5 epochs are used to train the video-query branch, so why do we compute the caption values in the model's forward pass?
As shown in the following figure:
For text, aren't the first 5 epochs only meant to train the text encoder (i.e., the query encoder)? If the caption is included at this point, doesn't that mean the caption encoder is also being trained? I am confused about this part and hope you can help. Thank you again, and sorry for taking up your time!
Could you share this for me to refer to? I don't know how to define the storage locations of the variables here. As shown in the following figure (it could also be co_train_msrvtt.sh):