jianghaojun (issue closed 1 year ago)
Due to the huge size of the original dataset, I extracted frames from the original videos at FPS=1 and trained CLIP4Clip (meanP) on 8 RTX 3090 GPUs. Due to GPU memory constraints, I set gradient_accumulation_steps=2.
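For reference, FPS=1 frame extraction is commonly done with ffmpeg's `fps` filter. This is a minimal sketch of how I invoke it; the helper name and output layout are mine, not from the CLIP4Clip repo:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video_path, out_dir, fps=1):
    """Build an ffmpeg command that dumps one frame per second as JPEGs.
    (Sketch only; adjust quality/scaling flags to your setup.)"""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return [
        "ffmpeg", "-i", str(video_path),
        "-vf", f"fps={fps}",   # sample at the given frame rate
        "-q:v", "2",           # high JPEG quality
        str(Path(out_dir) / "frame_%06d.jpg"),
    ]

# To actually extract:
# subprocess.run(build_ffmpeg_cmd("video.mp4", "frames/video"), check=True)
```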
The captions were downloaded from https://cs.stanford.edu/people/ranjaykrishna/densevid/.
I first tried to reproduce the results of CLIP4Clip (meanP, ViT-B/32) on ActivityNet and got R@1=37.9, which is much worse than the 40.5 reported in Table 5.
Do the authors have any useful experience with this issue? Thanks very much!
We strongly recommend using raw video for training.
If pre-extracted images (frames) are used for training, the set of frames sampled from each video is fixed across training, which may cause performance degradation compared with sampling from the raw video.
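To illustrate the point about fixed sampling: with frames pre-extracted at FPS=1, a typical uniform sampler picks the same indices from the same frame pool every epoch, whereas decoding from raw video leaves a much denser pool to draw from. A minimal sketch (function name is mine, not the actual CLIP4Clip dataloader):

```python
import numpy as np

def sample_frame_indices(num_available, max_frames=64):
    """Uniformly pick max_frames indices from the available frames.
    With FPS=1 pre-extraction, num_available equals the video length in
    seconds, and the chosen indices are identical every epoch."""
    if num_available <= max_frames:
        return np.arange(num_available)
    return np.linspace(0, num_available - 1, max_frames).round().astype(int)

# A 300-second video pre-extracted at FPS=1 yields 300 frames;
# the same 64 indices are selected on every pass:
idx = sample_frame_indices(300, 64)
```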
Still unable to reproduce the result after 5 epochs. I will try increasing the number of training epochs.
@xmu-xiaoma666 I tried to reproduce the LSMDC results from the paper, but after 5 training epochs the MeanR is about 200 and R@1 is about 13.0. Part of the training log is shown below: (training log image failed to upload) I don't understand the huge discrepancy between the paper and my experiments. I hope to get some helpful advice. Thanks!
@JingXiaolun The training log picture cannot be found.