jianghaojun (issue closed 1 year ago)
Due to the huge size of the original dataset, I extracted frames from the original videos at FPS=1 and trained CLIP4Clip (meanP) on 8 RTX 3090 GPUs. Due to GPU memory constraints, I set gradient_accumulation_steps=2.
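For reference, FPS=1 frame extraction is commonly done with ffmpeg's `fps` filter. This is a minimal sketch of how I invoke it; the helper name and output layout are mine, not from the CLIP4Clip repo:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video_path, out_dir, fps=1):
    """Build an ffmpeg command that dumps one frame per second as JPEGs.
    (Sketch only; adjust quality/scaling flags to your setup.)"""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return [
        "ffmpeg", "-i", str(video_path),
        "-vf", f"fps={fps}",   # sample at the given frame rate
        "-q:v", "2",           # high JPEG quality
        str(Path(out_dir) / "frame_%06d.jpg"),
    ]

# To actually extract:
# subprocess.run(build_ffmpeg_cmd("video.mp4", "frames/video"), check=True)
```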
The captions were downloaded from https://cs.stanford.edu/people/ranjaykrishna/densevid/.
I first tried to reproduce the results of CLIP4Clip (meanP, ViT-B/32) on ActivityNet and got R@1=37.9, which is much worse than the 40.5 reported in Table 5.
Do the authors have any useful experience with this issue? Thanks very much!
We strongly recommend using raw video for training.
If pre-extracted images (frames) are used for training, the set of frames sampled from each video is fixed across training, which may cause performance degradation compared with sampling from the raw video.
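To illustrate the point about fixed sampling: with frames pre-extracted at FPS=1, a typical uniform sampler picks the same indices from the same frame pool every epoch, whereas decoding from raw video leaves a much denser pool to draw from. A minimal sketch (function name is mine, not the actual CLIP4Clip dataloader):

```python
import numpy as np

def sample_frame_indices(num_available, max_frames=64):
    """Uniformly pick max_frames indices from the available frames.
    With FPS=1 pre-extraction, num_available equals the video length in
    seconds, and the chosen indices are identical every epoch."""
    if num_available <= max_frames:
        return np.arange(num_available)
    return np.linspace(0, num_available - 1, max_frames).round().astype(int)

# A 300-second video pre-extracted at FPS=1 yields 300 frames;
# the same 64 indices are selected on every pass:
idx = sample_frame_indices(300, 64)
```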
Still unable to reproduce the result after 5 epochs. I will try increasing the number of training epochs.
@xmu-xiaoma666 I tried to reproduce the LSMDC results from the paper, but after 5 training epochs the MeanR is about 200 and R@1 is about 13.0. Part of the training log is shown below: (training log image failed to upload) I don't understand the huge discrepancy between the paper and my experiments. I hope to get some helpful advice. Thanks!
@JingXiaolun The training log picture cannot be found.