Hi yixuan,
I ran into the same issue: I cannot reproduce the results with the same batch size.
But I am curious how fast training is on your machine. Have you tried the pretraining code (not the VQA fine-tuning)? I found the pretraining very slow.
Thanks, Yongfei
Sorry for the late response. I will answer your questions one by one below.
a) Yes, your understanding is correct. 3072 refers to the total number of tokens per batch, not the number of examples. Sorry for the confusion.
b) You should keep num_tokens × num_GPUs × num_Grad_Accu the same. In this case, if we run the code with 3072 tokens, 8 GPUs, and grad. accu. 4, then when changing to 1024 tokens on the same 8 GPUs, your grad. accu. should be 12. This will help you reproduce the results (see the first sketch below).
c) The released config json file cannot exactly reproduce the best VQA results in our paper. In experiments, we observed that setting "conf_th" to a smaller number (it controls how many bounding boxes we keep per image; see the second sketch below) results in better performance. However, the provided default image features were extracted with "conf_th" = 0.2, which means we need to host a new set of image features with a smaller "conf_th". We will try to host this after the New Year holiday.
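For b), the arithmetic can be written down directly; a minimal sketch (function and variable names are mine, not from the VILLA codebase):

```python
def required_grad_accu(ref_tokens, ref_gpus, ref_accu, tokens, gpus):
    """Grad. accumulation steps needed so that
    tokens x gpus x accu matches the reference setting."""
    effective = ref_tokens * ref_gpus * ref_accu
    accu, rem = divmod(effective, tokens * gpus)
    assert rem == 0, "effective token budget must divide evenly"
    return accu

# Reference setting: 3072 tokens x 8 GPUs x grad. accu. 4.
# With 1024 tokens on the same 8 GPUs:
print(required_grad_accu(3072, 8, 4, 1024, 8))  # -> 12
```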
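And for c), "conf_th" is just a confidence cutoff on the detector's region proposals, clamped between a minimum and maximum number of boxes. A rough illustration, not the actual feature-extraction code (the min_bb/max_bb values here are only the common defaults):

```python
import numpy as np

def keep_regions(region_feats, confidences, conf_th=0.2, min_bb=10, max_bb=100):
    """Keep region features whose detection confidence exceeds conf_th,
    clamped to [min_bb, max_bb] boxes per image."""
    order = np.argsort(-confidences)               # highest confidence first
    num_keep = int((confidences[order] > conf_th).sum())
    num_keep = max(min_bb, min(num_keep, max_bb))  # clamp the box count
    return region_feats[order[:num_keep]]

# Lowering conf_th (e.g. 0.2 -> 0.075) keeps more boxes per image.
```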
Adv. Lr. is not a very sensitive hyper-parameter. You can try different values, but generally, it will result in similar performance. I will try to find the exact config file to reproduce our best results.
Hope it helps. Thanks for your interest in our code.
Best, Zhe
@youngfly11, sorry to hear that your pre-training is very slow. Typically, it is about 2x slower than standard UNITER pre-training. However, we did not measure this precisely, as we ran the code on our GPU clusters. Did you try standard UNITER pre-training? Is it also slow? We did not find the pre-training slow in our experiments.
As for exactly reproducing the best VQA results in our paper, we will try to provide the needed files after the holiday.
Best, Zhe
> But I am curious how fast training is on your machine. Have you tried the pretraining code (not the VQA fine-tuning)? I found the pretraining very slow.
Hi @youngfly11,
I haven't tried the pre-training stage, but maybe some fine-tuning speed info can help. I fine-tuned on 4 V100s (16 GB) using train-vqa-large-8gpu-adv.json, which took about 20 hours. Besides, a painful and sad truth is that I still haven't found the config that reproduces the best results in the paper.
Hi, thanks for the advice @zhegan27.
Sorry to interrupt your holiday, I don't mean to :-) I just want to know: in your paper you use batch size 3072, grad. accu. 5, and 5000 training steps for the VQA task. Is this setting for a single Titan RTX GPU, or maybe 8 machines? I found grad. accu. to be a sensitive hyper-parameter: I tried [12, 16, 24] and got very different performance. Maybe it needs a different scale depending on the number of machines?
@youngfly11 @yixuan-qiao, fine-tuning UNITER-large with adversarial training on the VQA dataset taking 20 hours is reasonable, as adversarial training itself is heavier than standard training.
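Roughly, the adversarial training loop runs several extra forward/backward passes per update to construct the embedding perturbation, hence the slowdown. A simplified FreeLB-style sketch (the model.embed / model.forward_from_embeds interface here is illustrative, not the actual VILLA code):

```python
import torch

def adv_step(model, batch, adv_steps=3, adv_lr=1e-3, eps=1e-2):
    """Each update costs (1 + adv_steps) forward/backward passes,
    hence the roughly 2x (or more) cost over standard training."""
    delta = None
    for step in range(adv_steps + 1):
        embeds = model.embed(batch)  # hypothetical: clean input embeddings
        if delta is None:
            delta = torch.zeros_like(embeds, requires_grad=True)
        loss = model.forward_from_embeds(embeds + delta, batch)
        loss.backward()              # accumulates grads for params and delta
        if step < adv_steps:
            with torch.no_grad():
                g = delta.grad
                delta += adv_lr * g / (g.norm() + 1e-12)  # ascend on the loss
                delta.clamp_(-eps, eps)                   # keep perturbation small
            delta.grad.zero_()
```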
@yixuan-qiao, sorry that you have not been able to reproduce our best VQA results. I am happy to help you with this.
Back then, I ran many experiments under different settings, such as with 4, 8, or 16 GPUs. To partially address your concern, I have dug out the config file we used to obtain the best results in our paper (test-dev/std: 74.69/74.87), which corresponds to 72.92 accuracy on our internal dev set. The config and log files are provided in the ./reproducibility-vqa folder. In the config file, "conf_th" is 0.075, which the currently provided image features do not support, as most of our experiments used "conf_th" = 0.2. I will try to provide these features.
Q: Is this parameter setting for a single Titan RTX GPU or maybe 8 machines?
A: It corresponds to 8 machines, as far as I remember. I will rerun the code myself to double-check, but it is definitely not for a single Titan RTX.
Q: I found grad. accu. to be a sensitive hyper-parameter; I tried [12, 16, 24] and got very different performance. Maybe it needs a different scale because of the number of machines?
A: This is true. If you have fewer machines, please try larger grad. accu. steps. In general, keep num_tokens × num_GPUs × num_Grad_Accu the same in order to obtain similar results.
I will come back to this when I have more free time. For other experiments such as VCR, the results should be much easier to reproduce. Thanks and Happy New Year.
Best, Zhe
Thanks a lot for your patient reply, @zhegan27. I will try to extract image features with conf_th 0.075 first; it would be great if you have time to share them :-). In your new hps.json, some parameters are set to values different from my experiments, especially the learning-rate decay schedule and the number of training steps. You probably use the vqa_schedule from MCAN; I will also try it (a rough sketch of what I have in mind is below).
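My mental model of such a schedule is linear warmup followed by step decay; all the numbers below are placeholders I made up, not values from hps.json or MCAN:

```python
def lr_at(step, base_lr=8e-5, warmup=600, decay_at=(4000, 4500), gamma=0.2):
    """Linear warmup, then multiply the LR by gamma at each decay boundary.
    All values here are illustrative placeholders."""
    if step < warmup:
        return base_lr * step / warmup
    scale = 1.0
    for boundary in decay_at:
        if step >= boundary:
            scale *= gamma
    return base_lr * scale
```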
Thanks a lot. Let's keep in touch. Looking forward to your update. :-) Happy New Year!!!
@yixuan-qiao, the image features and config files that can be used to reproduce our best VQA results were added to the repo 2 days ago. Please take a look, and let us know if you have any further questions. Thank you.
Best, Zhe
@zhegan27, thanks for that! I can now reproduce the single best-performing model. Many thanks!!! :-)
Hi, thanks for your excellent work. I am not sure whether the batch size in your paper is the same as in the code. In the code, 3072 refers to total tokens, which corresponds to roughly 32 real examples per iteration (my understanding of this is sketched at the end of this question).
a) Maybe 32 (real batch size) × 8 (grad. accu.) is the dominant factor?
b) Our V100 machines (16 GB) cannot process 3072 tokens, so maybe 1024 tokens (about 8 real examples), 8 GPUs, and grad. accu. 4 is another workable plan?
c) Besides, can the released train-vqa-large-8gpu-adv.json reproduce the paper results? Some parameters seem to be set differently from the paper (e.g., Adv. Lr.).
We deeply hope to reproduce your best results with our limited resources. Thanks a lot.
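To double-check my understanding of the token-based batching, here is how I picture it; a toy sketch, not the repo's actual TokenBucketSampler:

```python
def token_batches(example_lengths, batch_tokens=3072):
    """Greedily pack examples until the token budget is reached
    (toy version; the real sampler also buckets examples by length)."""
    batch, used = [], 0
    for idx, n in enumerate(example_lengths):
        if batch and used + n > batch_tokens:
            yield batch
            batch, used = [], 0
        batch.append(idx)
        used += n
    if batch:
        yield batch

# At ~96 tokens per example (question + image regions),
# a 3072-token budget yields batches of roughly 32 examples.
```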