jacksonsc007 opened this issue 2 months ago
Hello @jacksonsc007, thank you for your question. If the total batch size is increased by `k` times, you can set `lr` to `sqrt(k)` times to keep the gradient variance unchanged, or set it to `k` times according to the linear scaling rule. In practice, the latter is more commonly used. We use `lr=1e-4` for `total_batch_size=10`, so you should use `lr=1.6e-4` for `total_batch_size=16` and `lr=2e-5` for `total_batch_size=2` to achieve close performance.
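To make the arithmetic concrete, here is a minimal sketch of both rules in Python (the `scale_lr` helper name is my own, for illustration only, and is not part of the repo):

```python
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int,
             rule: str = "linear") -> float:
    """Scale a learning rate when the total batch size changes by k times.

    rule="linear": lr' = lr * k       (linear scaling rule)
    rule="sqrt":   lr' = lr * sqrt(k) (keeps gradient variance unchanged)
    """
    k = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# Values from this thread: lr=1e-4 at total_batch_size=10.
print(scale_lr(1e-4, 10, 16))          # 1.6e-4 for total_batch_size=16
print(scale_lr(1e-4, 10, 2))           # 2e-05  for total_batch_size=2
print(scale_lr(1e-4, 10, 16, "sqrt"))  # ~1.26e-4 under the sqrt rule
```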
Thanks for your prompt reply. I will try your suggestions and report the results later.
By the way, could you point me to the relevant papers for the learning rate rule you just mentioned?
The linear scaling rule is described in this paper: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
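For context, that paper pairs the linear scaling rule with a gradual warmup to avoid instability early in training with large batches. Below is a minimal PyTorch sketch of such a warmup, assuming a linearly scaled target `lr` and an illustrative `warmup_iters` value (this is not the repo's actual training code):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
# Target lr after warmup, e.g. the linearly scaled value from above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.6e-4)

warmup_iters = 500  # illustrative; the paper warms up over the first ~5 epochs

# Ramp lr linearly from ~0 up to the target over warmup_iters iterations,
# then hold it constant (a decay schedule would normally take over later).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters),
)

# In the training loop, call scheduler.step() once per iteration,
# after optimizer.step().
```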
Question
Hi @xiuqhou, thanks for your enlightening work. I came across some questions while reproducing it.
1. How many GPUs did you use to train the model?
2. Do I need to change the initial learning rate if I adopt a different total batch size (`num_gpus * batchsize_per_gpu`)? Is there a policy that makes the final performance insensitive to the total batch size?

In my own experiments, model performance is not consistent across different `total_batch_size` values. I compared a 1x2 setting (1 GPU, 2 images per GPU) and a 4x4 setting (4 GPUs, 4 images per GPU) with the same initial learning rate, and the results show a non-trivial gap between them: the 4x4 setting lags behind the 1x2 setting by 2 AP.
Best regards