xiuqhou / Salience-DETR

[CVPR 2024] Official implementation of the paper "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement"
https://arxiv.org/abs/2403.16131
Apache License 2.0

Learning rate for training #21

Open jacksonsc007 opened 2 months ago

jacksonsc007 commented 2 months ago

Question

Hi, @xiuqhou Thanks for your enlightening work. I came across some questions while reproducing your work.

  1. How many GPUs did you use to train the model?

  2. Do I need to change the initial learning rate if I adopt a different total batch size (num_gpus * batchsize_per_gpu)? Is there a policy that makes the final performance insensitive to the total batch size?

In my personal experiments, model performance is not consistent across different total batch sizes. I experimented with 1x2 (1 GPU, 2 images per GPU) and 4x4 (4 GPUs, 4 images per GPU) settings with the same initial learning rate, but the results show a non-trivial gap between them (the 4x4 setting lags behind the 1x2 setting by 2 AP).

Best regards

Additional

No response

xiuqhou commented 2 months ago

Hello @jacksonsc007 , thank you for your question.

  1. We used 2 A800 GPUs to train the model. The batch size on each GPU is 5, so the total batch size is 10. The learning rate is set to 1e-4.
  2. There are two policies for adjusting the learning rate according to the total batch size. If the batch size increases k times, you can either set the lr to sqrt(k) times its original value to keep the gradient variance unchanged, or set it to k times according to the linear scaling rule. In practice, the latter is more commonly used.

We use lr=1e-4 for total_batch_size=10, so you should use lr=1.6e-4 for total_batch_size=16 and lr=2e-5 for total_batch_size=2 to achieve comparable performance.
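
To make the arithmetic concrete, here is a minimal sketch of both policies applied to the numbers above (`scale_lr` is a hypothetical helper for illustration, not part of this repo):

```python
def scale_lr(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Scale a reference learning rate to a new total batch size.

    rule="linear": lr' = lr * k        (linear scaling rule)
    rule="sqrt":   lr' = lr * sqrt(k)  (keeps gradient variance unchanged)
    where k = new_batch_size / base_batch_size.
    """
    k = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# Reference setting from this thread: lr=1e-4 at total_batch_size=10.
print(scale_lr(1e-4, 10, 16))  # 1.6e-4 for total_batch_size=16
print(scale_lr(1e-4, 10, 2))   # 2e-5  for total_batch_size=2
```

Note that when k is large, the linear rule is usually paired with a warmup phase, as in the paper referenced below.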

jacksonsc007 commented 2 months ago

Thanks for your prompt reply. I will try your suggestions and report the results later.

By the way, could you point me to the relevant papers for the learning rate rules you just mentioned?

xiuqhou commented 2 months ago

The linear scaling rule is proposed in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (Goyal et al.): https://arxiv.org/abs/1706.02677