LaCandela opened this issue 4 years ago

Hi! I have a question about the linear learning rate scaling you are using. In the publication https://arxiv.org/abs/1706.02677 this scaling rule is only proven for SGD, but you are using Adam. Did you do, or do you know of, any experiments that back up this approach?

I searched the learning rate locally (2.5e-4 and 6.125e-5 for batch size 32); the performance differences are within the random noise (<0.4 COCO AP).

OK, thank you for your answer! Did you experiment with different batch sizes and correspondingly up-/downscaled learning rates with Adam, to see whether the linear scaling rule holds?
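For readers unfamiliar with the rule under discussion, here is a minimal sketch of how the learning rate would be rescaled when the batch size changes. The `scale_lr` helper is hypothetical (not from any repository in this thread); the `"linear"` rule is the one proven for SGD in the paper above, and the `"sqrt"` rule is an alternative sometimes suggested for adaptive optimizers such as Adam.

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int,
             rule: str = "linear") -> float:
    """Rescale a tuned learning rate for a new batch size.

    rule="linear": lr grows proportionally to batch size
                   (the rule from https://arxiv.org/abs/1706.02677, SGD).
    rule="sqrt":   lr grows with the square root of batch size,
                   an alternative sometimes proposed for Adam.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule!r}")

# Example: a learning rate tuned at batch size 32, moved to batch size 256.
print(scale_lr(1e-4, 32, 256))          # linear: 8x the base rate
print(scale_lr(1e-4, 32, 256, "sqrt"))  # sqrt: ~2.83x the base rate
```

Whether the linear variant transfers to Adam is exactly the open question in this thread; the local search reported above only probes it at a single batch size.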