yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint Localization via Transformer", ICCV 2021.
https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf
MIT License

Multiple GPUs training #13

Closed EckoTan0804 closed 3 years ago

EckoTan0804 commented 3 years ago

Hello. I have some questions regarding training with multiple GPUs.

GPU setting in config files

In the README:

We trained our different models on different hardware platforms: 2 x RTX2080Ti GPUs (TP-R-A3, TP-R-A4), 4 x TiTan XP GPUs (TP-H-S, TP-H-A4), and 4 x Tesla P40 GPUs (TP-H-A6).

However, it seems that this note does not match some of the config files in the folder TransPose/experiments/coco/.

Maybe the GPU settings in these config files are incorrect?

Scaling the batch size and learning rate

As mentioned in #11,

From my experience, the performance of the transpose-r models is very sensitive to the initial learning rate. I did not train transpose-r-a4 on 4 or 8 GPUs. I suggest you increase the initial learning rate a little bit in such conditions (with a larger batch size).

Currently I can use 4 RTX2080Ti GPUs for training. Do you have any suggestions for scaling the batch size and learning rate when training on multiple GPUs?

Many thanks in advance!

yangsenius commented 3 years ago

Hi, @EckoTan0804. Sorry for the confusing GPU settings; I trained the models on several different hardware setups :)

The correct settings should be:

- 2 x RTX2080Ti GPUs -> TP-R-A3
- 1 x RTX2080Ti GPU -> TP-R-A4
- 4 x TiTan XP GPUs -> TP-H-S, TP-H-A4
- 4 x Tesla P40 GPUs -> TP-H-A5, TP-H-A6
Note: I adjusted the batch size in each setting to fit the maximum GPU memory capacity. Here is the log of TP-R-A4.
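For context, a minimal sketch of how multi-GPU training typically works in this kind of codebase; I'm assuming an HRNet-style pipeline where the batch is split across the listed devices with `DataParallel` (the model here is a hypothetical stand-in, not the actual TransPose model):

```python
import torch

# Sketch of HRNet-style multi-GPU wrapping (assumed setup).
gpus = [0, 1]                        # e.g. 2 x RTX2080Ti for TP-R-A3
model = torch.nn.Linear(10, 17)      # hypothetical stand-in for the pose model
model = torch.nn.DataParallel(model, device_ids=gpus).cuda()

# DataParallel splits each batch across the listed devices, so the
# effective (total) batch size is per-GPU batch size x number of GPUs.
batch_size_per_gpu = 32              # illustrative value
total_batch_size = batch_size_per_gpu * len(gpus)   # 64
```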

Fixing the initial learning rate to 1e-4 may work as a one-size-fits-all strategy. If you use 4 RTX 2080Ti GPUs to train large models such as transpose-h-x, I suggest you keep the initial learning rate at 1e-4. And if you train small models with a large batch size, I suggest you slightly increase the initial learning rate, such as to 2e-4 or 5e-4, and increase the final learning rate as well.
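As a concrete starting point, the common linear-scaling heuristic raises both the initial and the final learning rate by the batch-size ratio. A minimal sketch, assuming a cosine-annealed schedule from 1e-4 down to 1e-5 at a reference batch size of 32 (all values illustrative, not necessarily the repo's exact defaults):

```python
import torch

# Reference values (assumed, for illustration only).
REF_BATCH_SIZE = 32
REF_INIT_LR, REF_END_LR = 1e-4, 1e-5

def scaled_lrs(total_batch_size, ref_batch=REF_BATCH_SIZE):
    """Scale the initial and final LR linearly with the batch-size ratio."""
    scale = total_batch_size / ref_batch
    return REF_INIT_LR * scale, REF_END_LR * scale

# Example: 4 GPUs x 32 images each = total batch size 128.
init_lr, end_lr = scaled_lrs(4 * 32)          # -> 4e-4, 4e-5

model = torch.nn.Linear(10, 17)               # hypothetical stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
# Anneal down to the (scaled) final learning rate.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=230, eta_min=end_lr)     # T_max = total epochs (assumed)
```

Note that the advice above is gentler than strict linear scaling (2e-4 to 5e-4 rather than a full 4x), so treat the scaled value as an upper bound to tune from.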

EckoTan0804 commented 3 years ago

Thanks for your answer!

If I use a smaller input image (HxW = 128x96) and a smaller heatmap (HxW = 32x24), how should I adjust the learning rate properly?
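(For the sizes above: under the usual SimpleBaseline/HRNet convention, the heatmap is 1/4 of the input resolution, so a 128x96 input gives a 32x24 heatmap. A one-line check, with that convention assumed:)

```python
# Heatmaps are 1/4 of the input resolution (assumed convention).
image_size = (128, 96)                        # H x W
heatmap_size = tuple(s // 4 for s in image_size)
assert heatmap_size == (32, 24)
```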

yangsenius commented 3 years ago

I have not tried this. I suggest keeping the same learning rate as used for the 256x192 input resolution.

EckoTan0804 commented 3 years ago

Thanks for your suggestion!