yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.
https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf
MIT License

Multiple GPUs training #13

Closed: EckoTan0804 closed this issue 3 years ago

EckoTan0804 commented 3 years ago

Hello. I have some questions regarding training with multiple GPUs.

GPU setting in config files

In README

We trained our different models on different hardware platforms: 2 x RTX2080Ti GPUs (TP-R-A3, TP-R-A4), 4 x TiTan XP GPUs (TP-H-S, TP-H-A4), and 4 x Tesla P40 GPUs (TP-H-A6).

However, it seems that this note does not match some of the config files in the folder TransPose/experiments/coco/.

Could it be that the GPU settings in these config files are incorrect?

Scaling the batch size and learning rate

As mentioned in #11,

From my experience, the performance of transpose-r models is very sensitive to the initial learning rate. I did not train transpose-r-a4 on 4 or 8 GPUs. I suggest you increase the initial learning rate a little under such conditions (with a larger batch size).

Currently I can use 4 RTX 2080Ti GPUs for training. Do you have any suggestions for scaling the batch size and learning rate when training on multiple GPUs?

Many thanks in advance!

yangsenius commented 3 years ago

Hi, @EckoTan0804. Sorry for the confusing GPU settings; I trained the models on many different hardware setups :)

The correct settings should be:

- 2 x RTX 2080Ti GPUs -> TP-R-A3
- 1 x RTX 2080Ti GPU -> TP-R-A4
- 4 x Titan XP GPUs -> TP-H-S, TP-H-A4
- 4 x Tesla P40 GPUs -> TP-H-A5, TP-H-A6
Note: I adjusted the batch size in each setting to fit the maximum GPU memory capacity. Here is the log of TP-R-A4.
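For context, here is a minimal sketch of how HRNet-style pose codebases (which TransPose follows) spread the global batch across the GPUs listed in the experiment config; the stand-in model and the batch numbers are placeholders, not the repo's exact code:

```python
import torch
import torch.nn as nn

# Placeholder for the real pose model built from the experiment config.
model = nn.Sequential(nn.Conv2d(3, 17, kernel_size=1))

gpus = [0, 1]  # e.g. cfg.GPUS for a 2 x RTX 2080Ti run (assumption)
model = nn.DataParallel(model, device_ids=gpus).cuda()

# DataParallel splits this global batch across the GPUs, so it is the
# per-GPU share (global batch / len(gpus)) that must fit in each card's memory.
images = torch.randn(32 * len(gpus), 3, 256, 192).cuda()
heatmaps = model(images)  # shape: (64, 17, 256, 192)
```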

Fixing the initial learning rate at 1e-4 may be a one-size-fits-all strategy. If you use 4 RTX 2080Ti GPUs to train large models such as transpose-h-x, I suggest keeping the initial learning rate at 1e-4. And if you train small models with a large batch size, I suggest slightly enlarging the initial learning rate, to e.g. 2e-4 or 5e-4, and enlarging the final learning rate as well.
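To make this concrete, here is a small sketch of the linear-scaling heuristic that the advice above amounts to; the baseline batch size and final learning rate are assumptions for illustration, not values taken from the repo's configs:

```python
# Linear LR scaling heuristic: grow the learning rate in proportion to
# the global batch size. Baseline numbers below are assumptions.
ref_batch = 32        # assumed single-GPU batch size
ref_lr = 1e-4         # initial LR recommended above
ref_lr_end = 1e-5     # assumed final LR of the schedule

n_gpus = 4            # e.g. 4 x RTX 2080Ti
new_batch = ref_batch * n_gpus

scale = new_batch / ref_batch
new_lr = ref_lr * scale          # 4e-4, inside the suggested 2e-4..5e-4 range
new_lr_end = ref_lr_end * scale  # enlarge the final LR as well
print(new_lr, new_lr_end)        # 0.0004 4e-05
```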

EckoTan0804 commented 3 years ago

Thanks for your answer!

If I use a smaller input image (HxW = 128x96) and a smaller heatmap (HxW = 32x24), how should I adjust the learning rate?

yangsenius commented 3 years ago

I have not tried this. I suggest keeping the same learning rate as used with the 256x192 input resolution.
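For reference, the heatmap resolution in this family of pose estimators is 1/4 of the input resolution, which is where the 32x24 heatmap for a 128x96 input comes from; a quick sketch:

```python
# Heatmaps are predicted at 1/4 of the input resolution
# (256x192 -> 64x48, 128x96 -> 32x24).
input_h, input_w = 128, 96
heatmap_h, heatmap_w = input_h // 4, input_w // 4
assert (heatmap_h, heatmap_w) == (32, 24)
```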

EckoTan0804 commented 3 years ago

Thanks for your suggestion!