EckoTan0804 closed this issue 3 years ago.
Hi, @EckoTan0804. Sorry for the confusing GPU settings; I trained the models on many different hardware setups :)
The correct settings should be:
2 x RTX 2080 Ti GPUs -> TP-R-A3, 1 x RTX 2080 Ti GPU -> TP-R-A4, 4 x Titan Xp GPUs -> (TP-H-S, TP-H-A4), and 4 x Tesla P40 GPUs -> (TP-H-A5, TP-H-A6).
Note: I adjusted the batch size for each setting to fit within the GPUs' memory capacity. Here is the log of TP-R-A4.
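For clarity, the effective (total) batch size is the per-GPU batch size times the number of GPUs, so the same model can end up with a different total batch size on different hardware. A minimal sketch (the per-GPU values below are hypothetical, not taken from the released configs):

```python
def effective_batch_size(batch_per_gpu, num_gpus):
    """Total samples per optimizer step when the batch is split across GPUs
    (data-parallel training)."""
    return batch_per_gpu * num_gpus

# Hypothetical per-GPU values -- the real ones depend on each card's memory.
print(effective_batch_size(batch_per_gpu=32, num_gpus=1))   # 32, e.g. a single RTX 2080 Ti
print(effective_batch_size(batch_per_gpu=32, num_gpus=4))   # 128, e.g. four Tesla P40s
```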
Fixing the initial learning rate to 1e-4 may not be a one-size-fits-all strategy. If you use 4 RTX 2080 Ti GPUs to train large models such as transpose-h-x, I suggest keeping the initial learning rate at 1e-4. If you train small models with a large batch size, I suggest slightly enlarging the initial learning rate, e.g. to 2e-4 or 5e-4, and enlarging the ended learning rate as well.
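The common heuristic behind this advice is linear scaling: grow the learning rate roughly in proportion to the total batch size. A minimal sketch under that assumption (the baseline of 1e-4 at batch size 32 and the 1/100 ratio for the ended learning rate are illustrative choices, not the exact values from the training logs):

```python
def scale_lr(base_lr, base_batch, new_batch):
    """Linear-scaling heuristic: keep lr / total_batch_size roughly constant."""
    return base_lr * new_batch / base_batch

# Illustrative baseline: initial lr 1e-4 at a total batch size of 32.
# The ended lr is set to 1/100 of the initial lr purely for illustration.
base_lr, base_batch = 1e-4, 32
for new_batch in (32, 64, 128):
    init_lr = scale_lr(base_lr, base_batch, new_batch)
    end_lr = init_lr / 100
    print(f"total batch {new_batch:3d}: initial lr {init_lr:.1e}, ended lr {end_lr:.1e}")
```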
Thanks for your answer!
If I use a smaller input image (HxW = 128x96) and a smaller heatmap (HxW = 32x24), how should I adjust the learning rate properly?
I have not tried this. I suggest keeping the same learning rate as for the 256x192 input resolution.
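For reference, the heatmap in these pose models is typically 1/4 of the input resolution, which is how a 128x96 input maps to a 32x24 heatmap (same ratio as 256x192 -> 64x48). A minimal sketch of that relation:

```python
def heatmap_size(input_h, input_w, downsample=4):
    """Heatmap resolution assuming the usual 4x downsampling of the input."""
    return input_h // downsample, input_w // downsample

print(heatmap_size(256, 192))  # (64, 48) -- the default 256x192 setting
print(heatmap_size(128, 96))   # (32, 24) -- the smaller setting asked about above
```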
Thanks for your suggestion!
Hello. I have some questions regarding training with multiple GPUs.
GPU settings in config files
The README notes which GPU settings were used to train each model. However, this note does not seem to match some of the config files in the folder TransPose/experiments/coco/:
In TransPose/experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml, only 1 GPU is used (instead of 2) at line 7.
In TransPose/experiments/coco/transpose_h/TP_H_w32_256x192_stage3_1_4_d64_h128_relu_enc4_mh1.yaml, 2 GPUs are used instead of 4.
In TP_H_w48_256x192_stage3_1_4_d64_h128_relu_enc4_mh1.yaml, TP_H_w48_256x192_stage3_1_4_d96_h192_relu_enc4_mh1.yaml, and TP_H_w48_256x192_stage3_1_4_d96_h192_relu_enc5_mh1.yaml, only 1 GPU is used instead of 4.
Maybe the GPU settings in these config files are incorrect?
Scaling the batch size and learning rate
Following up on #11: currently I can use 4 RTX 2080 Ti GPUs for training. Do you have any suggestions for scaling the batch size and learning rate when training on multiple GPUs?
Many thanks in advance!