Closed Eric-Gty closed 1 year ago
I have not tried multi-GPU training. For me, using an A100 GPU, the main bottleneck was IO. So I would check whether your GPU utilization is at 100%; otherwise, I don't think multi-GPU will help.
2hrs per epoch is faster than my training.
Thanks for your answer; then I'll just stick with the single-GPU setting.
Best, Eric
Hi Alex,
Thanks for your work. I have a question regarding the setup for training ArcticNet-SF with multiple GPUs. I ran the provided command for training it following the CVPR setup, and it takes about 2 hours per epoch, so I would like to accelerate it.
However, simply changing the code within the trainer function (https://github.com/zc-alexfan/arctic/blob/fc6f7d72aa3103481307333b2a942422c24fd65d/scripts_method/train.py#L46) to set the number of devices and the strategy caused many problems. I think this is because there are many `.to(device)` operations in the data pre-processing code, which makes it hard to run directly in a multi-GPU setup.
Have you tried with multi-GPU training? If so, could you please provide more suggestions for this?
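For reference, the device-placement pattern that typically breaks under DDP can be sketched as follows; the names here are illustrative, not ARCTIC's actual code:

```python
# Under DDP each process owns a different local device, so preprocessing
# must derive the device from the data/model rather than pin everything
# to a fixed GPU.

def pick_device(batch, model_device=None):
    # bad:  batch.to("cuda:0")  -> all DDP processes fight over GPU 0
    # good: follow the device the model or batch already lives on
    if model_device is not None:
        return model_device
    return getattr(batch, "device", "cpu")
```

Under PyTorch Lightning, one option is to keep such tensor moves inside the Lightning hooks (e.g. `transfer_batch_to_device` or the step hooks), where Lightning supplies the correct per-process device.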