zc-alexfan / arctic

[CVPR 2023] Official repository for downloading, processing, visualizing, and training models on the ARCTIC dataset.
https://arctic.is.tue.mpg.de

Training setup with multiple GPU #6

Closed Eric-Gty closed 1 year ago

Eric-Gty commented 1 year ago

Hi Alex,

Thanks for your work. I have a question regarding the setup for training ArcticNet-SF with multiple GPUs. I ran the provided training command following the CVPR setup, and it takes about 2 hours per epoch, so I would like to accelerate it.

However, simply changing the code in the trainer function (https://github.com/zc-alexfan/arctic/blob/fc6f7d72aa3103481307333b2a942422c24fd65d/scripts_method/train.py#L46) to set the number of devices and the strategy caused many problems. I think it's because there are many `.to(device)` operations in the data pre-processing code, so it's hard to run the code directly with a multi-GPU setup.

Have you tried with multi-GPU training? If so, could you please provide more suggestions for this?
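For context, the failure mode described above is common: under data-parallel training, each spawned process owns a different GPU, so any pre-processing code that hardcodes `.to("cuda:0")` (or a single global device) pushes every replica's tensors onto GPU 0. A minimal sketch of the usual fix, assuming a launcher that sets the standard `LOCAL_RANK` environment variable (as PyTorch's distributed launchers and PyTorch Lightning do); `local_device` is a hypothetical helper, not part of the ARCTIC codebase:

```python
import os

def local_device() -> str:
    # Each DDP worker process is launched with its own LOCAL_RANK;
    # deriving the device from it keeps replicas on their own GPUs,
    # instead of all calling .to("cuda:0") like a hardcoded device would.
    rank = int(os.environ.get("LOCAL_RANK", 0))
    return f"cuda:{rank}"

# Simulate what a launcher would set for the third worker process.
os.environ["LOCAL_RANK"] = "2"
print(local_device())  # cuda:2
```

Replacing the hardcoded device in the pre-processing path with something rank-aware like this is typically the first step before multi-GPU training can work at all.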

zc-alexfan commented 1 year ago

I have not tried multi-GPU training. For me, using an A100 GPU, the main bottleneck was IO. So I would check whether your GPU utilization is at 100%; otherwise I don't think multi-GPU will help.

2 hours per epoch is faster than my training.
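One generic way to test the IO-bottleneck claim above is to time data loading separately from the training step. This is a sketch, not part of the ARCTIC code; `step_fn` is a hypothetical stand-in for the forward/backward pass, and `batches` would be the real DataLoader:

```python
import time

def profile_epoch(batches, step_fn):
    # Split wall-clock time into data-loading vs. compute.
    # If load_t dominates, IO is the bottleneck and adding
    # GPUs will not speed up the epoch.
    load_t = step_t = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        batch = next(it, None)  # assumes no batch is literally None
        if batch is None:
            break
        load_t += time.perf_counter() - t0
        t0 = time.perf_counter()
        step_fn(batch)
        step_t += time.perf_counter() - t0
    return load_t, step_t
```

Running this over a few hundred batches and comparing the two totals shows whether GPU utilization drops because the loader cannot keep up.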

Eric-Gty commented 1 year ago

> I have not tried multi-GPU training. For me, using an A100 GPU, the main bottleneck was IO. So I would check whether your GPU utilization is at 100%; otherwise I don't think multi-GPU will help.
>
> 2 hours per epoch is faster than my training.

Thanks for your answer; then I'll just follow the single-GPU setting.

Best, Eric