zhijian-liu / torchpack

A neural network training interface based on PyTorch, with a focus on flexibility
https://pypi.org/project/torchpack/
MIT License
61 stars, 15 forks

Multi Node training #46

Closed AlexIlis closed 6 months ago

AlexIlis commented 1 year ago

Can you suggest how to implement multi-GPU, multi-node training with torchpack?

I have set `-H ip1:gpus,ip2:gpus` and launched training from both nodes, but they don't seem to find one another. What am I missing here?

zhijian-liu commented 11 months ago

Could you try to SSH into ip1 and ip2? You need to make sure that both machines can be SSH-ed into without a password.
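
A quick way to verify this is a minimal Python sketch like the one below. It assumes the `-H` value has the `host:slots,host:slots` form used in the question, and it checks each host with plain `ssh` in batch mode, which fails instead of prompting when key-based login is not set up. The helper names (`parse_hosts`, `ssh_check_command`, `can_ssh_without_password`, `report`) are hypothetical and not part of torchpack.

```python
import subprocess


def parse_hosts(spec):
    """Split a 'host1:4,host2:4' string into [(host, slots), ...].

    The 'host:slots' layout is an assumption based on the -H value
    quoted in the question; a missing slot count defaults to 1.
    """
    hosts = []
    for entry in spec.split(","):
        host, _, slots = entry.strip().partition(":")
        hosts.append((host, int(slots) if slots else 1))
    return hosts


def ssh_check_command(host):
    """Build an ssh command that runs `true` on the remote host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of asking for a password
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        host,
        "true",
    ]


def can_ssh_without_password(host):
    """Return True if a non-interactive SSH login to `host` succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection hung
        return False
    return result.returncode == 0


def report(spec):
    """Print one line per host saying whether passwordless SSH works."""
    for host, slots in parse_hosts(spec):
        status = "ok" if can_ssh_without_password(host) else "FAILED"
        print(f"{host} ({slots} slots): passwordless SSH {status}")
```

Calling `report("ip1:4,ip2:4")` on the machine you launch from should print `ok` for every host. If torchpack's launcher behaves like other SSH-based launchers (an assumption here, not something confirmed in this thread), the training command is then started once from that machine rather than separately on each node.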