zhijian-liu / torchpack

A neural network training interface based on PyTorch, with a focus on flexibility
https://pypi.org/project/torchpack/
MIT License

`comm.py` should maybe consider backend-specific device support #30

Open · lucaslie opened this issue 2 years ago

lucaslie commented 2 years ago

Depending on the backend, distributed communication may only be supported on the CPU or only on the GPU; see the backend feature table in the PyTorch distributed docs (https://pytorch.org/docs/stable/distributed.html). For example, nccl supports only CUDA tensors, while gloo performs its collectives primarily on CPU tensors.

Right now, `comm.py` always performs communication on the GPU; see, e.g.: https://github.com/zhijian-liu/torchpack/blob/d3fda521bc2e2684643a46103ecece816b53842b/torchpack/distributed/comm.py#L32-L34
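The linked lines follow the usual pickle-to-byte-tensor pattern; paraphrasing (this is a sketch, not an exact quote of the file):

```python
import pickle

import torch

data = {"example": 123}  # any picklable object

buffer = pickle.dumps(data)                      # serialize the object
storage = torch.ByteStorage.from_buffer(buffer)  # wrap the raw bytes
tensor = torch.ByteTensor(storage).cuda()        # always moved to the GPU
```

Because of the unconditional `.cuda()`, a CPU-only job using the gloo backend fails even though gloo itself can communicate CPU tensors.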

I would suggest taking the backend-specific device support into account in both `allgather()` and `broadcast()` so that these functions remain usable across backends.
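A minimal sketch of what that could look like; the helper name `get_comm_device` is made up for illustration, and the backend-to-device mapping follows the PyTorch docs (nccl operates only on CUDA tensors, gloo communicates on CPU tensors):

```python
import torch
import torch.distributed as dist


def get_comm_device() -> torch.device:
    """Hypothetical helper: pick a device the current backend can use."""
    backend = dist.get_backend()
    if backend == dist.Backend.NCCL:
        # nccl only operates on CUDA tensors.
        return torch.device("cuda", torch.cuda.current_device())
    # gloo (and typically mpi, depending on how it was built) uses CPU tensors.
    return torch.device("cpu")
```

`allgather()` and `broadcast()` could then move the serialized tensor to `get_comm_device()` instead of calling `.cuda()` unconditionally.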

`torch.distributed.broadcast_object_list` and `torch.distributed.all_gather_object` might be useful starting points for this.
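For reference, a minimal usage sketch of those two collectives (`payload` is an arbitrary example object; note that under the nccl backend they expect the current CUDA device to be set via `torch.cuda.set_device()`):

```python
import torch.distributed as dist

# Assumes the default process group has already been initialized,
# e.g. via dist.init_process_group(...).
payload = {"rank": dist.get_rank(), "msg": "hello"}

# Gather an arbitrary picklable object from every rank.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, payload)
# gathered[i] now holds rank i's payload, on every rank.

# Broadcast a list of picklable objects from rank 0 to all ranks.
objects = [payload if dist.get_rank() == 0 else None]
dist.broadcast_object_list(objects, src=0)
# objects[0] now holds rank 0's payload, on every rank.
```

Both functions handle pickling internally, so they would remove the need for the manual byte-tensor serialization in `comm.py`.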