lucaslie opened this issue 2 years ago
Depending on the backend, distributed communication may only be supported on CPU or only on GPU (see the backend support table in the torch.distributed documentation).
Right now, in comm.py, communication is always done on the GPU; see e.g. https://github.com/zhijian-liu/torchpack/blob/d3fda521bc2e2684643a46103ecece816b53842b/torchpack/distributed/comm.py#L32-L34
I would suggest taking backend-specific device support into account in both allgather() and broadcast(), so the functions are usable across multiple backends.
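A minimal sketch of what a backend-aware allgather() could look like (this is not a patch against torchpack's comm.py; _comm_device is a hypothetical helper, and the serialization details are just one way to do it):

```python
import pickle

import torch
import torch.distributed as dist


def _comm_device() -> torch.device:
    """Pick the device supported by the active backend (hypothetical helper).

    NCCL only supports CUDA tensors; Gloo collectives used here work on CPU
    tensors, so fall back to CPU for anything other than NCCL.
    """
    if dist.get_backend() == dist.Backend.NCCL:
        return torch.device("cuda")
    return torch.device("cpu")


def allgather(data):
    """Gather an arbitrary picklable object from every rank (sketch)."""
    world_size = dist.get_world_size()
    if world_size == 1:
        return [data]

    device = _comm_device()

    # Serialize the object into a uint8 tensor on the backend-supported device.
    buffer = pickle.dumps(data)
    tensor = torch.frombuffer(bytearray(buffer), dtype=torch.uint8).to(device)

    # Exchange the per-rank payload sizes so tensors can be padded to a common length.
    local_size = torch.tensor([tensor.numel()], dtype=torch.int64, device=device)
    sizes = [torch.zeros(1, dtype=torch.int64, device=device) for _ in range(world_size)]
    dist.all_gather(sizes, local_size)
    max_size = int(max(s.item() for s in sizes))

    padded = torch.zeros(max_size, dtype=torch.uint8, device=device)
    padded[: tensor.numel()] = tensor
    output = [torch.zeros(max_size, dtype=torch.uint8, device=device) for _ in range(world_size)]
    dist.all_gather(output, padded)

    # Trim the padding and deserialize each rank's payload.
    return [
        pickle.loads(t[: int(s.item())].cpu().numpy().tobytes())
        for t, s in zip(output, sizes)
    ]
```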
torch.distributed.broadcast_object_list and torch.distributed.all_gather_object might be useful starting points for this; a sketch follows below.
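For reference, a rough sketch of the two functions built on the object-based collectives, which handle serialization and backend-specific device placement internally (the signatures mirror comm.py, but this is not a drop-in patch):

```python
import torch.distributed as dist


def allgather(data):
    """Gather an arbitrary picklable object from every rank (sketch)."""
    world_size = dist.get_world_size()
    if world_size == 1:
        return [data]
    output = [None] * world_size
    dist.all_gather_object(output, data)
    return output


def broadcast(data, src=0):
    """Broadcast an arbitrary picklable object from rank `src` (sketch)."""
    if dist.get_world_size() == 1:
        return data
    container = [data if dist.get_rank() == src else None]
    dist.broadcast_object_list(container, src=src)
    return container[0]
```

One caveat from the PyTorch docs: with the NCCL backend, the object collectives still move their internal tensors to torch.cuda.current_device(), so each rank needs torch.cuda.set_device() to have been called appropriately.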