Closed · chen-judge closed this issue 4 years ago
Hi @chen-judge!
Thank you for asking! Sorry, it is indeed a bug, caused by the .cuda() operation, which loads data onto GPU 0 by default.
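As a quick workaround, the crashing line can be made device-safe. Below is a minimal, hedged sketch (runnable on CPU); `freq_inv` stands in for `self.freq_inv` from resample_loss.py, and the shapes (batch of 4, 80 classes) are made up for illustration:

```python
import torch

# freq_inv plays the role of self.freq_inv; in the repo it was created with
# .cuda() and therefore sits on GPU 0 regardless of where the labels are.
freq_inv = torch.rand(80)                        # per-class inverse frequency
gt_labels = torch.randint(0, 2, (4, 80)).float()

# Move freq_inv to wherever the labels live before the elementwise multiply,
# instead of assuming both tensors ended up on the same GPU.
repeat_rate = torch.sum(gt_labels * freq_inv.to(gt_labels.device),
                        dim=1, keepdim=True)
print(repeat_rate.shape)                         # torch.Size([4, 1])
```

A more permanent fix would be to register `freq_inv` with `self.register_buffer(...)` in the loss module, so that it is moved and replicated along with the module itself.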
Currently, this code does not support multi-GPU training, mainly because of the use of ClassAwareSampler. It is possible to write a DistributedClassAwareSampler that properly distributes the samples across devices while maintaining the class-aware sampling strategy, but it's not necessary for these datasets since they're rather small and fast to train on.
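For anyone who wants to attempt it, here is a rough sketch of what such a DistributedClassAwareSampler could look like, modeled on `torch.utils.data.distributed.DistributedSampler`. The `class_to_indices` mapping and the class-then-instance sampling scheme are assumptions for illustration, not the repo's actual ClassAwareSampler internals:

```python
import random

import torch.distributed as dist
from torch.utils.data import Sampler


class DistributedClassAwareSampler(Sampler):
    """Rough sketch: draw a class uniformly, then a sample index from that
    class, and let each rank keep a strided shard of the shared sequence.
    `class_to_indices` maps class id -> dataset indices (an assumption)."""

    def __init__(self, class_to_indices, num_samples,
                 num_replicas=None, rank=None, seed=0):
        self.class_to_indices = class_to_indices
        self.classes = list(class_to_indices)
        self.num_replicas = num_replicas or dist.get_world_size()
        self.rank = rank if rank is not None else dist.get_rank()
        # Pad so every rank draws the same number of samples per epoch.
        self.total_size = (-(-num_samples // self.num_replicas)
                           * self.num_replicas)
        self.seed = seed
        self.epoch = 0

    def __iter__(self):
        # The same seed on every rank yields an identical global sequence;
        # each rank then keeps every num_replicas-th element of it.
        rng = random.Random(self.seed + self.epoch)
        indices = []
        for _ in range(self.total_size):
            cls = rng.choice(self.classes)            # class-uniform draw
            indices.append(rng.choice(self.class_to_indices[cls]))
        return iter(indices[self.rank:self.total_size:self.num_replicas])

    def __len__(self):
        return self.total_size // self.num_replicas

    def set_epoch(self, epoch):
        self.epoch = epoch  # reshuffle each epoch, like DistributedSampler
```

As with the stock DistributedSampler, `set_epoch` would need to be called at the start of every epoch so the ranks reshuffle in lockstep.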
Hi, I tried to run

```
python tools/train.py configs/coco/LT_resnet50_pfc_DB.py --gpus 2
```

and ran into the following error:

```
File "/data2/cjq/DistributionBalancedLoss/mllt/models/losses/resample_loss.py", line 164, in rebalance_weight
    repeat_rate = torch.sum(gt_labels.float() * self.freq_inv, dim=1, keepdim=True)
RuntimeError: expected device cuda:1 and dtype Float but got device cuda:0 and dtype Float
```
Does this code support multi-GPU training? Have you ever tried it, or should I fix these bugs myself?
Thanks for your code!