wutong16 / DistributionBalancedLoss

[ ECCV 2020 Spotlight ] Pytorch implementation for "Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets"

Support multi-GPU training? #4

Closed chen-judge closed 4 years ago

chen-judge commented 4 years ago

Hi, I tried to run `python tools/train.py configs/coco/LT_resnet50_pfc_DB.py --gpus 2`

and ran into the following error:

    File "/data2/cjq/DistributionBalancedLoss/mllt/models/losses/resample_loss.py", line 164, in rebalance_weight
        repeat_rate = torch.sum( gt_labels.float() * self.freq_inv, dim=1, keepdim=True)
    RuntimeError: expected device cuda:1 and dtype Float but got device cuda:0 and dtype Float

Does this code support multi-GPU training? Have you tried it, or should I fix these bugs myself?

Thanks for your code!

wutong16 commented 4 years ago

Hi @chen-judge!

Thank you for asking! Sorry, it is indeed a bug caused by the `.cuda()` call, which places the tensor on GPU 0 by default. Currently, this code does not support multi-GPU training, mainly because of the use of `ClassAwareSampler`. It would be possible to write a `DistributedClassAwareSampler` that distributes the samples across devices while keeping the class-aware sampling strategy, but it hasn't been necessary for these datasets since they are fairly small and fast to train on.