yanxp / MetaR-CNN

Meta R-CNN: Towards General Solver for Instance-level Low-shot Learning
https://yanxp.github.io/metarcnn.html

It seems that current implementation does not support multi-gpu #13

Closed zb1439 closed 4 years ago

zb1439 commented 4 years ago

Hi, I am trying to reproduce your results on the COCO dataset. We are using Titan XP GPUs with 12GB of memory each, so training on COCO with 60 ways and a reasonable batch size requires multiple GPUs. However, simply adding

fasterrcnn = nn.DataParallel(fasterrcnn).cuda()

does not work and an error is reported:

'AssertionError: Tensors not supported in scatter.'

How could we get this fixed?
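For context, `nn.DataParallel` replicates the module and splits each tensor argument of `forward()` along dim 0, one chunk per GPU; if I read the PyTorch 0.3-era scatter helper this repo relies on correctly, it additionally insists that those arguments are `Variable`s, which is where the assertion above comes from. A toy sketch of the mechanism (names are illustrative, not from this repo):

```python
import torch
import torch.nn as nn

# Toy illustration (not code from this repo): DataParallel chunks every
# tensor argument of forward() along dim 0 and runs one chunk per GPU,
# so each input needs a leading batch dimension and must be something
# the scatter helper knows how to split.
class Toy(nn.Module):
    def forward(self, x, y):
        return x + y

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(Toy()).cuda()
    x = torch.randn(4, 8).cuda()   # batch of 4, split across the GPUs
    y = torch.randn(4, 8).cuda()
    print(model(x, y).shape)       # torch.Size([4, 8])
```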

zb1439 commented 4 years ago

In train_metarcnn.py, PRN_CLS is a list containing Long-type Tensors instead of Variables, which causes the error. I tried to adapt your implementation so that PRN_CLS becomes a list of Variables, but there is more to modify than I expected before the code is fully fixed. I will probably work on it later and see if I can get it all done.
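The adaptation is roughly the following (a sketch only; the stand-in list and variable names are mine, not the exact ones in train_metarcnn.py):

```python
import torch
from torch.autograd import Variable

# Sketch of the adaptation (names are assumptions, not the exact ones in
# train_metarcnn.py). PRN_CLS is a Python list of raw torch.LongTensor
# objects; the old scatter helper behind nn.DataParallel only splits
# Variables, so each label tensor needs to be wrapped (and moved to the
# GPU) before the list reaches the model's forward().
prn_cls = [torch.LongTensor([c]) for c in range(15)]  # stand-in for the real PRN_CLS
prn_cls = [Variable(t.cuda()) for t in prn_cls]       # wrap as Variables on the GPU
```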

I am wondering whether you really ran both the VOC and COCO training on a single GPU, and I hope updated code will be released (ideally not only multi-GPU support but also your COCO dataset interface, few-shot selection, mask branch, etc. :-) ). Thanks for releasing this code anyway.

zb1439 commented 4 years ago

After adding an extra dimension to the support inputs and labels to enable scattering across multiple GPUs, together with other fixes, I can finally run on multiple GPUs, but with significantly increased training time: 100 iterations with batch size = 4 and 15-way support take 60 s on a Titan XP with your original implementation, while 100 iterations with batch size = 4, 15-way support, and 4 GPUs (batch size = 1 per GPU) take 500+ s.
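Roughly, the extra dimension looks like this (a sketch with assumed names and shapes; the matching squeeze has to happen inside the model's forward() so that every replica sees the full support set):

```python
import torch

num_gpus = torch.cuda.device_count()

# Stand-ins for the real support inputs and labels (shapes are made up):
prn_data = torch.randn(15, 3, 224, 224).cuda()  # 15-way support images
prn_cls = torch.arange(15).long().cuda()        # 15-way support labels

# Sketch: replicate the support set num_gpus times along a new leading
# dimension so DataParallel.scatter() can split it along dim 0 and each
# replica receives one full copy of shape [1, 15, ...].
prn_data = prn_data.unsqueeze(0).expand(num_gpus, *prn_data.size()).contiguous()
prn_cls = prn_cls.unsqueeze(0).expand(num_gpus, *prn_cls.size()).contiguous()

# Inside the model's forward(), the extra dimension is squeezed back out:
#   prn_data = prn_data.squeeze(0)
#   prn_cls = prn_cls.squeeze(0)
```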

Could you share your training settings and memory requirements for the COCO dataset? In our environment, training on COCO with the 60-20 split on a single 12GB Titan XP only fits batch size = 1, and extending your implementation to multiple GPUs makes the training time almost unacceptable. Looking forward to your reply.

yanxp commented 4 years ago

Yes, the implementation does not support multiple GPUs. The second question is similar to #14.

DrugRui commented 4 years ago

Hello, I would like to ask how to modify the code to use multiple GPUs. I always get an error: RuntimeError: dimension specified as 0 but tensor has no dimensions. I tried to solve it but failed... @zb1439
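For what it's worth, in the PyTorch 0.4-era builds this code targets that RuntimeError usually means a zero-dimensional (scalar) tensor reached an operation that needs at least one dimension to split or concatenate along, e.g. the dim-0 chunking inside DataParallel.scatter() or a torch.cat call. A minimal illustration with made-up names:

```python
import torch

# Illustration only (not code from this repo): ops that work "along dim 0",
# such as torch.cat() or the chunking done by DataParallel.scatter(),
# fail on a zero-dimensional (scalar) tensor.
num_boxes = torch.tensor(5)             # 0-dim tensor, shape torch.Size([])
# torch.cat([num_boxes, num_boxes], 0)  # fails: there is no dim 0 to cat along

# Giving the scalar an explicit leading dimension avoids the error:
num_boxes = num_boxes.view(1)           # or .unsqueeze(0); shape torch.Size([1])
print(torch.cat([num_boxes, num_boxes], dim=0))  # tensor([5, 5])
```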