weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

gpu parallel problem #373

Open 1292765944 opened 7 years ago

1292765944 commented 7 years ago

Dear Wei Liu: Recently I have run into two problems with the code.

  1. When I train the standard SSD 300×300 network with the default settings on only 2 Titan X GPUs, I specifically set gpus = "0,1". However, during training the GPU utilization of the two GPUs alternates between 100% and 0%, so it seems the two GPUs are not processing in parallel. Training SSD 300×300 on PASCAL VOC 2007 trainval + 2012 trainval for 120000 iterations takes nearly 48 hours by a rough estimate. What could the problem be? How long does training such a network take on your machine?
  2. Another question: does training always reproduce 77.2% mAP on the PASCAL VOC 2007 test set under the default settings? I get slightly different results with the same training settings. Is that because of the sampling strategy?

    Looking forward to your early reply. Thanks!

weiliu89 commented 7 years ago
  1. I guess that is an issue with Caffe's parallel training code; it performs synchronous SGD.

  2. What are your results? Some difference is expected because of randomness, but as long as it is within an acceptable range (e.g. 77.*), it should be fine.
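The alternating 100%/0% utilization pattern described above is consistent with synchronous SGD: every GPU must wait at a barrier after computing its gradients before parameters are updated. The sketch below is a hypothetical illustration (not Caffe's actual code) using Python threads, where one "GPU" is deliberately slower, so the faster one accumulates idle time at the barrier:

```python
import threading
import time

# Hypothetical sketch of synchronous SGD stalls (not Caffe's real code):
# each worker computes gradients, then waits at a barrier before the
# shared parameter update. A slower worker leaves the other one idle.
NUM_GPUS = 2
barrier = threading.Barrier(NUM_GPUS)
idle_time = [0.0] * NUM_GPUS  # seconds each worker spends waiting

def worker(gpu_id, compute_time, iters=3):
    for _ in range(iters):
        time.sleep(compute_time)   # stand-in for the forward/backward pass
        t0 = time.time()
        barrier.wait()             # all GPUs must arrive before the update
        idle_time[gpu_id] += time.time() - t0

# GPU 0 is made artificially faster than GPU 1 for illustration.
threads = [threading.Thread(target=worker, args=(i, t))
           for i, t in enumerate([0.01, 0.05])]
for th in threads:
    th.start()
for th in threads:
    th.join()

# The faster worker's idle time roughly equals the per-iteration speed gap.
print([round(t, 2) for t in idle_time])
```

In this toy run the faster worker idles for roughly the speed gap on every iteration, which is why a profiler or nvidia-smi can show one GPU busy while the other sits at 0%.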

1292765944 commented 7 years ago

@weiliu89 I'm just reproducing your code, and the accuracy is fine. I would simply like to know the running time on your machine when training SSD_300*300 on PASCAL_VOC 2007 trainval + 2010 trainval for 120000 iterations. Thank you!

GumpCode commented 7 years ago

@1292765944 Hi, in Caffe, if you use 2 GPUs, each one waits for the other to exchange parameters after every iteration; you can search for the details.
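Because of that per-iteration wait, total training time can be estimated with a simple cost model: iterations × (slowest GPU's compute time + parameter-exchange time). The numbers below are purely illustrative assumptions, not measured values from SSD training:

```python
# Back-of-envelope model for synchronous multi-GPU training time.
# All numbers here are illustrative assumptions, not measurements.
def total_hours(iters, compute_per_gpu_s, exchange_s):
    """Estimate wall-clock hours: each iteration costs the slowest
    GPU's compute time plus the time to exchange parameters."""
    per_iter = max(compute_per_gpu_s) + exchange_s
    return iters * per_iter / 3600.0

# Example: if each GPU took 1.2 s per iteration and the exchange took
# 0.24 s, 120000 iterations would land near the 48 hours reported above.
print(round(total_hours(120000, [1.2, 1.2], 0.24), 1))  # → 48.0
```

The model makes the trade-off explicit: adding GPUs only helps if the exchange overhead stays small relative to per-GPU compute time, which is why slow parameter exchange can erase the benefit of a second GPU.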