Open 1292765944 opened 7 years ago
It might be that the preprocessing step is slow. The multi-GPU code is the same as Caffe's previous implementation (no NCCL).
Hello,
I believe one of the reasons for the slowdown with the most recent release is that some of the preprocessing code encodes and decodes images multiple times. I've modified the code so that once an image is decoded, it stays decoded, which yields roughly a 2x speedup. Unfortunately I was working on a different branch when I found and fixed this, but the commits can be found here: https://github.com/dtmoodie/caffe/tree/sanghoon-dev_pvanet 5d34a32d15423d73490e103eed4eff7d8c8399da
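The decode-once idea can be sketched as a simple lazy cache. This is a hypothetical illustration (the class and function names below are not from the linked branch; real Caffe code would use something like `cv2.imdecode`): the encoded bytes are decoded on first access and the decoded array is reused afterwards, instead of being re-decoded at every preprocessing step.

```python
# Hypothetical sketch of the decode-once fix: cache the decoded array
# instead of re-decoding the encoded bytes on every preprocessing pass.
import numpy as np

class CachedDatum:
    """Holds encoded image bytes and lazily caches the decoded array."""
    def __init__(self, encoded_bytes, decode_fn):
        self._encoded = encoded_bytes
        self._decode = decode_fn   # stand-in for a real decoder like cv2.imdecode
        self._decoded = None
        self.decode_calls = 0      # instrumentation to show decoding happens once

    def image(self):
        if self._decoded is None:          # decode only on first access
            self._decoded = self._decode(self._encoded)
            self.decode_calls += 1
        return self._decoded               # later accesses reuse the cached array

# Toy "decoder" standing in for a real image codec.
def fake_decode(buf):
    return np.frombuffer(buf, dtype=np.uint8).reshape(2, 2)

datum = CachedDatum(bytes(range(4)), fake_decode)
for _ in range(3):                         # three preprocessing passes
    _ = datum.image()
print(datum.decode_calls)                  # decoded exactly once despite 3 passes
```

The repeated encode/decode in the original preprocessing is what this avoids; with large training images the redundant decoding dominates, which is consistent with the ~2x speedup reported above.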
Furthermore, this branch is a merge of nvidia/caffe which includes better multi gpu scaling: https://github.com/dtmoodie/caffe/tree/test_ssd_merge
With the https://github.com/dtmoodie/caffe/tree/sanghoon-dev_pvanet branch I can achieve ~50% GPU load on an 8x Titan X (Pascal) machine with a batch size of 8 images per GPU. I get about 1.5-3 iterations per second, which yields roughly 160 frames per second in training.
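As a quick sanity check of those numbers (assuming all 8 GPUs each process their full batch every iteration), the midpoint of the reported iteration rate reproduces the ~160 images/s figure:

```python
# Throughput check for the reported setup (assumed: every GPU contributes
# its full batch each iteration).
gpus = 8
batch_per_gpu = 8
iters_per_sec = 2.5                     # midpoint of the reported 1.5-3 range

images_per_iter = gpus * batch_per_gpu  # 64 images per iteration
throughput = images_per_iter * iters_per_sec
print(throughput)                       # 160.0 images/s, matching the report
```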
In my experiments, this version of Caffe does not scale well with multi-GPU training. Training on two GPUs (batch size 16 per GPU) does not halve the training time compared to one GPU (batch size 32 per GPU). Has anyone else encountered this problem?