yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0
1.27k stars, 358 forks

Lack of scalability #252

Open dimon777 opened 7 years ago

dimon777 commented 7 years ago

I tested CaffeOnSpark on a 3-node Spark cluster with one GPU per node. Here are the results for MNIST (wall time):

GPU mode: 3 executors: 183 seconds; 1 executor: 75 seconds

CPU mode: 3 executors: 101 seconds; 1 executor: 353 seconds
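To make the comparison concrete, the speedup and parallel efficiency implied by these wall times can be worked out directly. This is only a small sketch: the timings are copied from the results above, and the helper names `speedup` and `efficiency` are made up for illustration.

```python
# Wall-clock times in seconds, copied from the results above.
# Keys of the inner dicts are executor counts.
timings = {
    "GPU": {1: 75.0, 3: 183.0},
    "CPU": {1: 353.0, 3: 101.0},
}


def speedup(t_single, t_multi):
    """Speedup of the multi-executor run relative to one executor."""
    return t_single / t_multi


def efficiency(t_single, t_multi, n_executors):
    """Parallel efficiency: speedup divided by the number of executors."""
    return speedup(t_single, t_multi) / n_executors


for mode, t in timings.items():
    s = speedup(t[1], t[3])
    e = efficiency(t[1], t[3], 3)
    print(f"{mode}: speedup {s:.2f}x, efficiency {e:.0%}")
```

With these numbers the GPU run comes out at about 0.41x (a slowdown), while the CPU run comes out at about 3.5x on 3 executors.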

Conclusion:

It seems that the solution doesn't scale in GPU mode: 3 executors take more than twice as long as 1. Can someone explain why and suggest what to look for to improve this?

Thanks.

junshi15 commented 7 years ago

MNIST is too small.

Run the same comparison on a larger neural network and a larger dataset, for example Inception-v3 (or VGG16) on ImageNet.
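One common reading of this advice, offered here as an assumption rather than something stated in the thread, is that a small model finishes each iteration so quickly that the fixed cost of synchronizing executors dominates. The toy cost model below sketches that idea. It is not how CaffeOnSpark measures or schedules anything, and every number in it is hypothetical.

```python
def iteration_time(compute_s, sync_s, n_executors):
    """Toy model: compute splits evenly; sync cost is paid only when n > 1."""
    overhead = sync_s if n_executors > 1 else 0.0
    return compute_s / n_executors + overhead


# Hypothetical per-iteration costs in seconds, chosen only to show the shape
# of the effect. They are not measurements from CaffeOnSpark.
workloads = {
    "small model": dict(compute_s=0.002, sync_s=0.005),  # sync dominates
    "large model": dict(compute_s=0.500, sync_s=0.005),  # compute dominates
}

for name, w in workloads.items():
    t1 = iteration_time(n_executors=1, **w)
    t3 = iteration_time(n_executors=3, **w)
    print(f"{name}: speedup on 3 executors = {t1 / t3:.2f}x")
```

Under these assumed costs the small model gets slower with 3 executors while the large model gets close to a 3x speedup, which is the pattern a larger network and dataset would be expected to reveal if synchronization overhead is the cause.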