yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.

Apache License 2.0

1.27k stars, 358 forks

Lack of scalability #252

Open dimon777 opened 7 years ago

dimon777 commented 7 years ago

I tested CaffeOnSpark on a 3-node Spark cluster, with one GPU per node. Here are the results for MNIST (wall time):

GPU mode: 3 executors: 183 seconds; 1 executor: 75 seconds

CPU mode: 3 executors: 101 seconds; 1 executor: 353 seconds

Conclusions:

Distributed CPU mode outperforms distributed GPU mode by a wide margin
Single executor in GPU mode outperforms distributed GPU configuration
Slowest configuration is single node CPU mode
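
To put numbers on these conclusions, here is a minimal sketch that computes speedup and parallel efficiency from the wall times reported above. The helper names `speedup` and `efficiency` are my own for illustration and are not part of CaffeOnSpark.

```python
# Wall-clock times in seconds, copied from the results above.
# Keys of the inner dicts are executor counts.
timings = {
    "GPU": {1: 75.0, 3: 183.0},
    "CPU": {1: 353.0, 3: 101.0},
}


def speedup(single_s, distributed_s):
    """How many times faster the distributed run is than the single-executor run."""
    return single_s / distributed_s


def efficiency(single_s, distributed_s, executors):
    """Speedup divided by executor count; 1.0 would be ideal linear scaling."""
    return speedup(single_s, distributed_s) / executors


for mode, t in timings.items():
    s = speedup(t[1], t[3])
    e = efficiency(t[1], t[3], 3)
    print(f"{mode}: speedup {s:.2f}x, efficiency {e:.0%}")
```

With these figures the GPU speedup comes out at roughly 0.41x (3 executors are slower than 1), while the CPU speedup is roughly 3.5x on 3 executors.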

It seems that the solution doesn't scale in GPU mode. Can someone explain why and suggest what to look at to improve this?

Thanks.

junshi15 commented 7 years ago

MNIST is too small.

Do the same comparison with a larger neural network and a larger dataset, for example Inception-v3 (or VGG16) on the ImageNet dataset.
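
One way to see why workload size matters is a toy cost model, sketched below. Everything in it is an assumption for illustration: the function `step_time`, the constant `SYNC_S`, and the per-step compute times are hypothetical numbers, not measurements of CaffeOnSpark. The model assumes compute splits evenly across executors and that a fixed synchronization cost is paid on every step once training is distributed.

```python
def step_time(compute_s, sync_s, executors):
    """Toy model of one training step.

    Compute is assumed to split evenly across executors; the synchronization
    cost is assumed to be fixed and paid only when more than one executor runs.
    """
    overhead = sync_s if executors > 1 else 0.0
    return compute_s / executors + overhead


SYNC_S = 0.010  # hypothetical per-step synchronization cost, in seconds

# Hypothetical per-step compute times for a small and a large model.
for label, compute_s in [("small model", 0.005), ("large model", 0.500)]:
    one = step_time(compute_s, SYNC_S, 1)
    three = step_time(compute_s, SYNC_S, 3)
    print(f"{label}: speedup with 3 executors = {one / three:.2f}x")
```

Under these assumed numbers the small model gets slower on 3 executors because the fixed synchronization cost exceeds the compute saved, while the large model gets close to 3x. Whether this matches a real cluster depends on the actual compute and communication costs, which would need to be measured.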