apli opened this issue 7 years ago
First, distributed training does not help in all cases. As you add more and more nodes to the cluster, communication cost increases. This is especially true if your model is large.
Second, you did not mention the batch size. Maybe you were comparing apples and oranges. Let's say you set batch size = 32. With 4 executors (and 1 GPU per executor), you are effectively getting a 4*32 = 128 batch size, so the 4-node cluster has 4X the workload of the 1-node cluster. If you set batch size = 32 for the single node and batch size = 8 for the 4-node cluster, then it is a fair comparison. But in the latter case, communication becomes the bottleneck, since the GPUs are likely idle most of the time, waiting to be fed.
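To make the arithmetic concrete, here is a tiny Python sketch (not CaffeOnSpark API; the function names are only illustrative) of how the effective batch size scales with the number of executors, and what per-GPU batch size would keep the comparison fair:

```python
# Illustrative only: effective (global) batch size under synchronous data-parallel
# training, assuming each executor processes its own per-GPU batch every step.

def effective_batch_size(per_gpu_batch, num_executors, gpus_per_executor=1):
    """Global batch size processed per synchronous training step."""
    return per_gpu_batch * num_executors * gpus_per_executor

def fair_per_gpu_batch(single_node_batch, num_executors, gpus_per_executor=1):
    """Per-GPU batch size that keeps the global batch equal to the single-node run."""
    return single_node_batch // (num_executors * gpus_per_executor)

print(effective_batch_size(32, 4))   # 128 -> 4X the single-node workload per step
print(fair_per_gpu_batch(32, 4))     # 8   -> same global batch, but small per-GPU batches
```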
Thanks, @junshi15. I think I get what you mean. Anyway, with more executors, I could set a bigger batch size (the actual batch size = batch size * number of executors) to make full use of the GPUs compared to a single node. Is that correct?
Another question: if I have two executors (1 GPU per executor), and the GPU of one is idle while the other is busy, does the time cost of training depend mainly on the training time of the busy executor, without considering communication?
This is synchronous training. The speed is limited by the slowest executor.
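As a rough illustration (a toy model, not CaffeOnSpark code), the wall-clock time of one synchronous step is bounded by the slowest executor plus communication overhead:

```python
# Toy simulation of one synchronous step: every executor must finish its
# forward/backward pass before gradients are combined, so the step takes
# as long as the slowest executor plus the communication time.

def synchronous_step_time(compute_times, comm_time):
    """Wall-clock time of one synchronous step across all executors."""
    return max(compute_times) + comm_time

executor_times = [0.20, 0.95]  # seconds; one fast executor, one slow/busy executor
print(synchronous_step_time(executor_times, comm_time=0.05))  # ~1.0 s, set by the slow executor
```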
What's the main factor that affects the communication, the bandwidth?
Bandwidth, latency, etc., depending on your network.
Just to clarify: does the accuracy improve when I don't decrease the batch size but increase the number of executors? If I understood correctly, more batches are processed then. Or is there any other measurable "benefit" when I don't decrease the batch size?
Training with the cifar10 dataset following the steps in GetStarted_yarn: