apli opened this issue 7 years ago
First, distributed training does not help in all cases. As you add more and more nodes to the cluster, communication cost increases. This is especially true if your model is large.
Second, you did not mention the batch size. Maybe you were comparing apples and oranges. Let's say you set batch size = 32. With 4 executors (and 1 GPU per executor), you are effectively getting a 4*32 = 128 batch size, so the 4-node cluster has 4X the workload of the 1-node cluster. If you set batch size = 32 for the single node and batch size = 8 for the 4-node cluster, then it is a fair comparison. But in the latter case, communication becomes the bottleneck, since the GPUs are likely idle most of the time, waiting to be fed.
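To make the arithmetic concrete, here is a tiny Python sketch (not CaffeOnSpark API; the function names are only illustrative) of how the effective batch size scales with the number of executors, and what per-GPU batch size would keep the comparison fair:

```python
# Illustrative only: effective (global) batch size under synchronous data-parallel
# training, assuming each executor processes its own per-GPU batch every step.

def effective_batch_size(per_gpu_batch, num_executors, gpus_per_executor=1):
    """Global batch size processed per synchronous training step."""
    return per_gpu_batch * num_executors * gpus_per_executor

def fair_per_gpu_batch(single_node_batch, num_executors, gpus_per_executor=1):
    """Per-GPU batch size that keeps the global batch equal to the single-node run."""
    return single_node_batch // (num_executors * gpus_per_executor)

print(effective_batch_size(32, 4))   # 128 -> 4X the single-node workload per step
print(fair_per_gpu_batch(32, 4))     # 8   -> same global batch, but small per-GPU batches
```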
Thanks, @junshi15. I think I get what you mean. Anyway, with more executors, I could set a bigger batch size (the actual batch size = batch size * number of executors) to make full use of the GPUs compared to a single node. Is that correct?
Another question: if I have two executors (1 GPU per executor), and the GPU of one is idle while the other is busy, does the time cost of training depend mainly on the training time of the busy executor, without considering communication?
This is synchronous training. The speed is limited by the slowest executor.
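As a rough illustration (a toy model, not CaffeOnSpark code), the wall-clock time of one synchronous step is bounded by the slowest executor plus communication overhead:

```python
# Toy simulation of one synchronous step: every executor must finish its
# forward/backward pass before gradients are combined, so the step takes
# as long as the slowest executor plus the communication time.

def synchronous_step_time(compute_times, comm_time):
    """Wall-clock time of one synchronous step across all executors."""
    return max(compute_times) + comm_time

executor_times = [0.20, 0.95]  # seconds; one fast executor, one slow/busy executor
print(synchronous_step_time(executor_times, comm_time=0.05))  # ~1.0 s, set by the slow executor
```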
What's the main factor that affects the communication, the bandwidth?
Bandwidth, latency, etc., depending on your network.
Just to clarify: does the accuracy improve when I don't decrease the batch size but increase the number of executors? If I understood correctly, more batches are processed then. Or is there any other measurable "benefit" when I don't decrease the batch size?
Training with the cifar10 dataset following the steps in GetStarted_yarn: