guyang88 opened 7 years ago
Assuming multiple GPUs per node and multiple nodes, there are two levels of exchange. Inside a node, each GPU computes its gradients based on its mini-batch and sends them to a root GPU; the root GPU averages them. Across nodes, the root GPUs send their averaged gradients to a master node's root GPU, which averages them and updates the weights. The weights are then broadcast back to each node's root GPU, and each root GPU broadcasts the weights inside its node.
All of this is done synchronously: no GPU is allowed to run the next batch until everybody has the updated weights.
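The two-level synchronous step described above can be simulated in a few lines of NumPy. This is a hypothetical sketch, not CaffeOnSpark's actual API: the function name `two_level_sync_step`, the plain-SGD update, and the nested-list data layout are all assumptions made for illustration.

```python
import numpy as np

def two_level_sync_step(weights, grads_per_node, lr=0.1):
    """One synchronous update, mirroring the two-level exchange.
    grads_per_node[n][g] is the gradient computed by GPU g on node n.
    (Illustrative sketch only; update rule is plain SGD.)"""
    # Level 1: each node's root GPU averages the gradients of its local GPUs.
    node_avgs = [np.mean(node_grads, axis=0) for node_grads in grads_per_node]
    # Level 2: the master node's root GPU averages the per-node averages
    # and applies the weight update.
    global_avg = np.mean(node_avgs, axis=0)
    new_weights = weights - lr * global_avg
    # Broadcast: every node root, then every GPU within each node,
    # receives the same updated weights before the next batch starts.
    return [[new_weights.copy() for _ in node_grads]
            for node_grads in grads_per_node]
```

After the step, every GPU holds an identical weight copy, which is exactly the synchronization barrier the comment describes.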
@junshi15 Thanks, but I have a question: why not use a parameter server? Wouldn't an asynchronous technique make training faster?
The sync version is simple to implement and verify, and we have no need for async training at the moment. In addition, we are limited by our resources. Your contribution is welcome.
@junshi15 @guyang88 Excuse me, I've been looking into this problem recently. In the source code (caffe-distri/src/main/cpp/util/socket_sync_cpu.cpp and rdma_sync.cpp), it seems that data is passed in slices rather than as full weights or gradients. Is that so? I'm a little confused; can you help me? Thank you!
@jacklonghui Regarding slicing: it is an efficient implementation of all-reduce. If all the clients sent their gradients to one node, that node would become a bottleneck. What is implemented in CaffeOnSpark is a ring algorithm, where each node sends and receives a portion of the entire gradient.
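The ring idea can be sketched as a simulation (this is not CaffeOnSpark's code; `ring_allreduce` and the chunk bookkeeping are illustrative assumptions). In each step every node passes one chunk, 1/N of the data, to its right neighbour, so bandwidth is spread evenly instead of converging on a single hub:

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulated ring all-reduce (sum) over one vector per node.

    Reduce-scatter phase: after n-1 steps, node i holds the fully
    reduced chunk (i+1) % n. All-gather phase: n-1 more steps circulate
    the reduced chunks so every node ends with the complete summed vector.
    """
    n = len(vectors)
    chunks = [np.array_split(v.astype(float), n) for v in vectors]
    for step in range(n - 1):                       # reduce-scatter
        # Snapshot what each node sends this step (all sends are concurrent).
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            left = (i - 1) % n                      # receive from left neighbour
            chunks[i][(left - step) % n] += sent[left]
    for step in range(n - 1):                       # all-gather
        sent = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            chunks[i][(left + 1 - step) % n] = sent[left]
    return [np.concatenate(c) for c in chunks]
```

With N nodes and G bytes of gradients, each node sends roughly 2G(N-1)/N bytes in total, versus the G(N-1) bytes a single aggregating node would have to receive in a naive gather.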
@junshi15 ok, thank you! I got it.
@junshi15 Hi, as you said above, I have several questions: (1) Is the master node a single node that is mainly responsible for inter-cluster scheduling, and does it skip iterative training, unlike the worker nodes? (2) In the code https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/util/socket_sync.cpp, each node sends and receives a portion of the entire gradient; is the weight the same on every node? (3) Besides, I'm still a little confused: it seems that gradients and weights are exchanged by all nodes sending and receiving in parallel, rather than being broadcast by a master node. Is that so?
1) Yes, it does training as well. 2) Everybody's gradients are different, since they are computed from each worker's own mini-batch. The gradients are then aggregated and applied to the weights, so at the end of an iteration everybody has the same weights. 3) In this implementation, everybody is a master (of a portion of the gradients/weights), and everybody is a slave (of the remaining portion of the gradients/weights).
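Point 3 can be illustrated with a toy NumPy sketch (hypothetical names and layout; `sliced_sgd_step` is not CaffeOnSpark's API, which partitions byte ranges inside its C++ buffers, but the ownership idea is the same):

```python
import numpy as np

def sliced_sgd_step(weights, per_node_grads, lr=0.1):
    """Each node i acts as 'master' of slice i of the weights: it reduces
    that slice of everyone's gradients and applies the update there, while
    acting as a 'slave' for the slices owned by the other nodes."""
    n = len(per_node_grads)
    w_slices = np.array_split(weights.astype(float), n)
    updated = []
    for i in range(n):
        # Node i's "master" duty: average slice i of every node's gradient.
        g_slice = np.mean([np.array_split(g, n)[i] for g in per_node_grads],
                          axis=0)
        updated.append(w_slices[i] - lr * g_slice)
    # Each node broadcasts its updated slice; after reassembly, every node
    # holds the same full weight vector.
    full = np.concatenate(updated)
    return [full.copy() for _ in range(n)]
```

No single node ever handles the full gradient by itself, yet all nodes finish the iteration with identical weights.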
@junshi15 ok, thank you! About these lines: "...The root gpus send averaged gradients to a master node's root gpu, which averages them, and updates the weights. The weights are broadcast to each node's root gpu, then the root gpu broadcasts the weights inside the nodes..."
Does the "master node" here exist conceptually on every node? If not, then there is a single "master node" that collects and processes the gradients everybody sends, and broadcasts the weights to everybody. Which node is the "master node"?
The lines you quoted are conceptually true, but what is implemented here is different. In this particular implementation, everybody is a master and a worker, so you can regard every node as both a master node and a worker node. On every node there are a MasterBuffer and a WorkerBuffer. When gradients are ready, one function is called; when an iteration starts, another is called. Please examine the code for details.
@junshi15 ok, thank you!
@anfeng @junshi15 How does CaffeOnSpark exchange and synchronize each executor's parameters?