guyang88 opened 7 years ago
Assuming multiple GPUs per node and multiple nodes, there are two levels of exchange. Inside a node, each GPU computes its gradients based on its mini-batch and sends them to a root GPU; the root GPU averages them. Across nodes, the root GPUs send their averaged gradients to a master node's root GPU, which averages them and updates the weights. The weights are then broadcast back to each node's root GPU, and each root GPU broadcasts the weights inside its node.
All of this is done synchronously: no GPU is allowed to run the next batch until everybody has the updated weights.
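The two-level synchronous step described above can be simulated in a few lines of NumPy. This is a hypothetical sketch, not CaffeOnSpark's actual API: the function name `two_level_sync_step`, the plain-SGD update, and the nested-list data layout are all assumptions made for illustration.

```python
import numpy as np

def two_level_sync_step(weights, grads_per_node, lr=0.1):
    """One synchronous update, mirroring the two-level exchange.
    grads_per_node[n][g] is the gradient computed by GPU g on node n.
    (Illustrative sketch only; update rule is plain SGD.)"""
    # Level 1: each node's root GPU averages the gradients of its local GPUs.
    node_avgs = [np.mean(node_grads, axis=0) for node_grads in grads_per_node]
    # Level 2: the master node's root GPU averages the per-node averages
    # and applies the weight update.
    global_avg = np.mean(node_avgs, axis=0)
    new_weights = weights - lr * global_avg
    # Broadcast: every node root, then every GPU within each node,
    # receives the same updated weights before the next batch starts.
    return [[new_weights.copy() for _ in node_grads]
            for node_grads in grads_per_node]
```

After the step, every GPU holds an identical weight copy, which is exactly the synchronization barrier the comment describes.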
@junshi15 Thanks, but I have a question: why not use a parameter server? Wouldn't an asynchronous technique make training faster?
The sync version is simple to implement and verify, and we have no need for async training at the moment. In addition, we are limited by our resources. Your contribution is welcome.
@junshi15 @guyang88 Excuse me, I've been looking into this problem recently. In the source code (caffe-distri/src/main/cpp/util/socket_sync_cpu.cpp and rdma_sync.cpp), it seems that data is passed in slices rather than as full weights or gradients. Is that so? I'm a little confused; can you help me? Thank you!
@jacklonghui Regarding slicing: it is an efficient implementation of all-reduce. If all the clients sent their gradients to one node, that node would become a bottleneck. What is implemented in CaffeOnSpark is a ring algorithm, where each node sends and receives a portion of the entire gradient.
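The ring idea can be sketched as a simulation (this is not CaffeOnSpark's code; `ring_allreduce` and the chunk bookkeeping are illustrative assumptions). In each step every node passes one chunk, 1/N of the data, to its right neighbour, so bandwidth is spread evenly instead of converging on a single hub:

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulated ring all-reduce (sum) over one vector per node.

    Reduce-scatter phase: after n-1 steps, node i holds the fully
    reduced chunk (i+1) % n. All-gather phase: n-1 more steps circulate
    the reduced chunks so every node ends with the complete summed vector.
    """
    n = len(vectors)
    chunks = [np.array_split(v.astype(float), n) for v in vectors]
    for step in range(n - 1):                       # reduce-scatter
        # Snapshot what each node sends this step (all sends are concurrent).
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            left = (i - 1) % n                      # receive from left neighbour
            chunks[i][(left - step) % n] += sent[left]
    for step in range(n - 1):                       # all-gather
        sent = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            chunks[i][(left + 1 - step) % n] = sent[left]
    return [np.concatenate(c) for c in chunks]
```

With N nodes and G bytes of gradients, each node sends roughly 2G(N-1)/N bytes in total, versus the G(N-1) bytes a single aggregating node would have to receive in a naive gather.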
@junshi15 ok, thank you! I got it.
@junshi15 Hi, as you said above, I have several questions: (1) Is the master node a single node that is mainly responsible for inter-cluster scheduling, and does it skip iterative training, unlike the worker nodes? (2) In the code https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/util/socket_sync.cpp, each node sends and receives a portion of the entire gradient; is the weight the same on every node? (3) Besides, I'm still a little confused: it seems that gradients and weights are exchanged by all nodes sending and receiving in parallel, rather than being broadcast by a master node. Is that so?
1) Yes, it does training as well. 2) Everybody's gradients are different, since they are computed from each worker's own mini-batch. The gradients are then aggregated and applied to the weights, so at the end of an iteration everybody has the same weights. 3) In this implementation, everybody is a master (of a portion of the gradients/weights), and everybody is a slave (of the remaining portion of the gradients/weights).
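Point 3 can be illustrated with a toy NumPy sketch (hypothetical names and layout; `sliced_sgd_step` is not CaffeOnSpark's API, which partitions byte ranges inside its C++ buffers, but the ownership idea is the same):

```python
import numpy as np

def sliced_sgd_step(weights, per_node_grads, lr=0.1):
    """Each node i acts as 'master' of slice i of the weights: it reduces
    that slice of everyone's gradients and applies the update there, while
    acting as a 'slave' for the slices owned by the other nodes."""
    n = len(per_node_grads)
    w_slices = np.array_split(weights.astype(float), n)
    updated = []
    for i in range(n):
        # Node i's "master" duty: average slice i of every node's gradient.
        g_slice = np.mean([np.array_split(g, n)[i] for g in per_node_grads],
                          axis=0)
        updated.append(w_slices[i] - lr * g_slice)
    # Each node broadcasts its updated slice; after reassembly, every node
    # holds the same full weight vector.
    full = np.concatenate(updated)
    return [full.copy() for _ in range(n)]
```

No single node ever handles the full gradient by itself, yet all nodes finish the iteration with identical weights.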
@junshi15 ok, thank you! About these lines: "...The root gpus send averaged gradients to a master node's root gpu, which averages them, and updates the weights. The weights are broadcast to each node's root gpu, then the root gpu broadcasts the weights inside the nodes..."
Does the "master node" here exist conceptually on every node? If not, then there is a single "master node" that collects and processes the gradients everybody sends, and broadcasts the weights to everybody. Which node is the "master node"?
The lines you quoted are conceptually true, but what is implemented here is different. In this particular implementation, everybody is a master and a worker, so you can regard every node as both a master node and a worker node. On every node there are a MasterBuffer and a WorkerBuffer. When gradients are ready, one function is called; when an iteration starts, another is called. Please examine the code for details.
@junshi15 ok, thank you!
@anfeng @junshi15 How does CaffeOnSpark exchange and synchronize each executor's parameters?