yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

How can we update the network weight when the parameters are synchronized? #267

Open jacklonghui opened 7 years ago

jacklonghui commented 7 years ago

Hi, when the parameters are synchronized, each node sends the weights held in its own parameter server to the other nodes and receives the weights sent by all of them. At that point every node has the same cache, holding one set of weights per node. As far as I know, CaffeOnSpark keeps only a single set of weights as the updated value, which is then used in the forward propagation of the next iteration. So how is that single set chosen from the per-node sets? Of course, my understanding of what each node sends and receives may be wrong. It is based on the source code in CaffeOnSpark/caffe-distri/include/util/*. How does CaffeOnSpark update the weights? Please help me, thank you!

junshi15 commented 7 years ago

https://github.com/yahoo/CaffeOnSpark/issues/262

jacklonghui commented 7 years ago

@junshi15 Thank you for your answer. Now I understand how the parameters are synchronized. Could you point me to the source code that implements the synchronization in detail? I would like to read the code alongside your explanation, because I'm interested in how CaffeOnSpark synchronizes. So far I have read the code in CaffeOnSpark/caffe-distri/include/util/ and CaffeOnSpark/caffe-distri/src/main/cpp/util/, and I understand how the nodes connect and communicate.

junshi15 commented 7 years ago

https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/util/rdma_sync.cpp
https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/util/socket_sync.cpp
https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/util/socket_sync_cpu.cpp

Look for on_gradients_ready() and on_start() in those files.
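
For readers who want a mental model before opening those files, here is a minimal sketch of one common synchronous scheme, in which each node owns a slice of the parameter vector, averages the gradients for that slice, updates it, and shares the result. This is an assumption for illustration only: it is not taken from the CaffeOnSpark sources, and the names `shard_bounds` and `sync_step` are hypothetical. Under such a scheme no node has to pick one weight set out of several, because every slice has exactly one owner.

```python
# Hypothetical illustration only; these names do not come from CaffeOnSpark.
# Nodes are simulated inside one process, so "send" and "receive" are just
# list reads and writes.

def shard_bounds(num_params, num_nodes, rank):
    """Return the [start, end) slice of the parameter vector owned by `rank`."""
    base, extra = divmod(num_params, num_nodes)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def sync_step(weights, local_grads, lr):
    """Run one synchronous update.

    weights     -- list of floats, identical on every node before the step
    local_grads -- one gradient list per node, each as long as `weights`
    lr          -- learning rate
    Returns the weight list that every node holds after the step.
    """
    num_nodes = len(local_grads)
    num_params = len(weights)
    new_weights = list(weights)
    for owner in range(num_nodes):
        start, end = shard_bounds(num_params, num_nodes, owner)
        for i in range(start, end):
            # The owner collects every node's gradient for index i and averages.
            avg_grad = sum(grads[i] for grads in local_grads) / num_nodes
            # Only the owner updates this index; the other nodes would then
            # receive the updated value from the owner.
            new_weights[i] = weights[i] - lr * avg_grad
    return new_weights


if __name__ == "__main__":
    # Two simulated nodes, three parameters.
    print(sync_step([1.0, 2.0, 3.0], [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]], 0.5))
```

Whether the functions named above follow this exact pattern should be checked against the linked files.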

jacklonghui commented 7 years ago

@junshi15 OK, thanks! May I email you in Chinese? Some problems have been bothering me, and I cannot express them accurately in English. In addition, based on this part of the source code, I drew a diagram of the parameter synchronization between the nodes, and I hope you can help me correct it. If possible, please send your email address to me at 2535113033@qq.com.