yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Core dump failures #289

Open ooyanglinoo opened 6 years ago

ooyanglinoo commented 6 years ago

If the solver config file contains mistakes, the cluster does not fail quickly. Instead, after a long time, it returns core dumps. How can I solve this problem?

junshi15 commented 6 years ago

Fix the solver prototxt file, I suppose.

junshi15 commented 6 years ago

You could run the solver file on the single-node version first, i.e. BVLC Caffe. Of course, you would need to change the network prototxt file accordingly (switch out the data layer, etc.). If the single-node version works, you can then try the grid version (switch the data layer back in, etc.).
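
Alongside a single-node run (with BVLC Caffe, typically `caffe train --solver=solver.prototxt`), a cheap pre-flight check can catch obvious solver mistakes before a job reaches the cluster. The following is only a rough sketch, not part of CaffeOnSpark: the function names `parse_fields` and `check_solver` are hypothetical, the checks (a `net`/`train_net` path that exists, a positive `base_lr` and `max_iter`) are assumptions about typical mistakes, and the regex scan is not a real protobuf text-format parser, so nested or inline `net_param` definitions are not understood.

```python
import os
import re

# Matches simple `key: value` lines in a solver prototxt. This is a rough
# line-based scan (an assumption for illustration), not a protobuf parser.
FIELD_RE = re.compile(
    r'^[ \t]*(\w+)[ \t]*:[ \t]*"?([^"#\n]+?)"?[ \t]*(?:#.*)?$',
    re.MULTILINE,
)


def parse_fields(text):
    """Collect top-level `key: value` pairs; repeated keys keep all values."""
    fields = {}
    for key, value in FIELD_RE.findall(text):
        fields.setdefault(key, []).append(value.strip())
    return fields


def check_solver(text, base_dir=".", exists=os.path.exists):
    """Return a list of human-readable problems found in solver prototxt text.

    `base_dir` (hypothetical parameter) is where relative net paths are
    resolved; `exists` is injectable so the check can be tested without files.
    """
    fields = parse_fields(text)
    problems = []

    # The solver should point at a network definition file.
    net_keys = [k for k in ("net", "train_net") if k in fields]
    if not net_keys:
        problems.append("no 'net' or 'train_net' field")
    for key in net_keys:
        for path in fields[key]:
            if not exists(os.path.join(base_dir, path)):
                problems.append(f"{key} file not found: {path}")

    # Numeric fields that commonly carry typos (heuristic choice).
    for key, cast in (("base_lr", float), ("max_iter", int)):
        if key not in fields:
            problems.append(f"missing '{key}'")
            continue
        try:
            if cast(fields[key][-1]) <= 0:
                problems.append(f"'{key}' must be positive")
        except ValueError:
            problems.append(f"'{key}' is not a valid {cast.__name__}")
    return problems
```

Running `check_solver(open("solver.prototxt").read())` on the submitting machine and refusing to submit when the returned list is non-empty would surface these errors in seconds rather than after a long wait on the cluster. It does not replace the single-node run, which exercises the real Caffe parser and the network definition.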

ooyanglinoo commented 6 years ago

Is it possible to solve the core dump problem by changing the CaffeProcessor code in CaffeOnSpark?