yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

MNIST example at one out of four worker nodes only #285

Closed: romanzac closed this issue 6 years ago

romanzac commented 6 years ago

Hello,

I am running the MNIST example in cluster mode on my 7-node YARN cluster. Four of the nodes have GPUs and serve as worker nodes. Unfortunately, execution is always scheduled on just one worker node. The selected node is not always the same one, so I assume all four nodes are configured correctly. Changing SPARK_WORKER_INSTANCES doesn't make any difference. Any ideas, please?

R.

```bash
export SPARK_WORKER_INSTANCES=4
export DEVICES=1

hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result

spark-submit --master yarn --deploy-mode cluster \
    --num-executors $SPARK_WORKER_INSTANCES \
    --files /root/CaffeOnSpark/data/lenet_memory_solver.prototxt,/root/CaffeOnSpark/data/lenet_memory_train_test.prototxt \
    --archives /root/CaffeOnSpark/CoS_libArchive.tar \
    --conf spark.driver.extraLibraryPath=$LD_LIBRARY_PATH:./CoS_libArchive.tar \
    --conf spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./CoS_libArchive.tar \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result

hadoop fs -ls hdfs:///mnist.model
hadoop fs -cat hdfs:///mnist_features_result/*
```
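
For reference, one way to check how many executors YARN actually granted, and on which hosts, is to query the application after submission. This is a rough sketch; `<application_id>` is a placeholder for the ID printed by spark-submit (or shown in the ResourceManager UI), and the last command requires YARN log aggregation to be enabled.

```bash
# Rough sketch: confirm how many executors YARN allocated and where they landed.
# <application_id> is a placeholder for the real application ID.
yarn application -list                                   # find the running application ID
yarn application -status <application_id>                # application report (state, resources)
yarn logs -applicationId <application_id> | grep -i executor   # look for executor launches per host
```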

junshi15 commented 6 years ago

It's a YARN problem: YARN scheduled all the executors on the same physical node, and you need to tell YARN to stop doing that.

One trick we used was executor memory. Say each node has 16GB of memory; then you can set the following in your spark-submit:

```bash
--conf spark.executor.memory=9g
```

Now no two executors can land on the same node, since that would require 18GB of memory. Of course, only about 7GB will be left free on each node (maybe less due to overhead).
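
Applied to the spark-submit from the first post, that would look roughly like the sketch below. The 9g figure assumes 16GB worker nodes; the point is just to pick a value above half of what YARN offers per node, so two executors cannot fit together. Also keep in mind that YARN adds a per-executor memory overhead on top of this value, so the actual container request is somewhat larger.

```bash
# Sketch: same submission as in the original post, with executor memory raised so
# that only one executor fits per 16GB node (9g is illustrative; adjust to your nodes).
spark-submit --master yarn --deploy-mode cluster \
    --num-executors $SPARK_WORKER_INSTANCES \
    --conf spark.executor.memory=9g \
    --files /root/CaffeOnSpark/data/lenet_memory_solver.prototxt,/root/CaffeOnSpark/data/lenet_memory_train_test.prototxt \
    --archives /root/CaffeOnSpark/CoS_libArchive.tar \
    --conf spark.driver.extraLibraryPath=$LD_LIBRARY_PATH:./CoS_libArchive.tar \
    --conf spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./CoS_libArchive.tar \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result
```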

romanzac commented 6 years ago

It worked for me. Thanks a lot. R.