Closed romanzac closed 7 years ago
It's a YARN problem: YARN scheduled all workers on the same physical node, and you need to tell YARN to stop doing that.
One trick we used was executor memory. Say each node has 16 GB of memory; you can set the following in your spark-submit:
--conf spark.executor.memory=9g
Now no two workers can fit on the same node, since that would require 18 GB. Of course, only about 7 GB is left on each node (maybe less due to overhead).
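The arithmetic behind the trick can be sketched as follows. This is a rough model, not Spark code: YARN sizes each executor container as the executor memory plus an overhead, which in Spark's defaults is the larger of 384 MiB or 10% of executor memory. The function name and the node sizes are illustrative assumptions.

```python
# Sketch of how a large spark.executor.memory forces YARN to spread executors:
# each container needs executor memory plus overhead (max(384 MiB, 10%)),
# so the number of containers that fit per node drops.
def executors_per_node(node_mem_gb, executor_mem_gb,
                       overhead_fraction=0.10, min_overhead_gb=0.384):
    container_gb = executor_mem_gb + max(min_overhead_gb,
                                         overhead_fraction * executor_mem_gb)
    return int(node_mem_gb // container_gb)

# With 16 GB nodes and 9 GB executors, only one container fits per node,
# so four executors must land on four different nodes.
print(executors_per_node(16, 9))  # -> 1
# A smaller executor lets several share one node, defeating the trick.
print(executors_per_node(16, 4))  # -> 3
```

The same idea works with any executor size strictly greater than half the node's usable memory.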
It worked for me. Thanks a lot. R.
Hello,
I run the MNIST example in cluster mode on my 7-node YARN cluster. Four of the nodes have GPUs and act as worker nodes. Unfortunately, execution is always scheduled on just one worker node. It is not always the same worker node that gets selected, so I assume all four nodes are configured correctly. Changing SPARK_WORKER_INSTANCES doesn't make any difference. Any ideas, please?
R.
export SPARK_WORKER_INSTANCES=4
export DEVICES=1
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
    --num-executors $SPARK_WORKER_INSTANCES \
    --files /root/CaffeOnSpark/data/lenet_memory_solver.prototxt,/root/CaffeOnSpark/data/lenet_memory_train_test.prototxt \
    --archives /root/CaffeOnSpark/CoS_libArchive.tar \
    --conf spark.driver.extraLibraryPath=$LD_LIBRARY_PATH:./CoS_libArchive.tar \
    --conf spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./CoS_libArchive.tar \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result
hadoop fs -ls hdfs:///mnist.model
hadoop fs -cat hdfs:///mnist_features_result/*