Hello @eaglecrown ,
Are both nodes in your cluster configured as workers?
Hello @arundasan91 , the cluster consists of two nodes. Both nodes are configured as workers, and the master is started on one of them.
Could you please go to the Spark Master WebUI in your browser and confirm that you have two workers and one master launched?
From Spark Documentation:
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
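For example, a worker can be attached to that URL like this (a sketch; the host and port are placeholders for whatever your master actually prints):

${SPARK_HOME}/sbin/start-slave.sh spark://your-master-host:7077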
Thanks, Arun
The Spark Master WebUI shows that two workers are launched. I wonder if I set the parameters correctly?
If you have two workers, you should be doing this:
export SPARK_WORKER_INSTANCES=2
and values to these according to your system
export CORES_PER_WORKER=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
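These values are then typically wired into the spark-submit call so that Spark actually requests that many executors. A minimal sketch, modeled on the mnist command used elsewhere in this thread (MASTER_URL is an assumption for your setup, i.e. the spark://HOST:PORT URL from the master WebUI):

# sketch: feed the worker/core counts into spark-submit
spark-submit --master ${MASTER_URL} \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-conf lenet_memory_solver.prototxt \
-clusterSize ${SPARK_WORKER_INSTANCES} \
-devices 1 \
-connection ethernet \
-model file:${CAFFE_ON_SPARK}/mnist_lenet.model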
Thanks, Arun
Hello @arundasan91 , I am very confused about the "clusterSize" parameter. In https://github.com/yahoo/CaffeOnSpark/wiki/CLI Andy Feng says it is only used by Spark YARN clusters. Can I also use it to specify the number of workers?
Just want to clarify:
Now I understand it. Thanks very much!
@dawuchen Hi, did you get this problem resolved? I'm also having the same issue now. I have 1 master and 8 workers, and the sample lenet training on mnist terminates with the message "Requested # of executors: 8 actual # of executors:1." In my case this error message, fortunately, comes after it generates the trained model file. I searched the logs of the master and workers and found that the trained model file is generated on one of the workers, seemingly at random, not on the master. But all 8 workers check whether they themselves have the trained model file (see the log below). This means only the one worker that stored the trained model file passes the file-existence check, and the other 7 workers get terminated because they don't have the file.
...
I0317 17:24:08.550050 31947 CaffeNet.cpp:325] Finetuning from /home/ubuntu/CaffeOnSpark/mnist_lenet.model
F0317 17:24:08.550176 31947 io.cpp:54] Check failed: fd != -1 (-1 vs. -1) File not found: /home/ubuntu/CaffeOnSpark/mnist_lenet.model
Check failure stack trace:
...
@hyunjong All workers need access to the model/snapshot files to resume/finetune the model.
Let's say the model file is saved at hdfs:///path/to/model/file; then each worker will copy the file from HDFS to a local directory.
If the model file path is given locally, such as file:///path/to/model/file, then no copying will happen since the file is local. However, you must make sure the model file exists on each worker node; otherwise, the worker won't be able to find the model.
Only the rank-0 worker (master node) generates the model. If the model path is on HDFS, then the file will be on HDFS. If the path is local, then it will be on the rank-0 worker.
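In practice the two cases look roughly like this (a sketch; the paths are placeholders):

# HDFS path: each executor copies the model/snapshot down from HDFS by itself
-model hdfs:///path/to/mnist_lenet.model

# Local path: the file must already exist at this exact location on every worker node,
# e.g. copied there beforehand or placed on a shared filesystem
-model file:///path/to/mnist_lenet.model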
The original issue seems to be solved. I am closing it.
I want to train the mnist example on 2 nodes, but I don't know how to set the parameters. My command is as follows. How should I set the parameters?
${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.cores.max=2 \
--conf spark.task.cpus=1 \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-clusterSize 2 \
-devices 1 \
-connection ethernet \
-model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
-output file:${CAFFE_ON_SPARK}/lenet_features_result
and the error shows:
17/02/20 21:19:30 ERROR CaffeOnSpark: Requested # of executors: 2 actual # of executors:1. Please try to set --conf spark.scheduler.maxRegisteredResourcesWaitingTime with a large value (default 30s)
Exception in thread "main" java.lang.IllegalStateException: actual number of executors is not as expected
    at com.yahoo.ml.caffe.CaffeOnSpark.features2(CaffeOnSpark.scala:468)
    at com.yahoo.ml.caffe.CaffeOnSpark.features(CaffeOnSpark.scala:429)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:54)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
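Based on the earlier replies in this thread and the hint in the error message itself, a possible adjustment looks like this (a sketch; 60s is an arbitrary example value, and spark.cores.max=2 assumes one core per executor on a two-node standalone cluster):

# start one worker per node before submitting
export SPARK_WORKER_INSTANCES=2

# in the spark-submit command, give the scheduler more time to register both executors
--conf spark.cores.max=2 \
--conf spark.task.cpus=1 \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=60s \
-clusterSize 2 \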