yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Error of CaffeOnSpark on a Spark standalone cluster #233

Closed eaglew94 closed 7 years ago

eaglew94 commented 7 years ago

I want to train the mnist example on 2 nodes, but I don't know how to set the parameters. My command is as follows. How should I set the parameters?

${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=2 \
    --conf spark.task.cpus=1 \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize 2 \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result

and the error shows:

17/02/20 21:19:30 ERROR CaffeOnSpark: Requested # of executors: 2 actual # of executors:1. Please try to set --conf spark.scheduler.maxRegisteredResourcesWaitingTime with a large value (default 30s)
Exception in thread "main" java.lang.IllegalStateException: actual number of executors is not as expected
    at com.yahoo.ml.caffe.CaffeOnSpark.features2(CaffeOnSpark.scala:468)
    at com.yahoo.ml.caffe.CaffeOnSpark.features(CaffeOnSpark.scala:429)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:54)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
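
(For reference, the pasted command is truncated at the front. On a standalone cluster the full invocation typically looks roughly like the sketch below; the `spark-submit --master`/`--files` prefix and the `spark://MASTER_HOST:7077` URL are assumptions rather than values taken from this thread, and the extra `--conf` line follows the waiting-time suggestion printed in the error above.)

```bash
# Sketch only: full submission for a 2-node standalone cluster.
# spark://MASTER_HOST:7077 is a placeholder master URL.
spark-submit --master spark://MASTER_HOST:7077 \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=2 \
    --conf spark.task.cpus=1 \
    --conf spark.scheduler.maxRegisteredResourcesWaitingTime=60s \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize 2 \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result
```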

arundasan91 commented 7 years ago

Hello @eaglew94,

Are both of the nodes in your cluster configured as workers?

eaglew94 commented 7 years ago

Hello @arundasan91, the cluster consists of two nodes. Both nodes are configured as workers, and the master is started on one of the nodes.

arundasan91 commented 7 years ago

Could you please go to the Spark Master WebUI in your browser and confirm that you have two workers and one master launched?

From Spark Documentation:

Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
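
For a standalone cluster that flow is roughly the following minimal sketch (`${SPARK_HOME}` and `MASTER_HOST` are placeholders; `start-master.sh`/`start-slave.sh` are the standard Spark standalone scripts):

```bash
# On the master node: prints its spark://HOST:PORT URL and serves the
# web UI at http://localhost:8080 by default (per the docs quoted above).
${SPARK_HOME}/sbin/start-master.sh

# On each worker node: register a worker with that master URL.
${SPARK_HOME}/sbin/start-slave.sh spark://MASTER_HOST:7077
```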

Thanks, Arun

eaglew94 commented 7 years ago

The Spark Master WebUI shows that two workers are launched. I wonder if I set the parameters correctly?

arundasan91 commented 7 years ago

If you have two workers, you should be doing this:

export SPARK_WORKER_INSTANCES=2

and set these to values appropriate for your system:

export CORES_PER_WORKER=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
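
A sketch of how these values are usually wired into the submission so that both workers actually get an executor (the mapping to `spark.cores.max` and `-clusterSize` mirrors the flags already used in this thread; it is an assumption about your setup, not something verified against it):

```bash
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

# These totals then size the job, e.g.:
#   spark-submit --conf spark.cores.max=${TOTAL_CORES} ... -clusterSize ${SPARK_WORKER_INSTANCES} ...
echo "workers=${SPARK_WORKER_INSTANCES} cores/worker=${CORES_PER_WORKER} spark.cores.max=${TOTAL_CORES}"
```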

Thanks, Arun

dawuchen commented 7 years ago

Hello @arundasan91, I am very confused about the "clusterSize" parameter. In https://github.com/yahoo/CaffeOnSpark/wiki/CLI, Andy Feng says that it is only used by Spark YARN clusters. Can I also use it to specify the number of workers?

anfeng commented 7 years ago

Just want to clarify:

dawuchen commented 7 years ago

Now I know it. Thanks very much!

hyunjong commented 7 years ago

@dawuchen Hi, did you get this problem resolved? I'm also having the same issue now. I have 1 master and 8 workers, and the sample lenet training on mnist terminates with the message "Requested # of executors: 8 actual # of executors:1." In my case this error message, fortunately, comes after the trained model file is generated. I searched the logs of the master and workers and found that the trained model file is generated on one of the workers (seemingly at random), not on the master. But ALL 8 workers check whether they have the trained model file (see the log below). That means only the one worker that stored the trained model file could pass the file-existence check, and the other 7 workers got terminated because they don't have the file.

...
I0317 17:24:08.550050 31947 CaffeNet.cpp:325] Finetuning from /home/ubuntu/CaffeOnSpark/mnist_lenet.model
F0317 17:24:08.550176 31947 io.cpp:54] Check failed: fd != -1 (-1 vs. -1) File not found: /home/ubuntu/CaffeOnSpark/mnist_lenet.model
Check failure stack trace:
...

junshi15 commented 7 years ago

@hyunjong All workers need access to the model/snapshot files to resume/finetune the model.

Let's say the model file is saved at hdfs:///path/to/model/file; then each worker will copy the file from HDFS to a local directory.

If the model file path is given as a local path, such as file:///path/to/model/file, then no copying happens since the file is local. However, you must make sure the model file exists on each worker node; otherwise, a worker won't be able to find the model.

Only the rank-0 worker (master node) generates the model. If the model path is on HDFS, then the file will be on HDFS. If the path is local, then it will be on the rank-0 worker.
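
A sketch of the two options described above (the `WORKER_HOSTS` variable and the HDFS destination directory are hypothetical placeholders, not values from this thread):

```bash
MODEL=${CAFFE_ON_SPARK}/mnist_lenet.model

# Option 1 (local path): make file:${MODEL} valid on every worker by copying
# the model from the node that produced it. WORKER_HOSTS is a hypothetical,
# space-separated list of the other worker hostnames.
for host in ${WORKER_HOSTS}; do
    scp "${MODEL}" "${host}:${MODEL}"
done
# then pass:  -model file:${MODEL}

# Option 2 (HDFS path): put the model on HDFS so each worker copies it down itself.
hdfs dfs -put -f "${MODEL}" /path/to/model/mnist_lenet.model
# then pass:  -model hdfs:///path/to/model/mnist_lenet.model
```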

junshi15 commented 7 years ago

The original issue seems to be solved. I am closing it.