libeiUCAS opened this issue 7 years ago
I met the same issue; I was using the Docker image.
@libeiUCAS, @GoodJoey, please run the `spark-submit` command from the `$CAFFE_ON_SPARK` folder.
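For example (a minimal sketch; only the working directory matters here, and the full training command is Step 8 of the GetStarted_yarn wiki referenced below):

```bash
# Launch from the CaffeOnSpark root so relative paths such as
# data/lenet_memory_solver.prototxt resolve correctly
cd ${CAFFE_ON_SPARK}
spark-submit ...   # the full Step 8 command from the wiki
```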
A complete guide to getting the Docker container working:
While building the Docker container, you should be able to see in the build logs that training actually works:
Once you launch the Docker container, verify it by running `jps`. You should see similar output:
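The exact PIDs and daemon list depend on the image, but on a single-node Hadoop/YARN setup inside the container you would typically expect something like this (illustrative output, not copied from the image):

```
$ jps
1234 NameNode
1356 DataNode
1478 ResourceManager
1590 NodeManager
1702 Jps
```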
Do Step 7 in https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn.
Make sure you edit the prototxt files: both `data/lenet_memory_train_test.prototxt` and `data/lenet_memory_solver.prototxt`.
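In particular, the `source` fields in the data layers of `lenet_memory_train_test.prototxt` must point at data every executor can reach. A minimal sketch of the edit (the local and HDFS paths here are assumptions; substitute the ones from your setup):

```bash
# Hypothetical paths: rewrite the local file: URIs to HDFS URIs so that
# every YARN container can read the LMDB datasets
sed -i 's|file:/root/CaffeOnSpark/data/mnist_train_lmdb|hdfs:///mnist_train_lmdb|' \
    ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt
sed -i 's|file:/root/CaffeOnSpark/data/mnist_test_lmdb|hdfs:///mnist_test_lmdb|' \
    ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt
```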
`cd ${CAFFE_ON_SPARK}`
Do Step 8 in https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn.
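For reference, the Step 8 training command looks roughly like this (a sketch following the wiki; the jar path and flags may differ in your checkout):

```bash
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result
```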
Your training has now started, and you should see similar output in the logs.
I just created my own Docker container and checked this. The Hadoop download links seem to be broken; I've corrected them in PR #280.
I get the same problem as #247, and changed the source location in `lenet_memory_train_test` to the HDFS path per @arundasan91's suggestion. However, I still hit the same problem:

```
17/05/03 18:50:15 INFO yarn.Client: Application report for application_1493801577689_0009 (state: RUNNING)
17/05/03 18:50:16 INFO yarn.Client: Application report for application_1493801577689_0009 (state: FINISHED)
17/05/03 18:50:16 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.191.3
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1493808511908
     final status: FAILED
     tracking URL: http://sky:8088/proxy/application_1493801577689_0009/
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1493801577689_0009 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1029)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1076)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/05/03 18:50:16 INFO util.ShutdownHookManager: Shutdown hook called
17/05/03 18:50:16 INFO util.ShutdownHookManager: Deleting directory /home/hadoop/deep_learning/spark-1.6.0-bin-hadoop2.6/spark-2136e9ab-1b64-4d32-85d0-a6eb6fce0ea1
```
I have two machines: 192.168.191.2 is the master (32 GB, 8 cores) and 192.168.191.3 is the slave (32 GB, 8 cores). As Step 8 says, I set:

```
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
```

The error on the log page is:

```
Diagnostics: User class threw exception: java.lang.IllegalStateException: actual number of executors is not as expected
```
When I change this to `export SPARK_WORKER_INSTANCES=2` and `export DEVICES=2`, the error on the log page becomes:

```
Diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, trc): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 46.3 GB of 4.2 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
```