yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Exception in thread "main" org.apache.spark.SparkException: Application application_ finished with failed status #247 #254

Open libeiUCAS opened 7 years ago

libeiUCAS commented 7 years ago

I get the same problem as #247, and changed the source location in lenet_memory_train_test to the HDFS path as @arundasan91 suggested. However, I still hit the same problem.

    17/05/03 18:50:15 INFO yarn.Client: Application report for application_1493801577689_0009 (state: RUNNING)
    17/05/03 18:50:16 INFO yarn.Client: Application report for application_1493801577689_0009 (state: FINISHED)
    17/05/03 18:50:16 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 192.168.191.3
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1493808511908
         final status: FAILED
         tracking URL: http://sky:8088/proxy/application_1493801577689_0009/
         user: hadoop
    Exception in thread "main" org.apache.spark.SparkException: Application application_1493801577689_0009 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1029)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1076)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    17/05/03 18:50:16 INFO util.ShutdownHookManager: Shutdown hook called
    17/05/03 18:50:16 INFO util.ShutdownHookManager: Deleting directory /home/hadoop/deep_learning/spark-1.6.0-bin-hadoop2.6/spark-2136e9ab-1b64-4d32-85d0-a6eb6fce0ea1

I have two machines: 192.168.191.2 is the master (32 GB, 8 cores) and 192.168.191.3 is the slave (32 GB, 8 cores).
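(Note for anyone debugging this: the client-side trace above only reports FAILED with diagnostics N/A; the actual exception is in the YARN container logs. One way to pull them, using the standard YARN CLI and the application ID from the report:

```bash
# Fetch the aggregated container logs for the failed run; the driver's
# stderr usually contains the real exception.
yarn logs -applicationId application_1493801577689_0009
```

The tracking URL printed by the client leads to the same logs in the ResourceManager web UI.)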

As Step 8 says, I set:

    export SPARK_WORKER_INSTANCES=2
    export DEVICES=1

The error on the log page is "Diagnostics: User class threw exception: java.lang.IllegalStateException: actual number of executors is not as expected".
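(For reference, CaffeOnSpark throws that IllegalStateException when the number of executors that actually register differs from the cluster size it was asked for, so SPARK_WORKER_INSTANCES must agree both with what spark-submit requests and with what YARN can actually schedule. A sketch of a consistent setup, per the GetStarted_yarn wiki and abbreviated — keep the --conf spark.driver.extraLibraryPath and spark.executorEnv.LD_LIBRARY_PATH settings from the wiki's full command:

```bash
# One executor per worker; --num-executors must match this value.
export SPARK_WORKER_INSTANCES=2
export DEVICES=1

spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} -connection ethernet \
    -model hdfs:///mnist.model
```

If YARN cannot fit two executor containers, for example because the nodes lack free memory, fewer executors register and the check fails; in that case try SPARK_WORKER_INSTANCES=1 with --num-executors 1 first.)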

When I change the settings to:

    export SPARK_WORKER_INSTANCES=2
    export DEVICES=2

the error on the log page is "Diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, trc): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 46.3 GB of 4.2 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead. Driver stacktrace:"

GoodJoey commented 6 years ago

I met the same issue; I was using the Docker image.

arundasan91 commented 6 years ago

@libeiUCAS , @GoodJoey ,

Please run the spark-submit command from the $CAFFE_ON_SPARK folder.

A complete guide to get the docker container working:

  1. While building the docker container, you should be able to see in the build logs that training actually works:

    [screenshot: training iterations visible in the docker build logs]
  2. Once you launch the docker container, verify it by running jps. You should see similar output.

    [screenshot: expected jps output inside the container]
  3. Do Step 7 in https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn.

  4. Make sure you edit the prototxt files: both data/lenet_memory_train_test.prototxt and data/lenet_memory_solver.prototxt (a sketch of the usual edit follows this list).

    [screenshot: edited prototxt files]
  5. cd ${CAFFE_ON_SPARK}

  6. Do Step 8 in https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn.

  7. Your training has now started. You should see similar output:

    [screenshot: training output after spark-submit]
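On step 4, the edit that usually matters is making the LMDB source paths readable from every executor, not just from the machine where you run spark-submit. A sketch, assuming the MNIST LMDBs are uploaded to a hypothetical /projects HDFS location (adjust both paths to your setup):

```bash
# Upload the LMDB directories somewhere all executors can read
# (the /projects/mnist_* paths here are just an example).
hadoop fs -mkdir -p /projects
hadoop fs -put ${CAFFE_ON_SPARK}/data/mnist_train_lmdb /projects/mnist_train_lmdb
hadoop fs -put ${CAFFE_ON_SPARK}/data/mnist_test_lmdb  /projects/mnist_test_lmdb

# Point both "source" fields of the train/test prototxt at HDFS.
sed -i \
    -e 's|source: ".*mnist_train_lmdb"|source: "hdfs:///projects/mnist_train_lmdb"|' \
    -e 's|source: ".*mnist_test_lmdb"|source: "hdfs:///projects/mnist_test_lmdb"|' \
    ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt
```

A file:// path only works if the same directory exists on every node; pointing at HDFS sidesteps that, which is why #247 was resolved the same way.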

I just created my own docker container and checked this. The Hadoop download links seem to be broken; I've corrected them in PR #280.