yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

[Docker] Exception in thread "main" org.apache.spark.SparkException: Application application_1491947096571_0004 finished with failed status #247

Closed baoruxiao closed 7 years ago

baoruxiao commented 7 years ago

I built the standalone CPU Docker image and ran the following:

docker run -it caffeonspark:cpu /etc/bootstrap.sh -bash

I followed Step 7 of GetStarted_yarn and got the following error: [screenshot 2017-04-11 5 04 38]

The following is the output of "env": [screenshot 2017-04-11 5 05 53]

Has anyone else hit the same error when running the Docker image? I'm really new to Spark, YARN, and Hadoop.

arundasan91 commented 7 years ago

Hmm! Before trying Step 7, run cd $CAFFE_ON_SPARK. Maybe that is the issue.

I did the steps now (created a new image and everything) and it works fine for me.

[screenshots: 2017-04-11 at 8 26 17 pm, 8 32 08 pm, 8 32 18 pm]

baoruxiao commented 7 years ago

Thanks @arundasan91. The error appears when I run Step 8 (spark-submit). I tried cd-ing into $CAFFE_ON_SPARK first, but I still get the same error. Can you run the MNIST training without errors?

baoruxiao commented 7 years ago

Hi @arundasan91, I have reproduced the error on different machines (a CPU-only machine and a machine with both CPU and GPU). I simply followed the instructions to build and run the Docker image/container and then followed GetStarted_yarn.

arundasan91 commented 7 years ago

Please show the commands that you ran. Also, have you updated data/lenet_memory_solver.prototxt and data/lenet_memory_train_test.prototxt with the correct HDFS path?

arundasan91 commented 7 years ago

Please also make sure you have the datasets downloaded (you should have them already, but please double-check):

root@998ed7494366:/opt/CaffeOnSpark# hadoop fs -ls /projects/machine_learning/image_dataset/mnist_test_lmdb
17/04/12 16:06:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup   10338304 2017-04-12 01:25 /projects/machine_learning/image_dataset/mnist_test_lmdb/data.mdb
-rw-r--r--   1 root supergroup       8192 2017-04-12 01:25 /projects/machine_learning/image_dataset/mnist_test_lmdb/lock.mdb
baoruxiao commented 7 years ago

Yes, I changed lenet_memory_solver.prototxt, but I did not change lenet_memory_train_test.prototxt. (Do I need to?)

Also, I have the datasets ready in HDFS: [screenshot 2017-04-12 11 11 42]

Following are my commands:

docker build -t caffeonspark:cpu standalone/cpu
docker run -it caffeonspark:cpu /etc/bootstrap.sh -bash
cd $CAFFE_ON_SPARK
hadoop fs -mkdir -p /projects/machine_learning/image_dataset
${CAFFE_ON_SPARK}/scripts/setup-mnist.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_*_lmdb hdfs:/projects/machine_learning/image_dataset/
vim data/lenet_memory_solver.prototxt # change mode from GPU to CPU

[screenshot 2017-04-12 11 06 27]

arundasan91 commented 7 years ago

Please change the source location in lenet_memory_train_test.prototxt to the HDFS path. For example:

source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
  source: "hdfs:/projects/machine_learning/image_dataset/mnist_train_lmdb"
  batch_size: 64
  channels: 1
  height: 28
  width: 28
  share_in_parallel: false
}

Do the same for the test source.
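
If you would rather not edit the file by hand, the change can be sketched with sed. This is only a rough sketch: the helper name rewrite_sources and the /tmp output path are made up for illustration and are not part of CaffeOnSpark, and it assumes the train and test source lines end in mnist_train_lmdb and mnist_test_lmdb. The HDFS directory is the one used in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: point the MNIST LMDB `source:` entries of a prototxt at HDFS.
# HDFS_DIR is the directory used in this thread; adjust it to your cluster.
HDFS_DIR="hdfs:/projects/machine_learning/image_dataset"

# rewrite_sources is a hypothetical helper, not something CaffeOnSpark ships.
# It reads prototxt text on stdin and writes the rewritten text to stdout.
# Only lines of the form  source: "<anything>mnist_train_lmdb"  or
# source: "<anything>mnist_test_lmdb"  are changed; all other lines pass through.
rewrite_sources() {
  sed -E "s#(source:[[:space:]]*\")[^\"]*(mnist_(train|test)_lmdb\")#\1${HDFS_DIR}/\2#"
}

# Example use (review the result before replacing the original file):
#   rewrite_sources < data/lenet_memory_train_test.prototxt > /tmp/lenet_memory_train_test.prototxt
#   diff data/lenet_memory_train_test.prototxt /tmp/lenet_memory_train_test.prototxt
```

Running it on a file that already has the HDFS path leaves the file unchanged, so it is safe to re-run.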

baoruxiao commented 7 years ago

Yes, that was the problem! Thanks! I suggest adding this to the Docker README. I will close this issue.

arundasan91 commented 7 years ago

Awesome. Please do. Please close the issue once you are confident.