yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

How to train a DNN using CaffeOnSpark #227

Open malia05 opened 7 years ago

malia05 commented 7 years ago

I successfully installed spark-1.6.1-bin-hadoop2.4, CaffeOnSpark, and the MNIST dataset. I then adjusted ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt to use absolute paths, such as:

"file:/home/inf/CaffeOnSpark/caffe-public/examples/mnist/mnist_train_lmdb/"
"file:/home/inf/CaffeOnSpark/caffe-public/examples/mnist/mnist_trest_lmdb/"

My question is: how do I train a DNN using CaffeOnSpark with 2 Spark executors over an Ethernet connection? Is it necessary to configure Spark's "spark-env" file for CaffeOnSpark?

I submitted a job in standalone mode to train a DNN on the MNIST data. I ran this command from the Spark directory:

./bin/spark-submit --master local[4] \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" \
    --conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result

I get this message:

Warning: Local jar /home/inf/spark-1.6.1-bin-hadoop2.4/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar does not exist, skipping.
java.lang.ClassNotFoundException: com.yahoo.ml.caffe.CaffeOnSpark
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Could anyone please clarify this problem? Thanks so much.

arundasan91 commented 7 years ago

Hello @malia05,

Question: Are you running CoS on a Mac?

Since you are getting this:

Warning: Local jar /home/inf/spark-1.6.1-bin-hadoop2.4/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar does not exist

I am pretty sure that some of your paths are set incorrectly. From the logs I understand that you installed CoS into /home/inf/CaffeOnSpark, but the line above says it is looking for the file CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar inside the /home/inf/spark-1.6.1-bin-hadoop2.4/ folder, which is wrong.
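To illustrate why the jar was looked up in the wrong folder: the submitted command passes the jar as a relative path, and a relative path is resolved against the directory the command is run from. The sketch below mimics that resolution with a hypothetical helper, resolve_from (not part of Spark or CaffeOnSpark), using the two directories that appear in this thread.

```shell
# Hypothetical helper (illustration only): print where a path would
# resolve when a command is launched from a given working directory.
resolve_from() {
    # $1 = working directory, $2 = path as typed on the command line
    case "$2" in
        /*) printf '%s\n' "$2" ;;               # absolute path: working directory is irrelevant
        *)  printf '%s/%s\n' "${1%/}" "$2" ;;   # relative path: appended to the working directory
    esac
}

JAR=caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar

# Relative path, launched from the Spark folder: lands inside the Spark folder.
resolve_from /home/inf/spark-1.6.1-bin-hadoop2.4 "CaffeOnSpark/${JAR}"

# Absolute path (what ${CAFFE_ON_SPARK}/... would expand to if
# CAFFE_ON_SPARK=/home/inf/CaffeOnSpark): unaffected by the launch folder.
resolve_from /home/inf/spark-1.6.1-bin-hadoop2.4 "/home/inf/CaffeOnSpark/${JAR}"
```

The first call reproduces the path from the warning; the second shows that prefixing the jar with ${CAFFE_ON_SPARK} would avoid the dependence on the launch directory, assuming that variable points at the real install location.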

Please check your CAFFE_ON_SPARK and DYLD_LIBRARY_PATH variables. Since you are using DYLD_LIBRARY_PATH, I am assuming that your machine is a Mac. Please verify that you are running this:

pushd ${CAFFE_ON_SPARK}/data
rm -rf ${CAFFE_ON_SPARK}/mnist_lenet.model
rm -rf ${CAFFE_ON_SPARK}/lenet_features_result
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" \
    --conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result
ls -l ${CAFFE_ON_SPARK}/mnist_lenet.model
cat ${CAFFE_ON_SPARK}/lenet_features_result/*
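
The command above relies on several variables that this thread never defines (MASTER_URL, TOTAL_CORES, CORES_PER_WORKER, SPARK_WORKER_INSTANCES). Note also that the original submission used --master local[4], which runs Spark inside a single local process rather than on separate executors. A minimal sketch of setting these variables for two workers and checking the paths before submitting might look like this. The concrete values (two workers, one core each, a standalone master on this host at Spark's default port 7077) and the helper name check_cos_env are assumptions for illustration, not taken from the thread.

```shell
#!/bin/sh
# Assumed values for a two-worker standalone cluster; adjust to your setup.
SPARK_WORKER_INSTANCES=2
CORES_PER_WORKER=1
TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))
# Assumes the standalone master runs on this host at the default port 7077.
MASTER_URL="spark://$(hostname 2>/dev/null || echo localhost):7077"

# Hypothetical helper: verify that CAFFE_ON_SPARK is set and that the files
# referenced by the spark-submit command exist under it.
check_cos_env() {
    cos_status=0
    if [ -z "${CAFFE_ON_SPARK:-}" ]; then
        echo "CAFFE_ON_SPARK is not set" >&2
        return 1
    fi
    for cos_file in \
        caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        data/lenet_memory_solver.prototxt \
        data/lenet_memory_train_test.prototxt
    do
        if [ ! -f "${CAFFE_ON_SPARK}/${cos_file}" ]; then
            echo "missing: ${CAFFE_ON_SPARK}/${cos_file}" >&2
            cos_status=1
        fi
    done
    return "$cos_status"
}

if check_cos_env; then
    echo "Paths look fine; submit with --master ${MASTER_URL} and spark.cores.max=${TOTAL_CORES}"
else
    echo "Fix the paths reported above before running spark-submit" >&2
fi
```

Running the check first turns a missing jar into an explicit "missing: ..." message instead of a ClassNotFoundException later in spark-submit.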