yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0
1.27k stars 358 forks

Error in Train a DNN network using CaffeOnSpark with 2 Spark executors #226

Open malia05 opened 7 years ago

malia05 commented 7 years ago

Hi, after successfully building CaffeOnSpark, I ran:

./spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result

This produced the following error:

Error: Cannot load main class from JAR file:/data/lenet_memory_solver.prototxt,/data/lenet_memory_train_test.prototxt
Run with --help for usage help or --verbose for debug output

Can someone help me, please? Thanks so much.

baristahell commented 7 years ago

It seems spark-submit is mixing up the prototxt files and caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar, treating the prototxt list as the application JAR. What are your environment variables set to? What kind of cluster are you working on? The default one from the wiki page?

EDIT: I guess your CAFFE_ON_SPARK variable isn't set; it should be something like /opt/CaffeOnSpark. That would explain why the paths in the error start at /data instead of at the CaffeOnSpark directory.
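
One way to test that guess before resubmitting is to check that every variable the command expands is actually set. The sketch below is only an illustration: `check_env` is a hypothetical helper name, not part of Spark or CaffeOnSpark, and the variable list is taken from the command in the question. An empty MASTER_URL may also be involved, since with nothing after --master, spark-submit would likely consume --files as the master value and treat the prototxt list as the application JAR, but the thread does not confirm this.

```shell
# Hypothetical helper (not part of Spark or CaffeOnSpark):
# report every named variable that is unset or empty.
check_env() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset or empty: $name" >&2
            missing=1
        fi
    done
    # Exit status 0 only if every variable had a value.
    return "$missing"
}

# Variables expanded by the spark-submit command in the question.
if check_env MASTER_URL CAFFE_ON_SPARK TOTAL_CORES CORES_PER_WORKER \
             SPARK_WORKER_INSTANCES LD_LIBRARY_PATH; then
    echo "all variables set; safe to run spark-submit"
else
    echo "set the variables listed above, then retry" >&2
fi
```

Running this in the same shell session that launches spark-submit matters, because variables exported in another terminal or set in a script that was run rather than sourced will not be visible to it.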