githubier opened this issue 7 years ago
I modified the spark-submit example from the wiki as shown below, in order to use CaffeNet for training:

spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/solver.prototxt,${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices 1 \
    -connection ethernet \
    -model file:${CAFFE_ON_SPARK}/data/myself/myself_caffenet.model \
    -output file:${CAFFE_ON_SPARK}/data/myself/myself_result
The solver.prototxt and train_val.prototxt are in ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet, and my data is in ${CAFFE_ON_SPARK}/data/myself. I would like to know whether my change is right. When I run the above command, it shows me the error below:

17/01/11 14:27:19 INFO spark.SparkContext: Running Spark version 1.5.1
17/01/11 14:27:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/11 14:27:19 WARN spark.SparkConf: SPARK_WORKER_INSTANCES was detected (set to '1'). This is deprecated in Spark 1.0+.
Please instead use:
17/01/11 14:27:19 INFO spark.SecurityManager: Changing view acls to: master
17/01/11 14:27:19 INFO spark.SecurityManager: Changing modify acls to: master
17/01/11 14:27:19 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(master); users with modify permissions: Set(master)
17/01/11 14:27:20 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/01/11 14:27:20 INFO Remoting: Starting remoting
17/01/11 14:27:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.102:46792]
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'sparkDriver' on port 46792.
17/01/11 14:27:20 INFO spark.SparkEnv: Registering MapOutputTracker
17/01/11 14:27:20 INFO spark.SparkEnv: Registering BlockManagerMaster
17/01/11 14:27:20 INFO storage.DiskBlockManager: Created local directory at /home/master/Downloads/spark_sdk/spark-1.5.1/blockmgr-0a1f42d3-9d6e-441d-b223-0f7b60df7607
17/01/11 14:27:20 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
17/01/11 14:27:20 INFO spark.HttpFileServer: HTTP File server directory is /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/httpd-1aed9920-fae1-46b8-a4d1-66ffaba49548
17/01/11 14:27:20 INFO spark.HttpServer: Starting HTTP Server
17/01/11 14:27:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/01/11 14:27:20 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36673
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'HTTP file server' on port 36673.
17/01/11 14:27:20 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/01/11 14:27:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/01/11 14:27:20 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
17/01/11 14:27:20 INFO ui.SparkUI: Started SparkUI at http://192.168.1.102:4040
17/01/11 14:27:20 INFO spark.SparkContext: Added JAR file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at http://192.168.1.102:36673/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1484116040832
17/01/11 14:27:20 INFO util.Utils: Copying /home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/solver.prototxt to /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/userFiles-c70014b3-28ad-4b30-919f-35815b830b2f/solver.prototxt
17/01/11 14:27:20 INFO spark.SparkContext: Added file file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/solver.prototxt at http://192.168.1.102:36673/files/solver.prototxt with timestamp 1484116040918
17/01/11 14:27:20 INFO util.Utils: Copying /home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt to /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/userFiles-c70014b3-28ad-4b30-919f-35815b830b2f/train_val.prototxt
17/01/11 14:27:20 INFO spark.SparkContext: Added file file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt at http://192.168.1.102:36673/files/train_val.prototxt with timestamp 1484116040924
17/01/11 14:27:20 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Connecting to master spark://master:7077...
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20170111142721-0004
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/0 on worker-20170111005531-192.168.1.104-50421 (192.168.1.104:50421) with 2 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/0 on hostPort 192.168.1.104:50421 with 2 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/1 on worker-20170111005518-192.168.1.103-41380 (192.168.1.103:41380) with 1 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/1 on hostPort 192.168.1.103:41380 with 1 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/2 on worker-20170111135519-192.168.1.102-38652 (192.168.1.102:38652) with 1 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/2 on hostPort 192.168.1.102:38652 with 1 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/0 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/2 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/1 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/0 is now RUNNING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/1 is now RUNNING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/2 is now RUNNING
17/01/11 14:27:21 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44782.
17/01/11 14:27:21 INFO netty.NettyBlockTransferService: Server created on 44782
17/01/11 14:27:21 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/01/11 14:27:21 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:44782 with 530.3 MB RAM, BlockManagerId(driver, 192.168.1.102, 44782)
17/01/11 14:27:21 INFO storage.BlockManagerMaster: Registered BlockManager
17/01/11 14:27:22 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.103:38869/user/Executor#-249835689]) with ID 1
17/01/11 14:27:22 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.103:37499 with 530.3 MB RAM, BlockManagerId(1, 192.168.1.103, 37499)
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.102:38675/user/Executor#1450319069]) with ID 2
17/01/11 14:27:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:38538 with 530.3 MB RAM, BlockManagerId(2, 192.168.1.102, 38538)
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.104:39565/user/Executor#-1870584609]) with ID 0
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0
17/01/11 14:27:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.104:59533 with 530.3 MB RAM, BlockManagerId(0, 192.168.1.104, 59533)
Exception in thread "main" java.io.FileNotFoundException: solver.prototxt (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>
Can someone help me?
Were you able to run the examples in the wiki? Your command appears to be correct, but Spark was complaining about not being able to find solver.prototxt.
Yes, I have run the example successfully. But I don't know why this command fails, since solver.prototxt is present at the path given in my command.
Glad to know the examples worked for you. I don't know why your command failed.
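For anyone hitting the same FileNotFoundException: a minimal sanity check, assuming the same ${CAFFE_ON_SPARK} layout used above, is to confirm that each file passed to --files actually exists on the host running spark-submit, since those paths are resolved locally before the files are shipped to executors:

# Verify the prototxt files exist locally before submitting;
# --files paths are resolved on the machine running spark-submit.
ls -l ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/solver.prototxt \
      ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt

# Also confirm CAFFE_ON_SPARK is actually set in the submitting shell.
echo ${CAFFE_ON_SPARK}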
Thank you for your help. I moved my solver.prototxt and train_val.prototxt to ${CAFFE_ON_SPARK}/data/, so the spark-submit command is now:

spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/solver.prototxt,${CAFFE_ON_SPARK}/data/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices 1 \
    -connection ethernet \
    -model file:${CAFFE_ON_SPARK}/myself_caffenet.model \
    -output file:${CAFFE_ON_SPARK}/myself_result
However, there is another error:

17/01/11 20:49:34 ERROR caffe.DataSource$: source_class must be defined for input data layer:Data
Exception in thread "main" java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:103)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/11 20:49:34 INFO spark.SparkContext: Invoking stop() from shutdown hook
It makes me sad; I don't know why it throws the NullPointerException.
Should I use the caffenet_train_net.prototxt in ${CAFFE_ON_SPARK}/data instead of the train_val.prototxt in ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet? And should I change the mean in caffenet_train_net.prototxt?
You did not define source_class? Depending on your source data format, you need to tell CaffeOnSpark about it, e.g. https://github.com/yahoo/CaffeOnSpark/blob/master/data/lenet_cos_train_test.prototxt#L10-L12
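For reference, here is a minimal sketch of what a data layer with source_class defined looks like, modeled on the linked lenet_cos_train_test.prototxt. The source path, dimensions, and mean values below are placeholders for illustration, not values from this thread:

layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  # Tells CaffeOnSpark which adapter class reads the input data (LMDB here).
  source_class: "com.yahoo.ml.caffe.LMDB"
  transform_param {
    # Per-channel mean subtraction; a mean_file could be used instead.
    mean_value: 104
    mean_value: 117
    mean_value: 123
    mirror: true
    crop_size: 227
  }
  memory_data_param {
    # Placeholder path: point this at your own training data.
    source: "file:///placeholder/path/to/train_lmdb/"
    batch_size: 64
    channels: 3
    height: 256
    width: 256
    share_in_parallel: false
  }
}

Without the source_class field, DataSource cannot be instantiated, which matches the NullPointerException in CaffeOnSpark.train above.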
Hello, I'm new to CaffeOnSpark. I have the same question: how do I use my own model to detect images? Have you solved it? Could you give me an example? Thanks!
I want to use CaffeNet to train my data. I have used it to train my data with Caffe before, but I don't know how to use this model in CaffeOnSpark. The wiki just shows how to train a DNN, so I want to know how to modify the spark-submit command to use the CaffeNet model, or some other way to use it.
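In case it helps: once a model has been trained, the wiki's pattern for evaluating it is the same spark-submit invocation with -test (or -features for feature extraction) in place of -train. A sketch, reusing the variables and file layout from the commands above; the solver and model paths here are assumptions, not tested values:

# Evaluate a previously trained model; -test replaces -train, and
# -model now points at the model produced by the training run.
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/solver.prototxt,${CAFFE_ON_SPARK}/data/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -test \
    -conf solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices 1 \
    -connection ethernet \
    -model file:${CAFFE_ON_SPARK}/myself_caffenet.model \
    -output file:${CAFFE_ON_SPARK}/myself_test_result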