yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

How to use CaffeOnSpark #217

Open githubier opened 7 years ago

githubier commented 7 years ago

I want to use caffenet to train my data. I have used it to train my data with plain Caffe before, but I don't know how to use this model in CaffeOnSpark. The wiki only shows me how to train a DNN, so I want to know how to modify the spark-submit command to use the caffenet model, or some other way to use it.

githubier commented 7 years ago

I modified the spark-submit example from the wiki as below to train with caffenet:

```sh
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/solver.prototxt,${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss \
        -label label \
        -conf solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/data/myself/myself_caffenet.model \
        -output file:${CAFFE_ON_SPARK}/data/myself/myself_result
```

The solver.prototxt and the train_val.prototxt are in ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet, and my data is in ${CAFFE_ON_SPARK}/data/myself. I wonder whether my change is right; when I run the above command, it shows me the error below:

```
17/01/11 14:27:19 INFO spark.SparkContext: Running Spark version 1.5.1
17/01/11 14:27:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/11 14:27:19 WARN spark.SparkConf: SPARK_WORKER_INSTANCES was detected (set to '1'). This is deprecated in Spark 1.0+.

Please instead use:

17/01/11 14:27:19 INFO spark.SecurityManager: Changing view acls to: master
17/01/11 14:27:19 INFO spark.SecurityManager: Changing modify acls to: master
17/01/11 14:27:19 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(master); users with modify permissions: Set(master)
17/01/11 14:27:20 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/01/11 14:27:20 INFO Remoting: Starting remoting
17/01/11 14:27:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.102:46792]
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'sparkDriver' on port 46792.
17/01/11 14:27:20 INFO spark.SparkEnv: Registering MapOutputTracker
17/01/11 14:27:20 INFO spark.SparkEnv: Registering BlockManagerMaster
17/01/11 14:27:20 INFO storage.DiskBlockManager: Created local directory at /home/master/Downloads/spark_sdk/spark-1.5.1/blockmgr-0a1f42d3-9d6e-441d-b223-0f7b60df7607
17/01/11 14:27:20 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
17/01/11 14:27:20 INFO spark.HttpFileServer: HTTP File server directory is /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/httpd-1aed9920-fae1-46b8-a4d1-66ffaba49548
17/01/11 14:27:20 INFO spark.HttpServer: Starting HTTP Server
17/01/11 14:27:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/01/11 14:27:20 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36673
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'HTTP file server' on port 36673.
17/01/11 14:27:20 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/01/11 14:27:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/01/11 14:27:20 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
17/01/11 14:27:20 INFO ui.SparkUI: Started SparkUI at http://192.168.1.102:4040
17/01/11 14:27:20 INFO spark.SparkContext: Added JAR file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at http://192.168.1.102:36673/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1484116040832
17/01/11 14:27:20 INFO util.Utils: Copying /home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/solver.prototxt to /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/userFiles-c70014b3-28ad-4b30-919f-35815b830b2f/solver.prototxt
17/01/11 14:27:20 INFO spark.SparkContext: Added file file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/solver.prototxt at http://192.168.1.102:36673/files/solver.prototxt with timestamp 1484116040918
17/01/11 14:27:20 INFO util.Utils: Copying /home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt to /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/userFiles-c70014b3-28ad-4b30-919f-35815b830b2f/train_val.prototxt
17/01/11 14:27:20 INFO spark.SparkContext: Added file file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt at http://192.168.1.102:36673/files/train_val.prototxt with timestamp 1484116040924
17/01/11 14:27:20 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Connecting to master spark://master:7077...
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20170111142721-0004
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/0 on worker-20170111005531-192.168.1.104-50421 (192.168.1.104:50421) with 2 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/0 on hostPort 192.168.1.104:50421 with 2 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/1 on worker-20170111005518-192.168.1.103-41380 (192.168.1.103:41380) with 1 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/1 on hostPort 192.168.1.103:41380 with 1 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/2 on worker-20170111135519-192.168.1.102-38652 (192.168.1.102:38652) with 1 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/2 on hostPort 192.168.1.102:38652 with 1 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/0 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/2 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/1 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/0 is now RUNNING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/1 is now RUNNING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/2 is now RUNNING
17/01/11 14:27:21 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44782.
17/01/11 14:27:21 INFO netty.NettyBlockTransferService: Server created on 44782
17/01/11 14:27:21 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/01/11 14:27:21 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:44782 with 530.3 MB RAM, BlockManagerId(driver, 192.168.1.102, 44782)
17/01/11 14:27:21 INFO storage.BlockManagerMaster: Registered BlockManager
17/01/11 14:27:22 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.103:38869/user/Executor#-249835689]) with ID 1
17/01/11 14:27:22 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.103:37499 with 530.3 MB RAM, BlockManagerId(1, 192.168.1.103, 37499)
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.102:38675/user/Executor#1450319069]) with ID 2
17/01/11 14:27:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:38538 with 530.3 MB RAM, BlockManagerId(2, 192.168.1.102, 38538)
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.104:39565/user/Executor#-1870584609]) with ID 0
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0
17/01/11 14:27:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.104:59533 with 530.3 MB RAM, BlockManagerId(0, 192.168.1.104, 59533)
Exception in thread "main" java.io.FileNotFoundException: solver.prototxt (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at java.io.FileInputStream.<init>(FileInputStream.java:101)
    at java.io.FileReader.<init>(FileReader.java:58)
    at com.yahoo.ml.jcaffe.Utils.GetSolverParam(Utils.java:14)
    at com.yahoo.ml.caffe.Config.protoFile_$eq(Config.scala:64)
    at com.yahoo.ml.caffe.Config.<init>(Config.scala:366)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:34)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/11 14:27:23 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
17/01/11 14:27:23 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.1.102:4040
17/01/11 14:27:23 INFO scheduler.DAGScheduler: Stopping DAGScheduler
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
17/01/11 14:27:23 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/01/11 14:27:23 INFO storage.MemoryStore: MemoryStore cleared
17/01/11 14:27:23 INFO storage.BlockManager: BlockManager stopped
17/01/11 14:27:23 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/01/11 14:27:23 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/01/11 14:27:23 INFO spark.SparkContext: Successfully stopped SparkContext
17/01/11 14:27:23 INFO util.ShutdownHookManager: Shutdown hook called
17/01/11 14:27:23 INFO util.ShutdownHookManager: Deleting directory /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e
```

Can someone help me?

junshi15 commented 7 years ago

Were you able to run the examples in the wiki? Your command appears to be correct, but Spark is complaining that it cannot find solver.prototxt.
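
One thing worth checking, as a hedged guess (assuming Spark standalone in client mode, which your log suggests): the stack trace shows the driver opening the file passed to -conf with a plain FileReader, so a relative path like solver.prototxt is resolved against the directory where you invoked spark-submit, not against the --files staging area. A minimal sketch:

```sh
# Hypothetical fix sketch: make the prototxt files visible from the driver's
# working directory, or pass an absolute path to -conf.
cd ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet
ls solver.prototxt train_val.prototxt   # both should be listed here
# ...then launch spark-submit from this directory, or instead pass:
#   -conf ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/solver.prototxt
```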

githubier commented 7 years ago

Sure, I have run the example successfully. But I don't know why this command fails; solver.prototxt does exist at the path given in my command.

junshi15 commented 7 years ago

Glad to know the examples worked for you. I don't know why your command failed.

githubier commented 7 years ago

Thank you for your help. I moved my solver.prototxt and train_val.prototxt to ${CAFFE_ON_SPARK}/data/, so the spark-submit command is:

```sh
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/solver.prototxt,${CAFFE_ON_SPARK}/data/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss \
        -label label \
        -conf solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/myself_caffenet.model \
        -output file:${CAFFE_ON_SPARK}/myself_result
```

However, there is also an error:

```
17/01/11 20:49:34 ERROR caffe.DataSource$: source_class must be defined for input data layer:Data
Exception in thread "main" java.lang.NullPointerException
    at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:103)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/11 20:49:34 INFO spark.SparkContext: Invoking stop() from shutdown hook
```

It makes me sad; I don't know why it throws the NullPointerException.

githubier commented 7 years ago

Should I use the caffenet_train_net.prototxt in ${CAFFE_ON_SPARK}/data instead of the train_val.prototxt in ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet? And should I change the mean in caffenet_train_net.prototxt?
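
(For context, the "mean" in question is the data layer's transform_param block. A sketch with illustrative values, using the BGR channel means commonly quoted for the original caffenet training set; your own dataset's means would replace them:)

```prototxt
# Illustrative transform_param for a caffenet-style data layer.
transform_param {
  mirror: true
  crop_size: 227
  # per-channel BGR means; replace with values computed from your own data
  mean_value: 104
  mean_value: 117
  mean_value: 123
}
```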

junshi15 commented 7 years ago

It looks like you did not define source_class. Depending on your source data format, you need to tell CaffeOnSpark about it; see e.g. https://github.com/yahoo/CaffeOnSpark/blob/master/data/lenet_cos_train_test.prototxt#L10-L12
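
For reference, a minimal sketch of such a data layer, modeled on the linked example (the source path, dimensions, and batch size are illustrative; com.yahoo.ml.caffe.LMDB is the reader class for LMDB input):

```prototxt
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  # CaffeOnSpark-specific field: which reader class feeds this layer
  source_class: "com.yahoo.ml.caffe.LMDB"
  include { phase: TRAIN }
  memory_data_param {
    source: "file:/path/to/your_train_lmdb/"  # illustrative path
    batch_size: 64
    channels: 3
    height: 227
    width: 227
    share_in_parallel: false
  }
}
```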

rosszh commented 7 years ago

Hello, I'm new to CaffeOnSpark. I have the same question: how do I use my own model to detect images? Have you solved it? Could you give me an example? Thanks!
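
(A hedged sketch of the usual CaffeOnSpark approach: once training has produced a model, run the same jar with -features instead of -train, pointing -model at the trained model; the blob name fc8 and the output path below are illustrative:)

```sh
# -features names the network blob(s) whose outputs you want per input image;
# -model is the previously trained model, -output where the results are written.
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/solver.prototxt,${CAFFE_ON_SPARK}/data/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -features fc8 \
        -label label \
        -conf solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/myself_caffenet.model \
        -output file:${CAFFE_ON_SPARK}/myself_features
```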