yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.

"ERROR CaffeOnSpark: Requested # of executors: " and rank 0 #237

Closed: hyunjong closed this issue 7 years ago

hyunjong commented 7 years ago

I'm currently working on CaffeOnSpark with 9 machines, named and addressed as follows: lab-machine01: 10.1.97.2, lab-machine02: 10.1.97.3, ..., lab-machine09: 10.1.97.10. machine01 plays the master role and is where "spark-submit" is executed; the other 8 nodes, machine02-09, are workers. I ran a sample test, LeNet training on the MNIST dataset, several times, and it seems to run fine: it reaches the last stage (92nd) and generates an "mnist_lenet.model" file as output. Below I attach the test command and a simplified master log.


-----------------------------Two issues I have---------------------------------------
1) The mnist_lenet.model file is generated on one of the workers, not on the master.
2) The training process terminates with the error "ERROR CaffeOnSpark: Requested # of executors: ..."


-----------------------------My analysis on these issues------------------------------------
1) I found this snippet in "./caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeProcessor.scala":

if (source.isTrain) {
  //start solvers w/ only rank 0 will save model
  ...

The master's log below also shows that one of the workers (lab-machine09 in this test) is assigned rank 0 after job 2 finishes. The program then continues with the training job after the workers' ranks are set, and after it terminated I found the trained model file on the "lab-machine09" machine.

2) The master's log below shows that 7 out of 8 workers were terminated with the message "Lost executor X on 10.1.97.7: Remote RPC client disassociated" at the end of the training process. Fortunately this error doesn't seem to affect the result, since I could still obtain the trained model file. To find out why, I checked the log of one of the terminated workers, and it gives the reason as follows:

I0317 17:24:08.550050 31947 CaffeNet.cpp:325] Finetuning from /home/ubuntu/CaffeOnSpark/mnist_lenet.model
F0317 17:24:08.550176 31947 io.cpp:54] Check failed: fd != -1 (-1 vs. -1) File not found: /home/ubuntu/CaffeOnSpark/mnist_lenet.model
Check failure stack trace:

So these 7 workers were terminated because they failed the mnist_lenet.model check: the file does not exist on them.


-----------------------------My questions-----------------------------------------------
1) What is "rank 0" for? I ran the same test several times, and rank 0 was assigned to a random worker every time. How can I make the master store the trained model?
2) 7 out of 8 workers get terminated because, of course, they don't have the trained model. Am I doing something wrong in the submit command? I modified nothing except the train and test lmdb file paths in lenet_memory_train_test.prototxt. I tested "--deploy-mode" with both cluster and client, and the behavior was the same.

Any help would be really appreciated.


-----------------------------Test Code-----------------------------------------------
${SPARK_HOME}/bin/spark-submit \
    --master ${MASTER_URL} \
    --deploy-mode client \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.executor.memory=2g \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -clusterSize ${SPARK_WORKER_INSTANCES} -devices 1 \
        -connection ethernet \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result


-----------------------------Master (machine01) log --------------------------------
~/CaffeOnSpark/data ~
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
INFO SparkContext: Running Spark version 2.0.0
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO SecurityManager: Changing view acls to: ubuntu
INFO SecurityManager: Changing modify acls to: ubuntu
INFO SecurityManager: Changing view acls groups to:
INFO SecurityManager: Changing modify acls groups to:
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
INFO Utils: Successfully started service 'sparkDriver' on port 53059.

...

INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.1.97.2:4040
INFO SparkContext: Added JAR file:/home/ubuntu/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at spark://10.1.97.2:53059/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp

...

INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.3:56851) with ID 7
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.10:54581) with ID 2
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.4:43817) with ID 6
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.9:52783) with ID 4
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.6:59073) with ID 1
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.7:57169) with ID 3
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.1.97.5:33535) with ID 0
INFO DataSource$: Source data layer:0
INFO LMDB: Batch size:64
INFO DataSource$: Source data layer:1
INFO LMDB: Batch size:100
INFO CaffeOnSpark: interleave
INFO LmdbRDD: local LMDB path:/home/ubuntu/CaffeOnSpark/data/mnist_train_lmdb
INFO LmdbRDD: 8 LMDB RDD partitions

...

INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.1.97.9, partition 0, PROCESS_LOCAL, 5291 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.1.97.5, partition 1, PROCESS_LOCAL, 5308 bytes)
INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.1.97.7, partition 2, PROCESS_LOCAL, 5308 bytes)
INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.1.97.3, partition 3, PROCESS_LOCAL, 5308 bytes)
INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.1.97.6, partition 4, PROCESS_LOCAL, 5308 bytes)
INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.1.97.10, partition 5, PROCESS_LOCAL, 5308 bytes)
INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.1.97.4, partition 6, PROCESS_LOCAL, 5308 bytes)
INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 10.1.97.8, partition 7, PROCESS_LOCAL, 5308 bytes)

...

INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 2569 ms on 10.1.97.10 (1/8)
INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 2569 ms on 10.1.97.8 (2/8)
INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2592 ms on 10.1.97.7 (3/8)
INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2600 ms on 10.1.97.3 (4/8)
INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 2640 ms on 10.1.97.6 (5/8)
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2749 ms on 10.1.97.9 (6/8)
INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 2666 ms on 10.1.97.4 (7/8)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2697 ms on 10.1.97.5 (8/8)
INFO DAGScheduler: ResultStage 0 (count at CaffeOnSpark.scala:257) finished in 2.766 s
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 0 finished: count at CaffeOnSpark.scala:257, took 3.510765 s

...

INFO DAGScheduler: Job 2 finished: collect at CaffeOnSpark.scala:123, took 2.802059 s
INFO CaffeOnSpark: rank = 0, address = ,lab-machine09:48645,lab-machine09:48645,lab-machine09:48645,lab-machine09:48645,lab-machine09:48645,lab-machine09:48645,lab-machine09:48645, hostname = lab-machine09
INFO CaffeOnSpark: rank = 1, address = lab-machine03:45879,,lab-machine03:45879,lab-machine03:45879,lab-machine03:45879,lab-machine03:45879,lab-machine03:45879,lab-machine03:45879, hostname = lab-machine03
INFO CaffeOnSpark: rank = 2, address = lab-machine05:42568,lab-machine05:42568,,lab-machine05:42568,lab-machine05:42568,lab-machine05:42568,lab-machine05:42568,lab-machine05:42568, hostname = lab-machine05
INFO CaffeOnSpark: rank = 3, address = lab-machine07:47924,lab-machine07:47924,lab-machine07:47924,,lab-machine07:47924,lab-machine07:47924,lab-machine07:47924,lab-machine07:47924, hostname = lab-machine07
INFO CaffeOnSpark: rank = 4, address = lab-machine08:45542,lab-machine08:45542,lab-machine08:45542,lab-machine08:45542,,lab-machine08:45542,lab-machine08:45542,lab-machine08:45542, hostname = lab-machine08
INFO CaffeOnSpark: rank = 5, address = lab-machine04:47940,lab-machine04:47940,lab-machine04:47940,lab-machine04:47940,lab-machine04:47940,,lab-machine04:47940,lab-machine04:47940, hostname = lab-machine04
INFO CaffeOnSpark: rank = 6, address = lab-machine02:56925,lab-machine02:56925,lab-machine02:56925,lab-machine02:56925,lab-machine02:56925,lab-machine02:56925,,lab-machine02:56925, hostname = lab-machine02
INFO CaffeOnSpark: rank = 7, address = lab-machine06:42552,lab-machine06:42552,lab-machine06:42552,lab-machine06:42552,lab-machine06:42552,lab-machine06:42552,lab-machine06:42552,, hostname = lab-machine06
INFO CaffeOnSpark: rank 0:lab-machine09

...

INFO TaskSetManager: Finished task 0.0 in stage 92.0 (TID 408) in 303 ms on 10.1.97.10 (1/8)
ERROR TaskSchedulerImpl: Lost executor 3 on 10.1.97.7: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN TaskSetManager: Lost task 2.0 in stage 92.0 (TID 410, 10.1.97.7): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
ERROR TaskSchedulerImpl: Lost executor 4 on 10.1.97.9: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN TaskSetManager: Lost task 5.0 in stage 92.0 (TID 413, 10.1.97.9): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO DAGScheduler: Executor lost: 3 (epoch 2)
INFO TaskSetManager: Starting task 5.1 in stage 92.0 (TID 416, 10.1.97.10, partition 5, PROCESS_LOCAL, 5376 bytes)
INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 416 on executor id: 2 hostname: 10.1.97.10.

...

INFO BlockManagerMaster: Removed 6 successfully in removeExecutor
INFO DAGScheduler: Executor lost: 7 (epoch 8)

...

INFO TaskSetManager: Starting task 6.1 in stage 92.0 (TID 417, 10.1.97.10, partition 6, PROCESS_LOCAL, 5376 bytes)
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 417 on executor id: 2 hostname: 10.1.97.10.
INFO TaskSetManager: Finished task 5.1 in stage 92.0 (TID 416) in 150 ms on 10.1.97.10 (2/8)
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170317161850-0004/6 is now EXITED (Command exited with code 134)
INFO StandaloneSchedulerBackend: Executor app-20170317161850-0004/6 removed: Command exited with code 134
INFO BlockManagerMaster: Removal of executor 6 requested
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 6
INFO BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170317161850-0004/14 on worker-20170315230433-10.1.97.4-34449 (10.1.97.4:34449) with 1 cores
INFO StandaloneSchedulerBackend: Granted executor ID app-20170317161850-0004/14 on hostPort 10.1.97.4:34449 with 1 cores, 2.0 GB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170317161850-0004/14 is now RUNNING

...

INFO TaskSetManager: Finished task 3.1 in stage 92.0 (TID 421) in 341 ms on 10.1.97.10 (7/8)
INFO TaskSetManager: Finished task 2.1 in stage 92.0 (TID 422) in 460 ms on 10.1.97.10 (8/8)
INFO DAGScheduler: ResultStage 92 (count at CaffeOnSpark.scala:456) finished in 2.416 s
INFO TaskSchedulerImpl: Removed TaskSet 92.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 51 finished: count at CaffeOnSpark.scala:456, took 2.467966 s
ERROR CaffeOnSpark: Requested # of executors: 8 actual # of executors:1. Please try to set --conf spark.scheduler.maxRegisteredResourcesWaitingTime with a large value (default 30s)
Exception in thread "main" java.lang.IllegalStateException: actual number of executors is not as expected
    at com.yahoo.ml.caffe.CaffeOnSpark.features2(CaffeOnSpark.scala:468)
    at com.yahoo.ml.caffe.CaffeOnSpark.features(CaffeOnSpark.scala:429)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:54)
...
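
(For reference, the error message above suggests raising spark.scheduler.maxRegisteredResourcesWaitingTime. A minimal sketch of how that conf could be added to my submit command is below; the 300s value is just a placeholder I picked, and I'm not sure it applies here since my executors actually exited rather than registered late.)

    --conf spark.scheduler.maxRegisteredResourcesWaitingTime=300s \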

junshi15 commented 7 years ago

Rank 0 is the first node Spark assigned to the job. It acts as a master, responsible for saving/snapshotting models, since there is no need for every worker to save exactly the same model. If the model is saved to a shared place such as an HDFS location, having multiple workers save the same model could cause conflicts.
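
For example, a minimal sketch of pointing the model and output to shared storage instead of a node-local path (assuming your cluster has HDFS; the paths below are placeholders to adjust for your setup):

    -model hdfs:///user/ubuntu/mnist_lenet.model \
    -output hdfs:///user/ubuntu/lenet_features_result

With a shared path, whichever executor becomes rank 0 writes the snapshot to a location every other executor can read, which might also avoid the "File not found" failures you saw during finetuning.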

hyunjong commented 7 years ago

@junshi15 I see. I thought the machine that issues "spark-submit" was designated as the master. Thank you so much for your help.