yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.

Job hangs in ACCEPTED state, waiting for AM container to be allocated, launched and registered with RM #24

Closed · dejunzhang closed this 8 years ago

dejunzhang commented 8 years ago

```
16/03/11 16:09:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/11 16:09:25 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/11 16:09:26 INFO yarn.Client: Requesting a new application from cluster with 0 NodeManagers
16/03/11 16:09:26 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/03/11 16:09:26 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/03/11 16:09:26 INFO yarn.Client: Setting up container launch context for our AM
16/03/11 16:09:26 INFO yarn.Client: Setting up the launch environment for our AM container
16/03/11 16:09:26 INFO yarn.Client: Preparing resources for our AM container
16/03/11 16:09:26 INFO yarn.Client: Uploading resource file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://master:9000/user/atlas/.sparkStaging/application_1457683710951_0001/spark-assembly-1.6.0-hadoop2.6.0.jar
16/03/11 16:09:43 INFO yarn.Client: Uploading resource file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar -> hdfs://master:9000/user/atlas/.sparkStaging/application_1457683710951_0001/spark-examples-1.6.0-hadoop2.6.0.jar
16/03/11 16:09:53 INFO yarn.Client: Uploading resource file:/tmp/spark-186580ef-7b45-4d23-a810-8329df0d983e/spark_conf5049458426184257601.zip -> hdfs://master:9000/user/atlas/.sparkStaging/application_1457683710951_0001/spark_conf5049458426184257601.zip
16/03/11 16:09:54 INFO spark.SecurityManager: Changing view acls to: atlas
16/03/11 16:09:54 INFO spark.SecurityManager: Changing modify acls to: atlas
16/03/11 16:09:54 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(atlas); users with modify permissions: Set(atlas)
16/03/11 16:09:54 INFO yarn.Client: Submitting application 1 to ResourceManager
16/03/11 16:09:54 INFO impl.YarnClientImpl: Submitted application application_1457683710951_0001
16/03/11 16:09:55 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:55 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1457683794747
     final status: UNDEFINED
     tracking URL: http://master:8088/proxy/application_1457683710951_0001/
     user: atlas
16/03/11 16:09:56 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:57 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:58 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:59 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:00 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:01 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:02 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:03 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:04 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:05 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:06 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:07 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:08 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:09 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
```

mriduljain commented 8 years ago

Could you paste the command arguments?


nhe150 commented 8 years ago

You are short on resources. Kill the running application with `yarn application -kill <other-appid>`, and your new application will be able to run.
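
For reference, a minimal sketch of that workflow with the standard YARN CLI (the application ID below is the one from the log in this issue; substitute your own):

```bash
# List applications currently known to the ResourceManager
yarn application -list

# Kill a stuck application by its ID to free its resources
yarn application -kill application_1457683710951_0001
```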

mriduljain commented 8 years ago

That is what it looks like. Check the memory requested, etc.


xiangqiaolxq commented 8 years ago

I see this message: `16/03/11 16:09:26 INFO yarn.Client: Requesting a new application from cluster with 0 NodeManagers`. Does this mean there is no NodeManager in your YARN cluster, or that there are no resources?
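
One way to check is to ask the ResourceManager directly which NodeManagers have registered; if the list comes back empty, the cluster really has no compute nodes available:

```bash
# Show all NodeManagers known to the ResourceManager, in every state
yarn node -list -all
```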

mriduljain commented 8 years ago

Could you try out the suggestions in this thread? http://stackoverflow.com/questions/30828879/application-report-for-application-state-accepted-never-ends-for-spark-submi/34233499 It looks like you don't have enough resources.
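
If it does turn out to be container sizing, a quick smoke test with deliberately small containers can confirm it. A sketch, not from the original thread: it uses the SparkPi class from the spark-examples jar already visible in the log above, and assumes `SPARK_HOME` points at the spark-1.6.0-bin-hadoop2.6 directory; the memory values are purely illustrative.

```bash
# Request small driver/executor containers so they fit well within
# the 8192 MB per-container maximum reported in the log.
spark-submit --master yarn --deploy-mode cluster \
    --driver-memory 1g \
    --executor-memory 1g \
    --num-executors 2 \
    --class org.apache.spark.examples.SparkPi \
    ${SPARK_HOME}/lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
```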

dejunzhang commented 8 years ago

@mriduljain @nhe150 @xiangqiaolxq Let me describe my environment: I have 4 nodes. Three of them (slave1, slave2, slave3) have GPUs; the remaining one (master) does not, so it serves only as the namenode, and the slave* nodes are datanodes. The Hadoop and Spark configurations are identical on all nodes. In particular, ./hadoop-2.6.4/etc/hadoop/slaves contains:

```
slave1
slave2
slave3
```

and ./scripts/core_sites.xml is as below:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>
```

On the master node I run:

```bash
${HADOOP_HOME}/bin/hdfs namenode -format
${HADOOP_HOME}/sbin/start-dfs.sh
${HADOOP_HOME}/sbin/start-yarn.sh
```

And I can see the related processes running:

```
# master node (jps)
7720 Jps
7216 SecondaryNameNode
6985 NameNode
7444 ResourceManager

# slave nodes (jps)
12735 Jps
12538 NodeManager
12319 DataNode
```
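
The jps output shows the daemons are up on each machine, but it is also worth confirming that the slaves actually registered with the master rather than just started locally. A hedged check, run on the master:

```bash
# Lists every DataNode the NameNode can see, with capacity and state
hdfs dfsadmin -report
```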

Following the tutorial, I run the command from step 8, "Train a DNN network using CaffeOnSpark with 2 Spark executors with Ethernet connection":

```bash
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f ${CAFFE_ON_SPARK}/mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result
```

Below is the datanode information:

| Node | Last contact | Admin State | Capacity | Used | Non DFS Used | Remaining | Blocks | Block pool used | Failed Volumes | Version |
|------|--------------|-------------|----------|------|--------------|-----------|--------|-----------------|----------------|---------|
| slave1 | 0 | In Service | 437.45 GB | 131.1 MB | 41.79 GB | 395.53 GB | 5 | 131.1 MB (0.03%) | 0 | 2.6.4 |
| slave2 | 0 | In Service | 437.45 GB | 61.22 MB | 397.48 GB | 39.91 GB | 3 | 61.22 MB (0.01%) | 0 | 2.6.4 |
| slave3 | 0 | In Service | 437.45 GB | 59.39 MB | 285.35 GB | 152.04 GB | 2 | 59.39 MB (0.01%) | 0 | 2.6.4 |

I didn't run any other application in this cluster.

dejunzhang commented 8 years ago

@mriduljain All 3 datanodes have 24 GB RAM and two TITAN X 12 GB GPUs, and the namenode is just an ordinary desktop computer. I also tried using only the 3 datanodes as the cluster, and the same problem appears.

anfeng commented 8 years ago

Do you see the slave nodes in the master's web UI? If not, you need to look into the Hadoop logs. It could be an SSH issue from the slaves to the master.
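
A minimal sketch of those checks, assuming the hostnames used in this thread and a default ${HADOOP_HOME}/logs layout (the log file name pattern may differ on your install):

```bash
# Passwordless SSH must work between master and slaves
ssh slave1 hostname

# On a slave, look for NodeManager registration errors
tail -n 100 ${HADOOP_HOME}/logs/yarn-*-nodemanager-*.log
```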

Andy


dejunzhang commented 8 years ago

@anfeng Thank you very much. I did not see the slave nodes in the master's web UI. The problem was that I had not set the ResourceManager configuration in yarn-site.xml. I solved it with the solution below, and now I can see the slave nodes in the UI: http://stackoverflow.com/questions/32727675/slave-nodes-not-in-yarn-resourcemanager
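
For anyone hitting the same symptom: the `Connecting to ResourceManager at /0.0.0.0:8032` line in the original log is the telltale sign that YARN was still using its default ResourceManager address. Per the Stack Overflow discussion above, the fix amounts to pointing every node at the ResourceManager in yarn-site.xml; a minimal sketch, with the value matching this cluster's master hostname:

```xml
<!-- yarn-site.xml on all nodes: tell NodeManagers and clients where the ResourceManager runs -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
```

After adding it, restart YARN (stop-yarn.sh, then start-yarn.sh) and the slave nodes should appear in the master's web UI.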