yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0
3.88k stars 939 forks source link

the job stop and don't move #162

Closed lzzzzzzzzz closed 6 years ago

lzzzzzzzzz commented 7 years ago

like i stopped and can't do nothing,the executors added and it run. foreach at partition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:53(job 0) collect at pythonrdd:scala:453(job1) and then it stop and do nothing. I want to know what's wrong and I'm appreciated if you can help me. By the time I will look around the solved questions,mybe I can find a same situation

2017-11-06 22-23-17 2017-11-06 22-23-51

lzzzzzzzzz commented 7 years ago

I will be appreciated if someone can help me...

zerodarkzone commented 7 years ago

Same issue here

leewyang commented 7 years ago

Please look in your executor logs (Spark UI -> Executors -> stderr) for any errors.

zerodarkzone commented 7 years ago

Hi, this is the output I get in stderr. I'm using tensorflow compiled with the master branch.

log:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/11/07 15:11:46 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 17/11/07 15:11:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/11/07 15:11:47 INFO SecurityManager: Changing view acls to: david 17/11/07 15:11:47 INFO SecurityManager: Changing modify acls to: david 17/11/07 15:11:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(david); users with modify permissions: Set(david) 17/11/07 15:11:49 INFO SecurityManager: Changing view acls to: david 17/11/07 15:11:49 INFO SecurityManager: Changing modify acls to: david 17/11/07 15:11:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(david); users with modify permissions: Set(david) 17/11/07 15:11:50 INFO Slf4jLogger: Slf4jLogger started 17/11/07 15:11:50 INFO Remoting: Starting remoting 17/11/07 15:11:50 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@10.23.228.154:41052] 17/11/07 15:11:50 INFO Utils: Successfully started service 'sparkExecutorActorSystem' on port 41052. 17/11/07 15:11:51 INFO DiskBlockManager: Created local directory at /tmp/spark-20a25310-733b-4be9-9f7d-de6cc752629b/executor-13d0a6fe-2375-443a-98ea-9911e3876119/blockmgr-e019b4fd-7710-4d6d-aba9-62a24d102f2e 17/11/07 15:11:51 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 17/11/07 15:11:51 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.23.228.154:40023 17/11/07 15:11:51 INFO WorkerWatcher: Connecting to worker spark://Worker@10.23.228.154:42430 17/11/07 15:11:51 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 17/11/07 15:11:51 INFO Executor: Starting executor ID 1 on host 10.23.228.154 17/11/07 15:11:51 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46498. 17/11/07 15:11:51 INFO NettyBlockTransferService: Server created on 46498 17/11/07 15:11:51 INFO BlockManagerMaster: Trying to register BlockManager 17/11/07 15:11:51 INFO BlockManagerMaster: Registered BlockManager 17/11/07 15:11:51 INFO CoarseGrainedExecutorBackend: Got assigned task 1 17/11/07 15:11:51 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 17/11/07 15:11:52 INFO Executor: Fetching http://10.23.228.154:46615/files/mnist_dist.py with timestamp 1510085504389 17/11/07 15:11:52 INFO Utils: Fetching http://10.23.228.154:46615/files/mnist_dist.py to /tmp/spark-20a25310-733b-4be9-9f7d-de6cc752629b/executor-13d0a6fe-2375-443a-98ea-9911e3876119/spark-8f006f53-955d-495c-898b-12bddd021c6b/fetchFileTemp4241939507592543270.tmp 17/11/07 15:11:52 INFO Utils: Copying /tmp/spark-20a25310-733b-4be9-9f7d-de6cc752629b/executor-13d0a6fe-2375-443a-98ea-9911e3876119/spark-8f006f53-955d-495c-898b-12bddd021c6b/17577459481510085504389_cache to /home/david/TensorFlowOnSpark/spark-1.6.3-bin-hadoop2.6/work/app-20171107151144-0002/1/./mnist_dist.py 17/11/07 15:11:52 INFO Executor: Fetching http://10.23.228.154:46615/files/mnist_spark.py with timestamp 1510085504322 17/11/07 15:11:52 INFO Utils: Fetching http://10.23.228.154:46615/files/mnist_spark.py to /tmp/spark-20a25310-733b-4be9-9f7d-de6cc752629b/executor-13d0a6fe-2375-443a-98ea-9911e3876119/spark-8f006f53-955d-495c-898b-12bddd021c6b/fetchFileTemp2774939994065177720.tmp 17/11/07 15:11:52 INFO Utils: Copying /tmp/spark-20a25310-733b-4be9-9f7d-de6cc752629b/executor-13d0a6fe-2375-443a-98ea-9911e3876119/spark-8f006f53-955d-495c-898b-12bddd021c6b/4300411671510085504322_cache to /home/david/TensorFlowOnSpark/spark-1.6.3-bin-hadoop2.6/work/app-20171107151144-0002/1/./mnist_spark.py 17/11/07 15:11:53 INFO TorrentBroadcast: Started reading broadcast variable 2 17/11/07 15:11:53 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.4 KB, free 511.1 MB) 17/11/07 15:11:53 INFO TorrentBroadcast: Reading broadcast variable 2 took 215 ms 17/11/07 15:11:53 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 10.6 KB, free 511.1 MB) 2017-11-07 15:11:55,424 INFO (MainThread-13772) connected to server at ('10.23.228.154', 46874) 2017-11-07 15:11:55,426 INFO (MainThread-13772) TFSparkNode.reserve: {'authkey': '\xc2\xc95\xd5\xa0iF\xe2\x83\xa4\xd3\xf7\x02\xd4v\x9a', 'worker_num': 1, 'host': '10.23.228.154', 'tb_port': 0, 'addr': '/tmp/pymp-YSHIXZ/listener-kjEt3q', 'ppid': 13755, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 43182} 2017-11-07 15:11:57,432 INFO (MainThread-13772) node: {'addr': ('10.23.228.154', 35517), 'task_index': 0, 'job_name': 'ps', 'authkey': '\xdf\x1d\xd2\x12\x01NN\t\x98\x1d\xd7\xb0\x96\xe5$1', 'worker_num': 0, 'host': '10.23.228.154', 'ppid': 13753, 'port': 46742, 'tb_pid': 0, 'tb_port': 0} 2017-11-07 15:11:57,432 INFO (MainThread-13772) node: {'addr': '/tmp/pymp-YSHIXZ/listener-kjEt3q', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xc2\xc95\xd5\xa0iF\xe2\x83\xa4\xd3\xf7\x02\xd4v\x9a', 'worker_num': 1, 'host': '10.23.228.154', 'ppid': 13755, 'port': 43182, 'tb_pid': 0, 'tb_port': 0} 2017-11-07 15:11:57,445 INFO (MainThread-13772) Starting TensorFlow worker:0 on cluster node 1 on background process 17/11/07 15:11:57 INFO PythonRunner: Times: total = 3605, boot = 1442, init = 78, finish = 2085 17/11/07 15:11:57 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 999 bytes result sent to driver 17/11/07 15:11:57 INFO CoarseGrainedExecutorBackend: Got assigned task 2 17/11/07 15:11:57 INFO Executor: Running task 0.0 in stage 1.0 (TID 2) 17/11/07 15:11:57 INFO TorrentBroadcast: Started reading broadcast variable 3 17/11/07 15:11:57 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.6 KB, free 511.1 MB) 17/11/07 15:11:57 INFO TorrentBroadcast: Reading broadcast variable 3 took 26 ms 17/11/07 15:11:57 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 12.5 KB, free 511.1 MB) 17/11/07 15:11:57 INFO HadoopRDD: Input split: file:/home/david/TensorFlowOnSpark/examples/mnist/csv/train/images/part-00005:0+11173834 17/11/07 15:11:57 INFO TorrentBroadcast: Started reading broadcast variable 0 17/11/07 15:11:57 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.3 KB, free 511.1 MB) 17/11/07 15:11:57 INFO TorrentBroadcast: Reading broadcast variable 0 took 35 ms 17/11/07 15:11:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 280.8 KB, free 510.8 MB) 17/11/07 15:11:58 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/11/07 15:11:58 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 17/11/07 15:11:58 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/11/07 15:11:58 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/11/07 15:11:58 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 2017-11-07 15:11:58,595 INFO (MainThread-13801) 1: ======== worker:0 ======== 2017-11-07 15:11:58,595 INFO (MainThread-13801) 1: Cluster spec: {'ps': ['10.23.228.154:46742'], 'worker': ['10.23.228.154:43182']} 2017-11-07 15:11:58,595 INFO (MainThread-13801) 1: Using CPU E1107 15:11:58.700764039 13801 ev_epoll1_linux.c:1051] grpc epoll fd: 12 2017-11-07 15:11:58.706813: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.23.228.154:46742} 2017-11-07 15:11:58.706860: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:43182} 2017-11-07 15:11:58.707517: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:43182 17/11/07 15:11:58 INFO HadoopRDD: Input split: file:/home/david/TensorFlowOnSpark/examples/mnist/csv/train/labels/part-00005:0+245760 17/11/07 15:11:58 INFO TorrentBroadcast: Started reading broadcast variable 1 17/11/07 15:11:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 19.3 KB, free 510.8 MB) 17/11/07 15:11:58 INFO TorrentBroadcast: Reading broadcast variable 1 took 33 ms 17/11/07 15:11:59 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 280.8 KB, free 510.5 MB) 2017-11-07 15:11:59,303 INFO (MainThread-13821) Connected to TFSparkNode.mgr on 10.23.228.154, ppid=13755, state='running' tensorflow model path: file:///home/david/TensorFlowOnSpark/mnist_model 2017-11-07 15:11:59,327 INFO (MainThread-13821) mgr.state='running' 2017-11-07 15:11:59,328 INFO (MainThread-13821) Feeding partition <generator object load_stream at 0x7f06d4689500> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f06d463d790> 2017-11-07 15:11:59.511686: E tensorflow/core/distributed_runtime/master.cc:257] Master init: Unavailable: Trying to connect an http1.x server INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, Trying to connect an http1.x server 2017-11-07 15:11:59,514 INFO (MainThread-13801) Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, Trying to connect an http1.x server Process Process-2: Traceback (most recent call last): File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run self._target(*self._args, *self._kwargs) File "mnist_dist.py", line 133, in map_fun with sv.managed_session(server.target) as sess: File "/usr/lib/python2.7/contextlib.py", line 17, in enter return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session self.stop(close_summary_writer=close_summary_writer) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop stop_grace_period_secs=self._stop_grace_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(self._exc_info_to_raise) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session start_standard_services=start_standard_services) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session init_feed_dict=self._init_feed_dict, init_fn=self._init_fn) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session sess.run(init_op, feed_dict=init_feed_dict) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1118, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1315, in _do_run options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call raise type(e)(node_def, op, message) UnavailableError: Trying to connect an http1.x server

leewyang commented 7 years ago

Are you using TF 1.4? I see this stack overflow which looks similar. (Can you try TF 1.3?)

lzzzzzzzzz commented 6 years ago

@leewyang i use tensorflow 1.3.0 tensorflowonspark 1.0.6

zerodarkzone commented 6 years ago

It works for me on Tensorflow 1.3 but it fails with 1.4. @lzzzzzzzzz could you show the stderr from your other executors?, the error I have only shows in one of them, the rest seem fine.

lzzzzzzzzz commented 6 years ago

@leewyang where can i find mnist_model ,beacause the job is waiting for model to be ready but it doesn't exists.

lzzzzzzzzz commented 6 years ago

@zerodarkzone i found mnist_model doesn't exists ,i have to upload on hdfs://master:9000/user/root/mnist_model

lzzzzzzzzz commented 6 years ago

@zerodarkzone i view the stderr and find one of the job is waiting for model to be ready and the model path is'hdfs://master:9000/user/root/mnist_model' so it stop forever!

djygithub commented 6 years ago

I have a similar issue with HDP 2.6.3, the training session doesn't finish, screenshot from 2017-11-10 16-35-54

here's what I did:

[root@I77700 TensorFlowOnSpark-master]# spark-submit \

--master yarn \ --deploy-mode cluster \ --queue default \ --num-executors 1 \ --executor-memory 3G \ --archives /root/mnistdata/mnist.zip#mnist \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \ --driver-library-path=$LIB_CUDA \ /root/Downloads/TensorFlowOnSpark-master/examples/mnist/mnist_data_setup.py \ --output mnist/csv \ --format csv Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set Spark1 will be picked by default . . 17/11/10 15:23:01 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 10.207.55.138 ApplicationMaster RPC port: 0 queue: default start time: 1510356090043 final status: SUCCEEDED tracking URL: http://I77700:8088/proxy/application_1510355100393_0008/ user: root 17/11/10 15:23:01 INFO ShutdownHookManager: Shutdown hook called 17/11/10 15:23:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9bc91c3-250a-45d9-8edc-23d4194d8eb6 [root@I77700 TensorFlowOnSpark-master]# hdfs dfs -ls /user/root Found 2 items drwxr-xr-x - root root 0 2017-11-10 15:23 /user/root/.sparkStaging drwxr-xr-x - root root 0 2017-11-10 15:21 /user/root/mnist

[root@I77700 TensorFlowOnSpark-master]# hdfs dfs -ls /user/root/mnist/csv/test/images Found 11 items -rw-r--r-- 1 root root 0 2017-11-10 15:23 /user/root/mnist/csv/test/images/_SUCCESS -rw-r--r-- 1 root root 1810248 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00000 -rw-r--r-- 1 root root 1806102 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00001 -rw-r--r-- 1 root root 1811128 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00002 -rw-r--r-- 1 root root 1812952 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00003 -rw-r--r-- 1 root root 1810946 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00004 -rw-r--r-- 1 root root 1835497 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00005 -rw-r--r-- 1 root root 1845261 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00006 -rw-r--r-- 1 root root 1850655 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00007 -rw-r--r-- 1 root root 1852712 2017-11-10 15:22 /user/root/mnist/csv/test/images/part-00008 -rw-r--r-- 1 root root 1833942 2017-11-10 15:23 /user/root/mnist/csv/test/images/part-00009 [root@I77700 TensorFlowOnSpark-master]# hdfs dfs -ls /user/root/mnist/csv/train/images Found 11 items -rw-r--r-- 1 root root 0 2017-11-10 15:22 /user/root/mnist/csv/train/images/_SUCCESS -rw-r--r-- 1 root root 9338236 2017-11-10 15:21 /user/root/mnist/csv/train/images/part-00000 -rw-r--r-- 1 root root 11231804 2017-11-10 15:21 /user/root/mnist/csv/train/images/part-00001 -rw-r--r-- 1 root root 11214784 2017-11-10 15:21 /user/root/mnist/csv/train/images/part-00002 -rw-r--r-- 1 root root 11226100 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00003 -rw-r--r-- 1 root root 11212767 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00004 -rw-r--r-- 1 root root 11173834 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00005 -rw-r--r-- 1 root root 11214285 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00006 -rw-r--r-- 1 root root 11201024 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00007 -rw-r--r-- 1 root root 11194141 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00008 -rw-r--r-- 1 root root 10449019 2017-11-10 15:22 /user/root/mnist/csv/train/images/part-00009

Here's the training script:

[root@I77700 TensorFlowOnSpark-master]# spark-submit \

--master yarn \ --deploy-mode cluster \ --queue default \ --num-executors 1 \ --executor-memory 3G \ --driver-cores 4 \ --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \ --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \ --driver-library-path=$LIB_CUDA \ /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py \ --images mnist/csv/train/images \ --labels mnist/csv/train/labels \ --mode train \ --model mnist_model

And stderr:

2017-11-10 15:27:44,654 INFO (MainThread-29907) Feeding partition <generator object load_stream at 0x2321780> into input queue <multiprocessing.queues.JoinableQueue object at 0x21d8ed0> 17/11/10 15:27:46 INFO PythonRunner: Times: total = 1599, boot = -547, init = 612, finish = 1534 17/11/10 15:27:46 INFO PythonRunner: Times: total = 61, boot = 2, init = 7, finish = 52 2017-11-10 15:27:46,253 INFO (MainThread-29887) 0: ======== worker:0 ======== 2017-11-10 15:27:46,253 INFO (MainThread-29887) 0: Cluster spec: {'worker': ['10.207.55.138:40096']} 2017-11-10 15:27:46,253 INFO (MainThread-29887) 0: Using CPU 2017-11-10 15:27:46.254213: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2017-11-10 15:27:46.254223: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2017-11-10 15:27:46.254226: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 2017-11-10 15:27:46.254229: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2017-11-10 15:27:46.254232: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 2017-11-10 15:27:46.258064: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:40096} 2017-11-10 15:27:46.258653: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:40096 tensorflow model path: hdfs://i77700:8020/user/root/mnist_model . . 17/11/10 15:27:43 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.207.55.138:43212 (size: 6.7 KB, free: 457.8 MB) 17/11/10 15:27:43 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008 17/11/10 15:27:43 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 1 (PythonRDD[10] at RDD at PythonRDD.scala:43) 17/11/10 15:27:43 INFO YarnClusterScheduler: Adding task set 1.0 with 10 tasks 17/11/10 15:27:43 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, I77700, partition 0,NODE_LOCAL, 2620 bytes) 17/11/10 15:27:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1575 ms on I77700 (1/1) 17/11/10 15:27:43 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 17/11/10 15:27:43 INFO DAGScheduler: ResultStage 0 (foreachPartition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:253) finished in 1.658 s 17/11/10 15:27:43 INFO DAGScheduler: Job 0 finished: foreachPartition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:253, took 1.687859 s 17/11/10 15:27:43 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on I77700:44197 (size: 6.7 KB, free: 1983.0 MB) 17/11/10 15:27:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on I77700:44197 (size: 29.8 KB, free: 1983.0 MB) 17/11/10 15:27:44 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on I77700:44197 (size: 29.8 KB, free: 1982.9 MB) 17/11/10 15:57:39 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 10.207.55.138:43212 in memory (size: 7.6 KB, free: 457.8 MB) 17/11/10 15:57:39 INFO BlockManagerInfo: Removed broadcast_2_piece0 on I77700:44197 in memory (size: 7.6 KB, free: 1982.9 MB) 17/11/10 15:57:39 INFO ContextCleaner: Cleaned accumulator 2

Then the job stops progressing Trying to attach two html thread dumps, one for the master one for the worker: threaddump1.txt threaddump2.txt

Thanks.

djygithub commented 6 years ago

Same situation with Spark2: screenshot from 2017-11-10 16-36-01 screenshot from 2017-11-10 16-39-32 screenshot from 2017-11-10 16-39-38 screenshot from 2017-11-10 16-39-52

Thread dumps: threaddump3.txt threaddump4.txt

leewyang commented 6 years ago

@lzzzzzzzzz @djygithub That symptom is usually associated with the fact that you don't have the libhdfs.so in your LD_LIBRARY_PATH. TensorFlow needs this in order to save/create the mnist_model on HDFS.

djygithub commented 6 years ago

Changed the spark log level from INFO to ALL: screenshot from 2017-11-11 10-03-13

Invoked training:

[root@I77700 TensorFlowOnSpark-master]# export HADOOP_HOME=/usr/hdp/current/hadoop-client/ [root@I77700 TensorFlowOnSpark-master]# export HADOOP_PREFIX=/usr/hdp/current/hadoop-client/ [root@I77700 TensorFlowOnSpark-master]# export HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client [root@I77700 TensorFlowOnSpark-master]# export HADOOP_CONF_DIR=/etc/hadoop/conf [root@I77700 TensorFlowOnSpark-master]# export HADOOP_COMMON_HOME=/usr/hdp/current/hadoop-client/ [root@I77700 TensorFlowOnSpark-master]# export HADOOP_MAPRED_HOME=/usr/hdp/current/hadoop-client/ [root@I77700 TensorFlowOnSpark-master]# export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec [root@I77700 TensorFlowOnSpark-master]# export PYTHONPATH=/usr/bin/python [root@I77700 TensorFlowOnSpark-master]# export HADOOP_PREFIX=/usr/hdp/current/hadoop-client [root@I77700 TensorFlowOnSpark-master]# export LIB_HDFS=/usr/hdp/2.6.0.3-8/usr/lib:/usr/hdp/current/hadoop-client/lib/native: [root@I77700 TensorFlowOnSpark-master]# export PYTHON_ROOT=/usr [root@I77700 TensorFlowOnSpark-master]# export LD_LIBRARY_PATH=${PATH} [root@I77700 TensorFlowOnSpark-master]# export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python [root@I77700 TensorFlowOnSpark-master]# export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python" [root@I77700 TensorFlowOnSpark-master]# export PATH=${PYTHON_ROOT}/bin/:$PATH [root@I77700 TensorFlowOnSpark-master]# export QUEUE=default [root@I77700 TensorFlowOnSpark-master]# export LIB_JVM=/usr/jdk64/jdk1.8.0_112/jre/lib/amd64 [root@I77700 TensorFlowOnSpark-master]# export SPARK_HOME=/usr/hdp/current/spark1-client [root@I77700 TensorFlowOnSpark-master]# export LIB_CUDA=/usr/local/cuda-8.0/targets/x86_64-linux/lib/ [root@I77700 TensorFlowOnSpark-master]# export CUDA_HOME=/root/Downloads/cuda [root@I77700 TensorFlowOnSpark-master]# export CUDA_HOME="/root/Downloads/cuda" [root@I77700 TensorFlowOnSpark-master]# export LIB_CUDA=/usr/local/cuda-8.0/targets/x86_64-linux/lib/ [root@I77700 TensorFlowOnSpark-master]# spark-submit \

--master yarn \ --deploy-mode cluster \ --queue default \ --num-executors 1 \ --executor-memory 3G \ --driver-cores 1 \ --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA:$LIB_JVM:$LIB_HDFS \ --driver-library-path=$LIB_CUDA \ /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py \ --images mnist/csv/train/images \ --labels mnist/csv/train/labels \ --mode train \ --model mnist_model

Seems to be looping around this message, i tried to increase the yarn max contatiner size it just keeps asking for more:

17/11/11 09:34:39 INFO Client: Application report for application_1510418026588_0010 (state: RUNNING) 17/11/11 09:34:39 DEBUG Client: client token: N/A diagnostics: N/A ApplicationMaster host: 10.207.55.138 ApplicationMaster RPC port: 0 queue: default start time: 1510421599744 final status: UNDEFINED tracking URL: http://I77700:8088/proxy/application_1510418026588_0010/ user: root 17/11/11 09:34:40 TRACE ProtobufRpcEngine: 1: Call -> i77700/10.207.55.138:8050: getApplicationReport {application_id { id: 10 cluster_timestamp: 1510418026588 }} 17/11/11 09:34:40 DEBUG Client: IPC Client (669284403) connection to i77700/10.207.55.138:8050 from root sending #144 17/11/11 09:34:40 DEBUG Client: IPC Client (669284403) connection to i77700/10.207.55.138:8050 from root got value #144 17/11/11 09:34:40 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms 17/11/11 09:34:40 TRACE ProtobufRpcEngine: 1: Response <- i77700/10.207.55.138:8050: getApplicationReport {application_report { applicationId { id: 10 cluster_timestamp: 1510418026588 } user: "root" queue: "default" name: "mnist_spark.py" host: "10.207.55.138" rpc_port: 0 yarn_application_state: RUNNING trackingUrl: "http://I77700:8088/proxy/application_1510418026588_0010/" diagnostics: "" startTime: 1510421599744 finishTime: 0 final_application_status: APP_UNDEFINED app_resource_Usage { num_used_containers: 2 num_reserved_containers: 0 used_resources { memory: 6144 virtual_cores: 2 } reserved_resources { memory: 0 virtual_cores: 0 } needed_resources { memory: 6144 virtual_cores: 2 } memory_seconds: 455313 vcore_seconds: 150 queue_usage_percentage: 25.0 cluster_usage_percentage: 25.0 } originalTrackingUrl: "http://10.207.55.138:39700" currentApplicationAttemptId { application_id { id: 10 cluster_timestamp: 1510418026588 } attemptId: 1 } progress: 0.1 applicationType: "SPARK" log_aggregation_status: LOG_NOT_START unmanaged_application: false appNodeLabelExpression: "" amNodeLabelExpression: "" }} 17/11/11 09:34:40 INFO Client: Application report for application_1510418026588_0010 (state: RUNNING) screenshot from 2017-11-11 11-26-44

I've attached the console output and executor logs tfoshdfs.driver.stderr.txt tfoshdfs.driver.stdout.txt tfoshdfs.worker.stderr.txt

thanks again. tfoshdfs.traceall.txt

djygithub commented 6 years ago

Tried adding the full path for the mnist_model as follows, the results are the same:

11/12/2017 two export HADOOP_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_PREFIX=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_HDFS_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_CONF_DIR=/usr/hdp/2.6.0.3-8/hadoop/etc/hadoop export HADOOP_COMMON_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_MAPRED_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.0.3-8/hadoop/libexec export PYTHONPATH=/usr/bin/python export LIB_HDFS=/usr/hdp/2.6.0.3-8/usr/lib:/usr/hdp/current/hadoop-client/lib/native: export PYTHON_ROOT=/usr export LD_LIBRARY_PATH=${PATH} export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python" export PATH=${PYTHON_ROOT}/bin/:$PATH export QUEUE=default export LIB_JVM=/usr/jdk64/jdk1.8.0_112/bin/ export SPARK_HOME=/usr/hdp/current/spark1-client export LIB_CUDA=/usr/local/cuda-8.0/targets/x86_64-linux/lib/ export CUDA_HOME=/root/Downloads/cuda export JAVA_HOME=/usr/jdk64/jdk1.8.0_112/bin/

spark-submit \ --master yarn \ --deploy-mode cluster \ --queue default \ --num-executors 1 \ --executor-memory 2G \ --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA:$LIB_JVM:$LIB_HDFS \ --driver-library-path=$LIB_CUDA \ --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \ /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py \ --images mnist/csv/train/images \ --labels mnist/csv/train/labels \ --mode train \ --model hdfs://I77700:8020/user/root/mnist_model

Tried again in client deploy-mode, same results

spark-submit --master yarn --deploy-mode client --queue default --num-executors 1 --executor-memory 2G --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA:$LIB_JVM:$LIB_HDFS --driver-library-path=$LIB_CUDA --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py --images mnist/csv/train/images --labels mnist/csv/train/labels --mode train --model hdfs://I77700:8020/user/root/mnist_model

Here's an expanded thread dump and stderr threaddump.client.txt stderr.client.txt

Thanks.

djygithub commented 6 years ago

itworks.txt Hello, just like you said, libhdfs.so is the culprit. Copied libhdfs.so: screenshot from 2017-11-13 06-11-16 screenshot from 2017-11-13 06-11-29 screenshot from 2017-11-13 06-11-58

Thanks a ton!

export HADOOP_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_PREFIX=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_HDFS_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_CONF_DIR=/usr/hdp/2.6.0.3-8/hadoop/etc/hadoop export HADOOP_COMMON_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_MAPRED_HOME=/usr/hdp/2.6.0.3-8/hadoop export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.0.3-8/hadoop/libexec export PYTHONPATH=/usr/bin/python export LIB_HDFS=/usr/hdp/2.6.0.3-8/usr/lib:/usr/hdp/current/hadoop-client/lib/native: export PYTHON_ROOT=/usr export LD_LIBRARY_PATH=${PATH} export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python" export PATH=${PYTHON_ROOT}/bin/:$PATH export QUEUE=default export LIB_JVM=/usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/ export SPARK_HOME=/usr/hdp/current/spark1-client export LIB_CUDA=/usr/local/cuda-8.0/targets/x86_64-linux/lib/ export CUDA_HOME=/root/Downloads/cuda export JAVA_HOME=/usr/jdk64/jdk1.8.0_112/bin/

spark-submit \ --master yarn \ --deploy-mode cluster \ --queue default \ --num-executors 1 \ --executor-memory 2G \ --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA:$LIB_JVM:$LIB_HDFS \ --driver-library-path=$LIB_CUDA \ --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \ /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py \ --images mnist/csv/train/images \ --labels mnist/csv/train/labels \ --mode train \ --model hdfs://I77700:8020/user/root/mnist_model

17/11/13 06:04:42 INFO Client: Application report for application_1510578512329_0001 (state: FINISHED) 17/11/13 06:04:42 DEBUG Client: client token: N/A diagnostics: N/A ApplicationMaster host: 10.207.55.138 ApplicationMaster RPC port: 0 queue: default start time: 1510581842004 final status: SUCCEEDED tracking URL: http://I77700:8088/proxy/application_1510578512329_0001/ user: root 17/11/13 06:04:42 TRACE ProtobufRpcEngine: 1: Call -> I77700/10.207.55.138:8020: getFileInfo {src: "/user/root/.sparkStaging/application_1510578512329_0001"} 17/11/13 06:04:42 DEBUG Client: The ping interval is 60000 ms. 17/11/13 06:04:42 DEBUG Client: Connecting to I77700/10.207.55.138:8020 17/11/13 06:04:42 DEBUG Client: IPC Client (1117871068) connection to I77700/10.207.55.138:8020 from root: starting, having connections 2 17/11/13 06:04:42 DEBUG Client: IPC Client (1117871068) connection to I77700/10.207.55.138:8020 from root sending #105 17/11/13 06:04:42 DEBUG Client: IPC Client (1117871068) connection to I77700/10.207.55.138:8020 from root got value #105 17/11/13 06:04:42 DEBUG ProtobufRpcEngine: Call: getFileInfo took 3ms 17/11/13 06:04:42 TRACE ProtobufRpcEngine: 1: Response <- I77700/10.207.55.138:8020: getFileInfo {} 17/11/13 06:04:42 INFO ShutdownHookManager: Shutdown hook called 17/11/13 06:04:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-a9fe2931-5e40-480e-b1d9-3ca7834bdc43 17/11/13 06:04:42 DEBUG Client: stopping client from cache: org.apache.hadoop.ipc.Client@4b7dc788 17/11/13 06:04:42 DEBUG ShutdownHookManager: ShutdownHookManger complete shutdown. [root@I77700 ~]#

[root@I77700 ~]# hdfs dfs -ls /user/root Found 3 items drwxrwxrwx - root root 0 2017-11-13 06:04 /user/root/.sparkStaging drwxrwxrwx - root root 0 2017-11-10 15:21 /user/root/mnist drwxr-xr-x - yarn root 0 2017-11-13 06:04 /user/root/mnist_model [root@I77700 ~]#

lzzzzzzzzz commented 6 years ago

@djygithub you can look issue #33.

lzzzzzzzzz commented 6 years ago

@leewyang first time I try to export LD_LIBRARY_PATH=/home/gzs/hadoop/lib/native i have libhadoop.so in this path,and the warn can't load native is disappeared. but still stuck out. then I try to run commd with --conf LD_LIBRARY_PATH=/home/gzs/hadoop/lib/native but still stuck out. I look up issue #33,the I try to change file. mnist_dist.py line 115 logdir=none line 124 logdir=none It doesn't work. At last,I'm not sure if it's my problem. And tomorrow i will upload a detailed log report for you. Thank you for help!

djygithub commented 6 years ago

The other shoe (inference)

spark-submit \ --master yarn \ --deploy-mode cluster \ --queue default \ --num-executors 1 \ --executor-memory 2G \ --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA:$LIB_JVM:$LIB_HDFS \ --driver-library-path=$LIB_CUDA \ --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \ /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py \ --images mnist/csv/test/images \ --labels mnist/csv/test/labels \ --mode inference \ --model hdfs://I77700:8020/user/root/mnist_model \ --output predictions

screenshot from 2017-11-13 07-50-21

Thanks again.

leewyang commented 6 years ago

@lzzzzzzzzz you will need to use --conf spark.executorEnv.LD_LIBRARY_PATH=/home/gzs/hadoop/lib/native. This sets the LD_LIBRARY_PATH on the executors (not just the driver). Note that the libhdfs.so file must be available at that location on the executors.

lzzzzzzzzz commented 6 years ago

executor0.txt executor1.txt executor2.txt master.txt @leewyang sometimes I will get stderr like this

lzzzzzzzzz commented 6 years ago

executor0.txt executor1.txt executor2.txt master.txt mosttime the stderr like this. throw I copy the libhdfs.so to all slaves in the same path.

lzzzzzzzzz commented 6 years ago

@leewyang Does the task require hardware? Each computer have 2.6G memory and 4 cores (cpu) and one of them does't have a gpu. And the job just take 1G memory.

leewyang commented 6 years ago

@lzzzzzzzzz for the KeyError: 'input' issue, that happens if:

Please see the FAQ for more details.

As for hardware, the MNIST example is fairly small, and should run on CPU nodes in under 1-2GB, and should complete in a couple minutes max.

Are you trying to run a Spark Standalone cluster or a Spark/YARN cluster? Can you describe your setup a bit more?

lzzzzzzzzz commented 6 years ago

finally.txt @leewyang thank you for all the time. I export LD_LIBRARY=usr/java/jdk/jre/amd64/server (And I put the libhdfs.so in the path.) and the whloe log I have already copyed in a txt.

Everything is difficult at first Thank you!

leewyang commented 6 years ago

Glad you got it working!