yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

UnimplementedError: File system scheme hdfs not implemented #115

Closed DingPengfei closed 7 years ago

DingPengfei commented 7 years ago

Hi, I ran the TensorFlowOnSpark MNIST example on a YARN cluster.

CentOS 7.3, Spark 1.6.0 (CDH), Python 2.7.12, Hadoop 2.6.0-cdh5.11.1, tensorflow 1.2.1, tensorflowonspark 1.0.0

I ran into a weird problem. I ran the MNIST example about 50 times, and roughly half of the runs failed. After looking through the logs from the failed runs, I found:

  1. There are two types of HDFS-related errors.
  2. Every failed application had a particular host in the cluster (call it "35") allocated as an executor.
  3. Some applications that included that host still ran successfully (weird, isn't it?)

I checked the LD_LIBRARY_PATH on that "35" host:

$ ls -l /opt/cloudera/parcels/CDH/lib64/libhdfs.so*
lrwxrwxrwx. 1 root root    16 Jun  2 03:28 /opt/cloudera/parcels/CDH/lib64/libhdfs.so -> libhdfs.so.0.0.0
-rwxrwxrwx. 1 root root 69952 Jun  2 03:28 /opt/cloudera/parcels/CDH/lib64/libhdfs.so.0.0.0
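For reference, here is a sketch of how I understand the executor environment has to be set up for TensorFlow to read hdfs:// paths: LD_LIBRARY_PATH must contain the directories holding libhdfs.so and libjvm.so, and CLASSPATH must contain the Hadoop jars. The JAVA_HOME and CDH paths below are assumptions from this cluster's layout, not a verified fix:

```shell
# Environment TensorFlow's HDFS support needs on each executor
# (paths assume this cluster's CDH/JDK layout; adjust to your install).
export JAVA_HOME=/usr/java/default                  # assumption: JDK location
export HADOOP_HOME=/opt/cloudera/parcels/CDH        # CDH parcel root, per the ls above
export LIB_HDFS=$HADOOP_HOME/lib64                  # directory containing libhdfs.so
export LIB_JVM=$JAVA_HOME/jre/lib/amd64/server      # directory containing libjvm.so

spark-submit \
  --master yarn \
  --conf spark.executorEnv.LD_LIBRARY_PATH="$LIB_HDFS:$LIB_JVM" \
  --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)" \
  ... mnist_spark.py ...
```

If any executor host is missing one of these pieces, only tasks scheduled on that host would fail, which might explain why the failures track a single node.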

Type 1:

Logs for container_1501486406932_0048_01_000004
17/08/01 17:38:22 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 9166@hellowin35
17/08/01 17:38:22 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/08/01 17:38:23 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 17:38:23 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 17:38:23 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 17:38:24 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 17:38:24 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 17:38:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 17:38:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/08/01 17:38:24 INFO Remoting: Starting remoting
17/08/01 17:38:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@hellowin35:36969]
17/08/01 17:38:25 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@hellowin35:36969]
17/08/01 17:38:25 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 36969.
17/08/01 17:38:25 INFO storage.DiskBlockManager: Created local directory at /yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0048/blockmgr-065b53b3-b7f3-47ba-9d43-ecdf05965f33
17/08/01 17:38:25 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
17/08/01 17:38:25 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.9.33:49046
17/08/01 17:38:25 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/08/01 17:38:25 INFO executor.Executor: Starting executor ID 3 on host hellowin35
17/08/01 17:38:25 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39052.
17/08/01 17:38:25 INFO netty.NettyBlockTransferService: Server created on 39052
17/08/01 17:38:25 INFO storage.BlockManager: external shuffle service port = 7337
17/08/01 17:38:25 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/08/01 17:38:25 INFO storage.BlockManagerMaster: Registered BlockManager
17/08/01 17:38:25 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/08/01 17:38:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/08/01 17:38:27 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/01 17:38:28 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/08/01 17:38:28 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.6 KB, free 2.1 GB)
17/08/01 17:38:28 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 446 ms
17/08/01 17:38:29 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.2 KB, free 2.1 GB)
2017-08-01 17:38:29,890 INFO (MainThread-9399) connected to server at ('hellowin33', 47263)
2017-08-01 17:38:29,892 INFO (MainThread-9399) TFSparkNode.reserve: {'authkey': '8\x11B\xcf\xb2ZF\x0c\x9d\x12\x80\xc5i}\xa6\x8c', 'worker_num': 1, 'host': 'hellowin35', 'tb_port': 0, 'addr': '/tmp/pymp-5tW_P0/listener-X42nb_', 'ppid': 9393, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 36372}
2017-08-01 17:38:31,901 INFO (MainThread-9399) node: {'addr': ('hellowin31', 37593), 'task_index': 0, 'job_name': 'ps', 'authkey': 'p\x11\xc6;\xe7\x03J!\xb8-\xe7\x08\x0b\xd8\x80\x9c', 'worker_num': 0, 'host': 'hellowin31', 'ppid': 13276, 'port': 35988, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:31,902 INFO (MainThread-9399) node: {'addr': '/tmp/pymp-5tW_P0/listener-X42nb_', 'task_index': 0, 'job_name': 'worker', 'authkey': '8\x11B\xcf\xb2ZF\x0c\x9d\x12\x80\xc5i}\xa6\x8c', 'worker_num': 1, 'host': 'hellowin35', 'ppid': 9393, 'port': 36372, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:31,902 INFO (MainThread-9399) node: {'addr': '/tmp/pymp-Yue0LD/listener-pPXRVD', 'task_index': 1, 'job_name': 'worker', 'authkey': '\xb6_\xcdp(lH\xe6\xb7g\xda\xd9T)\xae\xc6', 'worker_num': 2, 'host': 'hellowin31', 'ppid': 13278, 'port': 33884, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:31,902 INFO (MainThread-9399) node: {'addr': '/tmp/pymp-6kKyJj/listener-JNGx4u', 'task_index': 2, 'job_name': 'worker', 'authkey': '\xb0i\x0bF2LCU\x8fQ.\xccr|\x0c\x99', 'worker_num': 3, 'host': 'hellowin35', 'ppid': 9395, 'port': 38428, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:32,296 INFO (MainThread-9399) Starting TensorFlow worker:0 on cluster node 1 on background process
17/08/01 17:38:32 INFO python.PythonRunner: Times: total = 3168, boot = 549, init = 143, finish = 2476
17/08/01 17:38:32 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 954 bytes result sent to driver
17/08/01 17:38:32 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
17/08/01 17:38:32 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 4)
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 32.5 KB, free 2.1 GB)
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 14 ms
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 86.8 KB, free 2.1 GB)
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/08/01 17:38:32 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/images/part-00000:0+1810248
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 29 ms
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
17/08/01 17:38:34 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/labels/part-00000:0+40000
17/08/01 17:38:34 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/08/01 17:38:34 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 17:38:34 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 18 ms
17/08/01 17:38:34 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
2017-08-01 17:38:34,820 INFO (MainThread-9554) Connected to TFSparkNode.mgr on hellowin35, ppid=9393, state='running'
2017-08-01 17:38:34,830 INFO (MainThread-9554) Feeding partition <generator object load_stream at 0x2c9afa0> into input queue <multiprocessing.queues.JoinableQueue object at 0x2ce3210>
17/08/01 17:38:35 INFO python.PythonRunner: Times: total = 1053, boot = -1977, init = 2205, finish = 825
17/08/01 17:38:35 INFO python.PythonRunner: Times: total = 20, boot = 2, init = 6, finish = 12
2017-08-01 17:38:37,995 INFO (MainThread-9528) 1: ======== worker:0 ========
2017-08-01 17:38:37,995 INFO (MainThread-9528) 1: Cluster spec: {'ps': ['hellowin31:35988'], 'worker': ['hellowin35:36372', 'hellowin31:33884', 'hellowin35:38428']}
2017-08-01 17:38:37,995 INFO (MainThread-9528) 1: Using CPU
2017-08-01 17:38:37.997148: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997282: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997333: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997374: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997416: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:38.014298: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> hellowin31:35988}
2017-08-01 17:38:38.014404: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:36372, 1 -> hellowin31:33884, 2 -> hellowin35:38428}
2017-08-01 17:38:38.016751: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:36372
tensorflow model path: hdfs://hellowin30:8020/user/dingpengfei/mnist_model
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0048/container_1501486406932_0048_01_000004/__pyfiles__/mnist_dist.py", line 129, in map_fun
    save_model_secs=0)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in __init__
    self._summary_writer = _summary.FileWriter(self._logdir)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in __init__
    filename_suffix)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented

Type 2:

Logs for container_1501486406932_0036_01_000004
17/08/01 16:51:33 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 24585@hellowin33
17/08/01 16:51:33 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/08/01 16:51:33 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 16:51:33 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 16:51:33 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 16:51:34 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 16:51:34 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 16:51:34 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 16:51:34 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/08/01 16:51:34 INFO Remoting: Starting remoting
17/08/01 16:51:34 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@hellowin33:45798]
17/08/01 16:51:34 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@hellowin33:45798]
17/08/01 16:51:34 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 45798.
17/08/01 16:51:34 INFO storage.DiskBlockManager: Created local directory at /yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/blockmgr-125c5bab-cae3-4fd5-a27a-6e5ccb0975cf
17/08/01 16:51:34 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
17/08/01 16:51:35 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.9.35:39393
17/08/01 16:51:35 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/08/01 16:51:35 INFO executor.Executor: Starting executor ID 3 on host hellowin33
17/08/01 16:51:35 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50966.
17/08/01 16:51:35 INFO netty.NettyBlockTransferService: Server created on 50966
17/08/01 16:51:35 INFO storage.BlockManager: external shuffle service port = 7337
17/08/01 16:51:35 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/08/01 16:51:35 INFO storage.BlockManagerMaster: Registered BlockManager
17/08/01 16:51:35 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/08/01 16:51:38 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/08/01 16:51:38 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/01 16:51:39 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/08/01 16:51:39 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.6 KB, free 2.1 GB)
17/08/01 16:51:39 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 212 ms
17/08/01 16:51:39 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.2 KB, free 2.1 GB)
2017-08-01 16:51:41,604 INFO (MainThread-24891) connected to server at ('hellowin35', 45357)
2017-08-01 16:51:41,607 INFO (MainThread-24891) TFSparkNode.reserve: {'authkey': '@\x90\x87\xf3\xdf\x04K\n\xac\xe8\xea!\xbd.\x0f"', 'worker_num': 1, 'host': 'hellowin33', 'tb_port': 0, 'addr': '/tmp/pymp-QWIP1q/listener-eL5eee', 'ppid': 24842, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 37071}
2017-08-01 16:51:42,611 INFO (MainThread-24891) node: {'addr': ('hellowin35', 43619), 'task_index': 0, 'job_name': 'ps', 'authkey': '@*,\x174KH\x8b\x96 \xb7yvB\xe9\x03', 'worker_num': 0, 'host': 'hellowin35', 'ppid': 24990, 'port': 37877, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,612 INFO (MainThread-24891) node: {'addr': '/tmp/pymp-QWIP1q/listener-eL5eee', 'task_index': 0, 'job_name': 'worker', 'authkey': '@\x90\x87\xf3\xdf\x04K\n\xac\xe8\xea!\xbd.\x0f"', 'worker_num': 1, 'host': 'hellowin33', 'ppid': 24842, 'port': 37071, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,612 INFO (MainThread-24891) node: {'addr': '/tmp/pymp-fkTMrU/listener-K_XNMe', 'task_index': 1, 'job_name': 'worker', 'authkey': '\x08P \x0e\x7f\x16N\xf2\xbb\xe39y"\xe0\xc6\x81', 'worker_num': 2, 'host': 'hellowin35', 'ppid': 24992, 'port': 37874, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,612 INFO (MainThread-24891) node: {'addr': '/tmp/pymp-Gyr_RH/listener-61e5HK', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x18\x12r\xcf\xbd\xd6B:\xbd\xe3~\x14\x8c\x18\xa88', 'worker_num': 3, 'host': 'hellowin33', 'ppid': 24840, 'port': 36814, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,854 INFO (MainThread-24891) Starting TensorFlow worker:0 on cluster node 1 on background process
17/08/01 16:51:42 INFO python.PythonRunner: Times: total = 3416, boot = 2048, init = 104, finish = 1264
17/08/01 16:51:42 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 954 bytes result sent to driver
17/08/01 16:51:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
17/08/01 16:51:42 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 4)
17/08/01 16:51:42 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/08/01 16:51:42 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 32.5 KB, free 2.1 GB)
17/08/01 16:51:42 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 20 ms
17/08/01 16:51:42 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 86.8 KB, free 2.1 GB)
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/08/01 16:51:43 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/images/part-00000:0+1810248
17/08/01 16:51:43 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/08/01 16:51:43 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 16:51:43 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 17 ms
17/08/01 16:51:43 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
2017-08-01 16:51:43,743 INFO (MainThread-24960) 1: ======== worker:0 ========
2017-08-01 16:51:43,743 INFO (MainThread-24960) 1: Cluster spec: {'ps': ['hellowin35:37877'], 'worker': ['hellowin33:37071', 'hellowin35:37874', 'hellowin33:36814']}
2017-08-01 16:51:43,743 INFO (MainThread-24960) 1: Using CPU
2017-08-01 16:51:43.744403: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744421: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744426: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744431: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744436: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.749436: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> hellowin35:37877}
2017-08-01 16:51:43.749478: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:37071, 1 -> hellowin35:37874, 2 -> hellowin33:36814}
2017-08-01 16:51:43.750748: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:37071
tensorflow model path: hdfs://hellowin30:8020/user/dingpengfei/mnist_model
17/08/01 16:51:44 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/labels/part-00000:0+40000
17/08/01 16:51:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/08/01 16:51:44 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 16:51:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 9 ms
17/08/01 16:51:44 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
2017-08-01 16:51:44,177 INFO (MainThread-25078) Connected to TFSparkNode.mgr on hellowin33, ppid=24842, state='running'
2017-08-01 16:51:44,180 INFO (MainThread-25078) Feeding partition <generator object load_stream at 0x26caeb0> into input queue <multiprocessing.queues.JoinableQueue object at 0x55f9590>
17/08/01 16:51:44 INFO python.PythonRunner: Times: total = 589, boot = -985, init = 1047, finish = 527
17/08/01 16:51:44 INFO python.PythonRunner: Times: total = 18, boot = 4, init = 4, finish = 10
17/08/01 16:51:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO:tensorflow:Restoring parameters from hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498
2017-08-01 16:51:45,692 INFO (MainThread-24960) Restoring parameters from hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498
2017-08-01 16:51:55.368497: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 364f1965662ccbb1 with config: 

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Caused by op u'save/RestoreV2_7', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 759, in func
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 404, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 104, in map_fun
    saver = tf.train.Saver()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

2017-08-01 16:51:55,395 INFO (MainThread-24960) Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Caused by op u'save/RestoreV2_7', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 759, in func
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 404, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 104, in map_fun
    saver = tf.train.Saver()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 133, in map_fun
    with sv.managed_session(server.target) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 205, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Caused by op u'save/RestoreV2_7', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 759, in func
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 404, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 104, in map_fun
    saver = tf.train.Saver()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]
leewyang commented 7 years ago

Per your description, it sounds like host 35 exhibits intermittent failures w.r.t. libhdfs.so.

FWIW, we had to add some code to expand the Hadoop classpath for Spark 2.x + JNI. So it sounds like either: 1) that code is returning slightly different classpaths on host 35 (vs. other nodes), OR 2) something is intermittently mangling the LD_LIBRARY_PATH on that node, OR 3) something else entirely???

For 1 and 2, you can add some debug logs inside your mnist_dist.map_fun() to see what each TF process sees as its environment. For 3, you can send me the full yarn logs at leewyang@gmail.com (preferably with the log statements from 1 and 2) and I can take a look. And last but not least, if it's just that one node causing problems, re-imaging it as a new node might help (if that's a possibility).
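The suggested environment logging could be a small helper along these lines (a sketch, not code from this repo; the variable names are assumptions about what loading libhdfs.so typically depends on), called at the top of mnist_dist.map_fun():

```python
import os

def log_worker_env(prefix="TF worker"):
    """Print (and return) the env vars that loading libhdfs.so typically depends on."""
    keys = ("LD_LIBRARY_PATH", "CLASSPATH", "JAVA_HOME", "HADOOP_HDFS_HOME")
    env = {k: os.environ.get(k, "<unset>") for k in keys}
    for k in keys:
        print("{}: {}={}".format(prefix, k, env[k]))
    return env
```

Comparing this output between a failing task on host 35 and a successful task on another node would show whether the classpath or library path actually differs at TF process launch time.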

DingPengfei commented 7 years ago

Hi~ leewyang, I did the 2 steps you suggested, and compared the environment debug logs between a success and a failure using a text-compare tool. They are exactly the same. I've decided to rebuild that node as a new one. Thanks for your help~ btw, I have sent the logs to you.

DingPengfei commented 7 years ago

I have another question: I have already installed tensorflowonspark as a Python module, and the Python.zip "distribution" can be shipped to the Spark executors at runtime. Why do I also have to install TensorFlowOnSpark and send tfspark.zip to the Spark executors?

leewyang commented 7 years ago

You only need to do one or the other... I've updated the instructions to try to clarify a bit. Thanks for the question/feedback.
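To illustrate the two alternatives, the submit commands would look roughly like this (a hypothetical sketch; the script name, image path, and zip name are placeholders modeled on the MNIST example, not exact commands from the docs):

```shell
# Option 1: tensorflowonspark is pip-installed on every cluster node,
# so there is no need to ship the module with the job.
spark-submit --master yarn --deploy-mode cluster \
  mnist_spark.py --images mnist/csv/train/images

# Option 2: the module is NOT installed cluster-wide, so ship it
# to the executors at runtime instead.
spark-submit --master yarn --deploy-mode cluster \
  --py-files tfspark.zip \
  mnist_spark.py --images mnist/csv/train/images
```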

xuande commented 7 years ago

@DingPengfei I have the same problem as you (#125). Did you solve it? If you did, could you share the solution with me? Thanks.

DingPengfei commented 7 years ago

My problem only happened on one particular node, and I still can't figure out why. As leewyang suggested, I just rebuilt that node as a new one.

xuande commented 7 years ago

@DingPengfei Thanks for your reply. It was caused by the TensorFlow version. After reinstalling, it runs normally (#125).

DingPengfei commented 7 years ago

That's great! I will try.

shaygeller commented 7 years ago

@xuande Which versions of TensorFlow and Python worked for you? Were you able to save the model to HDFS? Thanks