yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

UnimplementedError: File system scheme hdfs not implemented #115

Closed DingPengfei closed 7 years ago

DingPengfei commented 7 years ago

Hi, I ran the TensorFlowOnSpark MNIST example on a YARN cluster.

CentOS 7.3, Spark 1.6.0 (CDH), Python 2.7.12, Hadoop 2.6.0-cdh5.11.1, tensorflow 1.2.1, tensorflowonspark 1.0.0

I ran into a weird problem. I ran the MNIST example about 50 times, and roughly half of the runs failed. After looking through the logs from the failed runs, I found:

  1. There are two types of HDFS-related errors.
  2. Every failed application had a particular host in the cluster (call it "35") allocated as an executor.
  3. Some applications that included that host still ran successfully (weird, isn't it?)

I checked the LD_LIBRARY_PATH on that "35" host:

$ ls -l /opt/cloudera/parcels/CDH/lib64/libhdfs.so*
lrwxrwxrwx. 1 root root    16 Jun  2 03:28 /opt/cloudera/parcels/CDH/lib64/libhdfs.so -> libhdfs.so.0.0.0
-rwxrwxrwx. 1 root root 69952 Jun  2 03:28 /opt/cloudera/parcels/CDH/lib64/libhdfs.so.0.0.0
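For reference, here is a sketch of how I understand the executor environment has to be set up for TensorFlow to read hdfs:// paths: LD_LIBRARY_PATH must contain the directories holding libhdfs.so and libjvm.so, and CLASSPATH must contain the Hadoop jars. The JAVA_HOME and CDH paths below are assumptions from this cluster's layout, not a verified fix:

```shell
# Environment TensorFlow's HDFS support needs on each executor
# (paths assume this cluster's CDH/JDK layout; adjust to your install).
export JAVA_HOME=/usr/java/default                  # assumption: JDK location
export HADOOP_HOME=/opt/cloudera/parcels/CDH        # CDH parcel root, per the ls above
export LIB_HDFS=$HADOOP_HOME/lib64                  # directory containing libhdfs.so
export LIB_JVM=$JAVA_HOME/jre/lib/amd64/server      # directory containing libjvm.so

spark-submit \
  --master yarn \
  --conf spark.executorEnv.LD_LIBRARY_PATH="$LIB_HDFS:$LIB_JVM" \
  --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)" \
  ... mnist_spark.py ...
```

If any executor host is missing one of these pieces, only tasks scheduled on that host would fail, which might explain why the failures track a single node.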

Type 1:

Logs for container_1501486406932_0048_01_000004
17/08/01 17:38:22 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 9166@hellowin35
17/08/01 17:38:22 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/08/01 17:38:23 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 17:38:23 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 17:38:23 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 17:38:24 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 17:38:24 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 17:38:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 17:38:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/08/01 17:38:24 INFO Remoting: Starting remoting
17/08/01 17:38:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@hellowin35:36969]
17/08/01 17:38:25 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@hellowin35:36969]
17/08/01 17:38:25 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 36969.
17/08/01 17:38:25 INFO storage.DiskBlockManager: Created local directory at /yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0048/blockmgr-065b53b3-b7f3-47ba-9d43-ecdf05965f33
17/08/01 17:38:25 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
17/08/01 17:38:25 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.9.33:49046
17/08/01 17:38:25 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/08/01 17:38:25 INFO executor.Executor: Starting executor ID 3 on host hellowin35
17/08/01 17:38:25 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39052.
17/08/01 17:38:25 INFO netty.NettyBlockTransferService: Server created on 39052
17/08/01 17:38:25 INFO storage.BlockManager: external shuffle service port = 7337
17/08/01 17:38:25 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/08/01 17:38:25 INFO storage.BlockManagerMaster: Registered BlockManager
17/08/01 17:38:25 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/08/01 17:38:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/08/01 17:38:27 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/01 17:38:28 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/08/01 17:38:28 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.6 KB, free 2.1 GB)
17/08/01 17:38:28 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 446 ms
17/08/01 17:38:29 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.2 KB, free 2.1 GB)
2017-08-01 17:38:29,890 INFO (MainThread-9399) connected to server at ('hellowin33', 47263)
2017-08-01 17:38:29,892 INFO (MainThread-9399) TFSparkNode.reserve: {'authkey': '8\x11B\xcf\xb2ZF\x0c\x9d\x12\x80\xc5i}\xa6\x8c', 'worker_num': 1, 'host': 'hellowin35', 'tb_port': 0, 'addr': '/tmp/pymp-5tW_P0/listener-X42nb_', 'ppid': 9393, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 36372}
2017-08-01 17:38:31,901 INFO (MainThread-9399) node: {'addr': ('hellowin31', 37593), 'task_index': 0, 'job_name': 'ps', 'authkey': 'p\x11\xc6;\xe7\x03J!\xb8-\xe7\x08\x0b\xd8\x80\x9c', 'worker_num': 0, 'host': 'hellowin31', 'ppid': 13276, 'port': 35988, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:31,902 INFO (MainThread-9399) node: {'addr': '/tmp/pymp-5tW_P0/listener-X42nb_', 'task_index': 0, 'job_name': 'worker', 'authkey': '8\x11B\xcf\xb2ZF\x0c\x9d\x12\x80\xc5i}\xa6\x8c', 'worker_num': 1, 'host': 'hellowin35', 'ppid': 9393, 'port': 36372, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:31,902 INFO (MainThread-9399) node: {'addr': '/tmp/pymp-Yue0LD/listener-pPXRVD', 'task_index': 1, 'job_name': 'worker', 'authkey': '\xb6_\xcdp(lH\xe6\xb7g\xda\xd9T)\xae\xc6', 'worker_num': 2, 'host': 'hellowin31', 'ppid': 13278, 'port': 33884, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:31,902 INFO (MainThread-9399) node: {'addr': '/tmp/pymp-6kKyJj/listener-JNGx4u', 'task_index': 2, 'job_name': 'worker', 'authkey': '\xb0i\x0bF2LCU\x8fQ.\xccr|\x0c\x99', 'worker_num': 3, 'host': 'hellowin35', 'ppid': 9395, 'port': 38428, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 17:38:32,296 INFO (MainThread-9399) Starting TensorFlow worker:0 on cluster node 1 on background process
17/08/01 17:38:32 INFO python.PythonRunner: Times: total = 3168, boot = 549, init = 143, finish = 2476
17/08/01 17:38:32 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 954 bytes result sent to driver
17/08/01 17:38:32 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
17/08/01 17:38:32 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 4)
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 32.5 KB, free 2.1 GB)
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 14 ms
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 86.8 KB, free 2.1 GB)
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/08/01 17:38:32 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/08/01 17:38:32 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/images/part-00000:0+1810248
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 17:38:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 29 ms
17/08/01 17:38:32 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
17/08/01 17:38:34 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/labels/part-00000:0+40000
17/08/01 17:38:34 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/08/01 17:38:34 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 17:38:34 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 18 ms
17/08/01 17:38:34 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
2017-08-01 17:38:34,820 INFO (MainThread-9554) Connected to TFSparkNode.mgr on hellowin35, ppid=9393, state='running'
2017-08-01 17:38:34,830 INFO (MainThread-9554) Feeding partition <generator object load_stream at 0x2c9afa0> into input queue <multiprocessing.queues.JoinableQueue object at 0x2ce3210>
17/08/01 17:38:35 INFO python.PythonRunner: Times: total = 1053, boot = -1977, init = 2205, finish = 825
17/08/01 17:38:35 INFO python.PythonRunner: Times: total = 20, boot = 2, init = 6, finish = 12
2017-08-01 17:38:37,995 INFO (MainThread-9528) 1: ======== worker:0 ========
2017-08-01 17:38:37,995 INFO (MainThread-9528) 1: Cluster spec: {'ps': ['hellowin31:35988'], 'worker': ['hellowin35:36372', 'hellowin31:33884', 'hellowin35:38428']}
2017-08-01 17:38:37,995 INFO (MainThread-9528) 1: Using CPU
2017-08-01 17:38:37.997148: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997282: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997333: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997374: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:37.997416: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 17:38:38.014298: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> hellowin31:35988}
2017-08-01 17:38:38.014404: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:36372, 1 -> hellowin31:33884, 2 -> hellowin35:38428}
2017-08-01 17:38:38.016751: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:36372
tensorflow model path: hdfs://hellowin30:8020/user/dingpengfei/mnist_model
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0048/container_1501486406932_0048_01_000004/__pyfiles__/mnist_dist.py", line 129, in map_fun
    save_model_secs=0)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in __init__
    self._summary_writer = _summary.FileWriter(self._logdir)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in __init__
    filename_suffix)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented

Type 2:

Logs for container_1501486406932_0036_01_000004
17/08/01 16:51:33 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 24585@hellowin33
17/08/01 16:51:33 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/08/01 16:51:33 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 16:51:33 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 16:51:33 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 16:51:34 INFO spark.SecurityManager: Changing view acls to: yarn,dingpengfei
17/08/01 16:51:34 INFO spark.SecurityManager: Changing modify acls to: yarn,dingpengfei
17/08/01 16:51:34 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dingpengfei); users with modify permissions: Set(yarn, dingpengfei)
17/08/01 16:51:34 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/08/01 16:51:34 INFO Remoting: Starting remoting
17/08/01 16:51:34 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@hellowin33:45798]
17/08/01 16:51:34 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@hellowin33:45798]
17/08/01 16:51:34 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 45798.
17/08/01 16:51:34 INFO storage.DiskBlockManager: Created local directory at /yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/blockmgr-125c5bab-cae3-4fd5-a27a-6e5ccb0975cf
17/08/01 16:51:34 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
17/08/01 16:51:35 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.9.35:39393
17/08/01 16:51:35 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/08/01 16:51:35 INFO executor.Executor: Starting executor ID 3 on host hellowin33
17/08/01 16:51:35 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50966.
17/08/01 16:51:35 INFO netty.NettyBlockTransferService: Server created on 50966
17/08/01 16:51:35 INFO storage.BlockManager: external shuffle service port = 7337
17/08/01 16:51:35 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/08/01 16:51:35 INFO storage.BlockManagerMaster: Registered BlockManager
17/08/01 16:51:35 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/08/01 16:51:38 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/08/01 16:51:38 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/01 16:51:39 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/08/01 16:51:39 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.6 KB, free 2.1 GB)
17/08/01 16:51:39 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 212 ms
17/08/01 16:51:39 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.2 KB, free 2.1 GB)
2017-08-01 16:51:41,604 INFO (MainThread-24891) connected to server at ('hellowin35', 45357)
2017-08-01 16:51:41,607 INFO (MainThread-24891) TFSparkNode.reserve: {'authkey': '@\x90\x87\xf3\xdf\x04K\n\xac\xe8\xea!\xbd.\x0f"', 'worker_num': 1, 'host': 'hellowin33', 'tb_port': 0, 'addr': '/tmp/pymp-QWIP1q/listener-eL5eee', 'ppid': 24842, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 37071}
2017-08-01 16:51:42,611 INFO (MainThread-24891) node: {'addr': ('hellowin35', 43619), 'task_index': 0, 'job_name': 'ps', 'authkey': '@*,\x174KH\x8b\x96 \xb7yvB\xe9\x03', 'worker_num': 0, 'host': 'hellowin35', 'ppid': 24990, 'port': 37877, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,612 INFO (MainThread-24891) node: {'addr': '/tmp/pymp-QWIP1q/listener-eL5eee', 'task_index': 0, 'job_name': 'worker', 'authkey': '@\x90\x87\xf3\xdf\x04K\n\xac\xe8\xea!\xbd.\x0f"', 'worker_num': 1, 'host': 'hellowin33', 'ppid': 24842, 'port': 37071, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,612 INFO (MainThread-24891) node: {'addr': '/tmp/pymp-fkTMrU/listener-K_XNMe', 'task_index': 1, 'job_name': 'worker', 'authkey': '\x08P \x0e\x7f\x16N\xf2\xbb\xe39y"\xe0\xc6\x81', 'worker_num': 2, 'host': 'hellowin35', 'ppid': 24992, 'port': 37874, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,612 INFO (MainThread-24891) node: {'addr': '/tmp/pymp-Gyr_RH/listener-61e5HK', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x18\x12r\xcf\xbd\xd6B:\xbd\xe3~\x14\x8c\x18\xa88', 'worker_num': 3, 'host': 'hellowin33', 'ppid': 24840, 'port': 36814, 'tb_pid': 0, 'tb_port': 0}
2017-08-01 16:51:42,854 INFO (MainThread-24891) Starting TensorFlow worker:0 on cluster node 1 on background process
17/08/01 16:51:42 INFO python.PythonRunner: Times: total = 3416, boot = 2048, init = 104, finish = 1264
17/08/01 16:51:42 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 954 bytes result sent to driver
17/08/01 16:51:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
17/08/01 16:51:42 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 4)
17/08/01 16:51:42 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/08/01 16:51:42 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 32.5 KB, free 2.1 GB)
17/08/01 16:51:42 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 20 ms
17/08/01 16:51:42 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 86.8 KB, free 2.1 GB)
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/08/01 16:51:43 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/08/01 16:51:43 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/images/part-00000:0+1810248
17/08/01 16:51:43 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/08/01 16:51:43 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 16:51:43 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 17 ms
17/08/01 16:51:43 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
2017-08-01 16:51:43,743 INFO (MainThread-24960) 1: ======== worker:0 ========
2017-08-01 16:51:43,743 INFO (MainThread-24960) 1: Cluster spec: {'ps': ['hellowin35:37877'], 'worker': ['hellowin33:37071', 'hellowin35:37874', 'hellowin33:36814']}
2017-08-01 16:51:43,743 INFO (MainThread-24960) 1: Using CPU
2017-08-01 16:51:43.744403: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744421: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744426: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744431: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.744436: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-01 16:51:43.749436: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> hellowin35:37877}
2017-08-01 16:51:43.749478: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:37071, 1 -> hellowin35:37874, 2 -> hellowin33:36814}
2017-08-01 16:51:43.750748: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:37071
tensorflow model path: hdfs://hellowin30:8020/user/dingpengfei/mnist_model
17/08/01 16:51:44 INFO rdd.HadoopRDD: Input split: hdfs://hellowin30:8020/user/dingpengfei/mnist/csv/test/labels/part-00000:0+40000
17/08/01 16:51:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/08/01 16:51:44 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.9 KB, free 2.1 GB)
17/08/01 16:51:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 9 ms
17/08/01 16:51:44 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 376.1 KB, free 2.1 GB)
2017-08-01 16:51:44,177 INFO (MainThread-25078) Connected to TFSparkNode.mgr on hellowin33, ppid=24842, state='running'
2017-08-01 16:51:44,180 INFO (MainThread-25078) Feeding partition <generator object load_stream at 0x26caeb0> into input queue <multiprocessing.queues.JoinableQueue object at 0x55f9590>
17/08/01 16:51:44 INFO python.PythonRunner: Times: total = 589, boot = -985, init = 1047, finish = 527
17/08/01 16:51:44 INFO python.PythonRunner: Times: total = 18, boot = 4, init = 4, finish = 10
17/08/01 16:51:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO:tensorflow:Restoring parameters from hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498
2017-08-01 16:51:45,692 INFO (MainThread-24960) Restoring parameters from hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498
2017-08-01 16:51:55.368497: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 364f1965662ccbb1 with config: 

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Caused by op u'save/RestoreV2_7', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 759, in func
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 404, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 104, in map_fun
    saver = tf.train.Saver()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

2017-08-01 16:51:55,395 INFO (MainThread-24960) Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Caused by op u'save/RestoreV2_7', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 759, in func
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 404, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 104, in map_fun
    saver = tf.train.Saver()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 133, in map_fun
    with sv.managed_session(server.target) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 205, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]

Caused by op u'save/RestoreV2_7', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 759, in func
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 404, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/yarn/nm/usercache/dingpengfei/appcache/application_1501486406932_0036/container_1501486406932_0036_01_000004/__pyfiles__/mnist_dist.py", line 104, in map_fun
    saver = tf.train.Saver()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on hdfs://hellowin30:8020/user/dingpengfei/mnist_model/model.ckpt-498: Unimplemented: File system scheme hdfs not implemented
     [[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]
leewyang commented 7 years ago

Per your description, it sounds like host 35 exhibits intermittent failures w.r.t. libhdfs.so.

FWIW, we had to add some code to expand the Hadoop classpath for Spark 2.x + JNI. So it sounds like either: 1) that code is returning slightly different classpaths on host 35 (vs. other nodes), OR 2) something is intermittently mangling the LD_LIBRARY_PATH on that node, OR 3) something else entirely???

For 1 and 2, you can add some debug logs inside your mnist_dist.map_fun() to see what each TF process sees as its environment. For 3, you can send me the full yarn logs at leewyang@gmail.com (preferably with the log statements from 1 and 2) and I can take a look. And last but not least, if it's just that one node causing problems, re-imaging it as a new node might help (if that's a possibility).
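The suggested environment logging could be a small helper along these lines (a sketch, not code from this repo; the variable names are assumptions about what loading libhdfs.so typically depends on), called at the top of mnist_dist.map_fun():

```python
import os

def log_worker_env(prefix="TF worker"):
    """Print (and return) the env vars that loading libhdfs.so typically depends on."""
    keys = ("LD_LIBRARY_PATH", "CLASSPATH", "JAVA_HOME", "HADOOP_HDFS_HOME")
    env = {k: os.environ.get(k, "<unset>") for k in keys}
    for k in keys:
        print("{}: {}={}".format(prefix, k, env[k]))
    return env
```

Comparing this output between a failing task on host 35 and a successful task on another node would show whether the classpath or library path actually differs at TF process launch time.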

DingPengfei commented 7 years ago

Hi~ leewyang, I did the 2 steps you suggested, and compared the environment debug logs between a success and a failure using a text-compare tool. They are exactly the same. I've decided to rebuild that node as a new one. Thanks for your help~ btw, I have sent the logs to you.

DingPengfei commented 7 years ago

I have another question: I have already installed tensorflowonspark as a Python module, and the Python.zip "distribution" can be shipped to the Spark executors at runtime. Why do I also have to install TensorFlowOnSpark and send tfspark.zip to the Spark executors?

leewyang commented 7 years ago

You only need to do one or the other... I've updated the instructions to try to clarify a bit. Thanks for the question/feedback.
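To illustrate the two alternatives, the submit commands would look roughly like this (a hypothetical sketch; the script name, image path, and zip name are placeholders modeled on the MNIST example, not exact commands from the docs):

```shell
# Option 1: tensorflowonspark is pip-installed on every cluster node,
# so there is no need to ship the module with the job.
spark-submit --master yarn --deploy-mode cluster \
  mnist_spark.py --images mnist/csv/train/images

# Option 2: the module is NOT installed cluster-wide, so ship it
# to the executors at runtime instead.
spark-submit --master yarn --deploy-mode cluster \
  --py-files tfspark.zip \
  mnist_spark.py --images mnist/csv/train/images
```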

xuande commented 7 years ago

@DingPengfei I have the same problem as you (#125). Did you solve it? If you did, could you share the solution with me? Thanks.

DingPengfei commented 7 years ago

My problem only happened on one particular node, and I still can't figure out why. As leewyang suggested, I just rebuilt that node as a new one.

xuande commented 7 years ago

@DingPengfei Thanks for your reply. It was caused by the TensorFlow version. After reinstalling, it runs normally (#125).

DingPengfei commented 7 years ago

That's great! I will try.

shaygeller commented 7 years ago

@xuande Which versions of TensorFlow and Python worked for you? Were you able to save the model to HDFS? Thanks