yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

UnimplementedError: File system scheme hdfs not implemented #126

Closed shaygeller closed 7 years ago

shaygeller commented 7 years ago

Hello, I'm trying to run the mnist example and save the TensorFlow model in HDFS. I tried changing the logdir to `logdir=None` as suggested here, and it runs fine, but without saving the model to HDFS.

I use: Spark 2.1, Hadoop: CDH5

But when I don't change the logdir and keep it as
`logdir = TFNode.hdfs_path(ctx, args.model)` (where `args.model` is `hdfs://namenode:8020/user/shay/mnist_model`), the process halts and one of the node logs prints:

```
Process Process-2:
Traceback (most recent call last):
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/pyfiles/mnist_dist.py", line 121, in map_fun
    save_model_secs=10)
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in __init__
    self._summary_writer = _summary.FileWriter(self._logdir)
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in __init__
    filename_suffix)
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3802/container_1504184820317_3802_01_000003/STF/stf/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented
```

I assume that the rest of the workers are waiting for something to happen.

Can you please help me understand what happened, and how to save the TensorFlow model into HDFS?
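For context, the choice in the example is between no logdir at all and an HDFS one resolved via `TFNode.hdfs_path`. A minimal sketch of the kind of path resolution involved (this is a hypothetical helper illustrating the idea, not TFNode's actual implementation; the `default_fs` and `user` values are made-up examples):

```python
def resolve_model_path(model_arg, default_fs="hdfs://namenode:8020", user="shay"):
    """Hypothetical sketch of resolving a --model argument to a full HDFS URI.

    Already-qualified URIs pass through unchanged; relative paths are
    anchored under the user's HDFS home directory.
    """
    if model_arg.startswith(("hdfs://", "file://")):
        return model_arg
    return "%s/user/%s/%s" % (default_fs, user, model_arg.lstrip("/"))

print(resolve_model_path("mnist_model"))
# -> hdfs://namenode:8020/user/shay/mnist_model
```

Either way, the resulting `hdfs://` URI is only usable if the TensorFlow build on the executors can actually open the `hdfs` filesystem scheme, which is what fails here.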

shaygeller commented 7 years ago

Hey I tried the approach here: https://github.com/yahoo/TensorFlowOnSpark/issues/125

I reinstalled TensorFlow v1.3.0 and Python 2.7.12 and reran the mnist example. With `logdir=None` it worked; then I tried the original source line `logdir = TFNode.hdfs_path(ctx, args.model)` and it halts and throws this error:

17/09/04 16:16:29 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 20269@dt-dn4 17/09/04 16:16:29 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 17/09/04 16:16:30 INFO spark.SecurityManager: Changing view acls to: yarn,shay 17/09/04 16:16:30 INFO spark.SecurityManager: Changing modify acls to: yarn,shay 17/09/04 16:16:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay) 17/09/04 16:16:30 INFO spark.SecurityManager: Changing view acls to: yarn,shay 17/09/04 16:16:30 INFO spark.SecurityManager: Changing modify acls to: yarn,shay 17/09/04 16:16:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay) 17/09/04 16:16:31 INFO slf4j.Slf4jLogger: Slf4jLogger started 17/09/04 16:16:31 INFO Remoting: Starting remoting 17/09/04 16:16:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@dt-dn4.dt.local:39840] 17/09/04 16:16:31 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@dt-dn4.dt.local:39840] 17/09/04 16:16:31 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 39840. 
17/09/04 16:16:31 INFO storage.DiskBlockManager: Created local directory at /data1/yarn/nm/usercache/shay/appcache/application_1504184820317_3816/blockmgr-aa4657a5-937d-4f4f-8601-db906bcc4b9e 17/09/04 16:16:31 INFO storage.DiskBlockManager: Created local directory at /data2/yarn/nm/usercache/shay/appcache/application_1504184820317_3816/blockmgr-427b245b-82bc-499c-981a-291684f67ca9 17/09/04 16:16:31 INFO storage.DiskBlockManager: Created local directory at /data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3816/blockmgr-d14229bc-4561-459b-b64a-706cbfc257ff 17/09/04 16:16:31 INFO storage.DiskBlockManager: Created local directory at /data4/yarn/nm/usercache/shay/appcache/application_1504184820317_3816/blockmgr-c64bddb5-bcbe-4c70-9d9e-8cf6d4b23cd0 17/09/04 16:16:31 INFO storage.DiskBlockManager: Created local directory at /data5/yarn/nm/usercache/shay/appcache/application_1504184820317_3816/blockmgr-b5825fa7-6d1f-4f26-a0b2-808c3c58728e 17/09/04 16:16:31 INFO storage.MemoryStore: MemoryStore started with capacity 14.0 GB 17/09/04 16:16:31 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@132.72.80.27:39954 17/09/04 16:16:31 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 17/09/04 16:16:31 INFO executor.Executor: Starting executor ID 4 on host dt-dn4.dt.local 17/09/04 16:16:31 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36468. 17/09/04 16:16:31 INFO netty.NettyBlockTransferService: Server created on 36468 17/09/04 16:16:31 INFO storage.BlockManager: external shuffle service port = 7337 17/09/04 16:16:31 INFO storage.BlockManagerMaster: Trying to register BlockManager 17/09/04 16:16:31 INFO storage.BlockManagerMaster: Registered BlockManager 17/09/04 16:16:31 INFO storage.BlockManager: Registering executor with local external shuffle service. 
17/09/04 16:16:32 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1 17/09/04 16:16:32 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1) 17/09/04 16:16:33 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2 17/09/04 16:16:33 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.7 KB, free 14.0 GB) 17/09/04 16:16:33 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 127 ms 17/09/04 16:16:33 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.3 KB, free 14.0 GB) 2017-09-04 16:16:35,310 INFO (MainThread-20540) connected to server at ('132.72.80.27', 42917) 2017-09-04 16:16:35,313 INFO (MainThread-20540) TFSparkNode.reserve: {'authkey': '\x91\x15\n\x03\x1a?K\x95\xb5\xd1\x18\xcc\xe6\x9fWq', 'worker_num': 1, 'host': '132.72.80.29', 'tb_port': 0, 'addr': '/tmp/pymp-uMLy51/listener-Y7WX7X', 'ppid': 20482, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 36487} 2017-09-04 16:16:36,318 INFO (MainThread-20540) node: {'addr': ('132.72.80.26', 41475), 'task_index': 0, 'job_name': 'ps', 'authkey': '\x84/\x08\x10\x9a\x14A\x93\xb1\xb58\x1avkr\xe9', 'worker_num': 0, 'host': '132.72.80.26', 'ppid': 151591, 'port': 43521, 'tb_pid': 0, 'tb_port': 0} 2017-09-04 16:16:36,318 INFO (MainThread-20540) node: {'addr': '/tmp/pymp-uMLy51/listener-Y7WX7X', 'task_index': 0, 'job_name': 'worker', 'authkey': '\x91\x15\n\x03\x1a?K\x95\xb5\xd1\x18\xcc\xe6\x9fWq', 'worker_num': 1, 'host': '132.72.80.29', 'ppid': 20482, 'port': 36487, 'tb_pid': 0, 'tb_port': 0} 2017-09-04 16:16:36,318 INFO (MainThread-20540) node: {'addr': '/tmp/pymp-mIm_TK/listener-J55rqy', 'task_index': 1, 'job_name': 'worker', 'authkey': '\xba\xd3\x0b\x0f\xb2\x0fM\xfb\xb1\xd4\xba\x0f\x8aw@P', 'worker_num': 2, 'host': '132.72.80.26', 'ppid': 151587, 'port': 40440, 'tb_pid': 0, 'tb_port': 0} 2017-09-04 16:16:36,318 INFO (MainThread-20540) node: {'addr': 
'/tmp/pymp-OAsUb9/listener-mBU_AP', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x1d\x9fo\xafVz@\xe0\xb6[\x1e\x82co p', 'worker_num': 3, 'host': '132.72.80.26', 'ppid': 151589, 'port': 39239, 'tb_pid': 0, 'tb_port': 0} 2017-09-04 16:16:36,547 INFO (MainThread-20540) Starting TensorFlow worker:0 on cluster node 1 on background process 17/09/04 16:16:36 INFO python.PythonRunner: Times: total = 3195, boot = 1754, init = 181, finish = 1260 17/09/04 16:16:36 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 959 bytes result sent to driver 17/09/04 16:16:36 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4 17/09/04 16:16:36 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 4) 17/09/04 16:16:36 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3 17/09/04 16:16:36 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 7.1 KB, free 14.0 GB) 17/09/04 16:16:36 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 19 ms 17/09/04 16:16:36 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 14.0 KB, free 14.0 GB) 17/09/04 16:16:36 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/images/part-00000:0+9338236 17/09/04 16:16:36 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0 17/09/04 16:16:36 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB) 17/09/04 16:16:36 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 12 ms 17/09/04 16:16:36 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 379.1 KB, free 14.0 GB) 17/09/04 16:16:37 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/09/04 16:16:37 INFO Configuration.deprecation: mapred.task.id is deprecated. 
Instead, use mapreduce.task.attempt.id 17/09/04 16:16:37 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/09/04 16:16:37 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/09/04 16:16:37 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 17/09/04 16:16:37 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/labels/part-00000:0+204800 17/09/04 16:16:37 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1 17/09/04 16:16:37 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB) 17/09/04 16:16:37 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 15 ms 17/09/04 16:16:37 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 379.1 KB, free 14.0 GB) 2017-09-04 16:16:37,738 INFO (MainThread-20631) 1: ======== worker:0 ======== 2017-09-04 16:16:37,738 INFO (MainThread-20631) 1: Cluster spec: {'ps': ['132.72.80.26:43521'], 'worker': ['132.72.80.29:36487', '132.72.80.26:40440', '132.72.80.26:39239']} 2017-09-04 16:16:37,739 INFO (MainThread-20631) 1: Using CPU 2017-09-04 16:16:37.739908: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-04 16:16:37.739925: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-04 16:16:37.739932: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
2017-09-04 16:16:37.739937: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-04 16:16:37.739942: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 2017-09-04 16:16:37.750819: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 132.72.80.26:43521} 2017-09-04 16:16:37.750852: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:36487, 1 -> 132.72.80.26:40440, 2 -> 132.72.80.26:39239} 2017-09-04 16:16:37.754157: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:36487 2017-09-04 16:16:37,777 INFO (MainThread-20687) Connected to TFSparkNode.mgr on 132.72.80.29, ppid=20482, state='running' 2017-09-04 16:16:37,781 INFO (MainThread-20687) mgr.state='running' 2017-09-04 16:16:37,781 INFO (MainThread-20687) Feeding partition <generator object load_stream at 0x7f5d5a746eb0> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f5d5a41f250> tensorflow model path: hdfs://dt-nn1.dt.local:8020/user/shay/mnist_model #

```
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x00007f5d157b8d4d, pid=20631, tid=0x00007f5df38c3740
#
# JRE version: (8.0_112-b15) (build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# j  java.lang.Object.<init>()V+0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3816/container_1504184820317_3816_01_000005/hs_err_pid20631.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
17/09/04 16:16:39 INFO python.PythonRunner: Times: total = 2283, boot = -904, init = 984, finish = 2203
17/09/04 16:16:39 INFO python.PythonRunner: Times: total = 61, boot = 3, init = 8, finish = 50
```

Do you have a clue why this happens?

This is my conda environment (`(stf1) [shay@dt-edge TensorFlowOnSpark]$ conda list`):

```
# packages in environment at /home/shay/.conda/envs/stf1:
asn1crypto                0.22.0
backports.weakref         1.0rc1
bleach                    1.5.0
cffi                      1.10.0
cryptography              2.0.3
enum34                    1.1.6
funcsigs                  1.0.2
html5lib                  0.9999999
idna                      2.6
ipaddress                 1.0.18
Markdown                  2.6.9
mock                      2.0.0
numpy                     1.13.1
openssl                   1.0.2l          0
pbr                       3.1.1
pip                       9.0.1           py27_1
protobuf                  3.4.0
pycparser                 2.18
python                    2.7.12          1
readline                  6.2             2
scipy                     0.19.1
setuptools                36.4.0
setuptools                27.2.0          py27_0
six                       1.10.0
sqlite                    3.13.0          0
tensorflow                1.3.0
tensorflow-tensorboard    0.1.5
tk                        8.5.18          0
Werkzeug                  0.12.2
wheel                     0.29.0
wheel                     0.29.0          py27_0
zlib                      1.2.11          0
```

xuande commented 7 years ago

@shaygeller My environment is TensorFlow1.2.1 and Python2.7.13.

shaygeller commented 7 years ago

@xuande Thanks, I'll try downgrading TensorFlow to 1.2.1 and see what happens.

shaygeller commented 7 years ago

I still get the `UnimplementedError: File system scheme hdfs not implemented` error and the process halts.

This is my executing command:

```
PYSPARK_PYTHON=./STF/stf1/bin/python ${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue ${QUEUE} \
  --num-executors 4 \
  --executor-memory 27G \
  --py-files tfspark.zip,examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./STF/stf1/bin/python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
  --archives hdfs:///user/${USER}/stf1.zip#STF \
  /home/shay/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --mode train \
  --output user/shay/model \
  --model hdfs:///user/${USER}/mnist_model
```

where stf1.zip is a zipped conda environment with Python 2.7.13 and TensorFlow 1.2.1.

I exported the environment variables like this:

```
LIB_JVM=./usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server
LD_LIBRARY_PATH=/opt/cloudera/parcels/Anaconda-4.2.0/bin:/opt/cloudera/parcels/Anaconda-4.2.0/bin:/opt/sbt/bin:/sbin:/bin:/usr/sbin:/usr/bin:/home/shay/.local/bin:/home/shay/bin
LIB_HDFS=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/impala/lib
```

Is there anything else I need to do to be able to save the model to HDFS?

xuande commented 7 years ago

@shaygeller Try adding `--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}"` to your spark-submit. I assume that ${LIB_HDFS} is the path to your libhdfs.so, and also that it's available on each of the executors at that path?
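Putting the pieces together, the executor environment TensorFlow's HDFS support needs looks roughly like this (a sketch only: every path below is an example and varies by cluster; `JAVA_HOME` and the CDH lib directory here are assumptions, not values from this thread):

```shell
# Hypothetical env setup for TensorFlow HDFS access on executors.
# All paths are examples; adjust to your cluster layout.
export JAVA_HOME=/usr/java/jdk1.8.0_112            # example JDK location (assumption)
export LIB_JVM="$JAVA_HOME/jre/lib/amd64/server"   # directory containing libjvm.so
export LIB_HDFS=/opt/cloudera/parcels/CDH/lib64    # directory containing libhdfs.so (assumption)

# Then pass both the native-library path and the Hadoop jars to the
# executors on spark-submit:
#   --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
#   --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}"
```

The idea is that libhdfs.so needs libjvm.so on `LD_LIBRARY_PATH` to start an in-process JVM, and that JVM in turn needs the Hadoop jars on `CLASSPATH`; missing either piece can surface as the `hdfs` scheme being unavailable.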

shaygeller commented 7 years ago

Thanks @xuande, I added that line to my spark-submit, and now it looks like this:

```
PYSPARK_PYTHON=./STF/stf1/bin/python ${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue ${QUEUE} \
  --num-executors 4 \
  --executor-memory 27G \
  --py-files tfspark.zip,examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./STF/stf1/bin/python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
  --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
  --archives hdfs:///user/${USER}/stf1.zip#STF \
  /home/shay/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --mode train \
  --output user/shay/model \
  --model hdfs://dt-nn1.dt.local:8020/user/shay/mnist_model
```

and still got the `UnimplementedError: File system scheme hdfs not implemented` error.

This is the log from one of the executors:

17/09/05 11:11:44 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 31506@dt-dn3 17/09/05 11:11:44 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 17/09/05 11:11:44 INFO spark.SecurityManager: Changing view acls to: yarn,shay 17/09/05 11:11:44 INFO spark.SecurityManager: Changing modify acls to: yarn,shay 17/09/05 11:11:44 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay) 17/09/05 11:11:45 INFO spark.SecurityManager: Changing view acls to: yarn,shay 17/09/05 11:11:45 INFO spark.SecurityManager: Changing modify acls to: yarn,shay 17/09/05 11:11:45 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay) 17/09/05 11:11:45 INFO slf4j.Slf4jLogger: Slf4jLogger started 17/09/05 11:11:45 INFO Remoting: Starting remoting 17/09/05 11:11:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@dt-dn3.dt.local:40964] 17/09/05 11:11:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@dt-dn3.dt.local:40964] 17/09/05 11:11:45 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 40964. 
17/09/05 11:11:45 INFO storage.DiskBlockManager: Created local directory at /data1/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/blockmgr-04c98cea-d961-496f-a286-f1e320e87383 17/09/05 11:11:45 INFO storage.DiskBlockManager: Created local directory at /data2/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/blockmgr-9905a8a3-3e91-4ae7-8f80-65dd10967b84 17/09/05 11:11:45 INFO storage.DiskBlockManager: Created local directory at /data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/blockmgr-23f2bcb1-a98c-4f03-9e35-a418a8e7f9de 17/09/05 11:11:45 INFO storage.DiskBlockManager: Created local directory at /data4/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/blockmgr-b423f5ac-19a0-4446-836a-b1138219425a 17/09/05 11:11:45 INFO storage.DiskBlockManager: Created local directory at /data5/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/blockmgr-0c7004e4-9d3d-4b6e-bda5-f50af03f44af 17/09/05 11:11:45 INFO storage.MemoryStore: MemoryStore started with capacity 14.0 GB 17/09/05 11:11:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@132.72.80.27:35008 17/09/05 11:11:46 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 17/09/05 11:11:46 INFO executor.Executor: Starting executor ID 2 on host dt-dn3.dt.local 17/09/05 11:11:46 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33941. 17/09/05 11:11:46 INFO netty.NettyBlockTransferService: Server created on 33941 17/09/05 11:11:46 INFO storage.BlockManager: external shuffle service port = 7337 17/09/05 11:11:46 INFO storage.BlockManagerMaster: Trying to register BlockManager 17/09/05 11:11:46 INFO storage.BlockManagerMaster: Registered BlockManager 17/09/05 11:11:46 INFO storage.BlockManager: Registering executor with local external shuffle service. 
17/09/05 11:11:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1 17/09/05 11:11:47 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1) 17/09/05 11:11:47 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2 17/09/05 11:11:47 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 12.2 KB, free 14.0 GB) 17/09/05 11:11:47 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 141 ms 17/09/05 11:11:47 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 43.3 KB, free 14.0 GB) 2017-09-05 11:11:47,720 INFO (MainThread-31940) connected to server at ('132.72.80.27', 43416) 2017-09-05 11:11:47,721 INFO (MainThread-31940) TFSparkNode.reserve: {'authkey': '\xf2ozn\xe3\xd4E\x84\x9d\xber\x85\x94\xe1^e', 'worker_num': 1, 'host': '132.72.80.28', 'tb_port': 0, 'addr': '/tmp/pymp-AG368V/listener-7200xS', 'ppid': 31932, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 34537} 2017-09-05 11:11:49,728 INFO (MainThread-31940) node: {'addr': ('132.72.80.29', 45627), 'task_index': 0, 'job_name': 'ps', 'authkey': '+>$\xc7\\x9dKI\x8d\xf0L\xab\xdd\xc8:\xe0', 'worker_num': 0, 'host': '132.72.80.29', 'ppid': 103153, 'port': 35142, 'tb_pid': 0, 'tb_port': 0} 2017-09-05 11:11:49,728 INFO (MainThread-31940) node: {'addr': '/tmp/pymp-AG368V/listener-7200xS', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2ozn\xe3\xd4E\x84\x9d\xber\x85\x94\xe1^e', 'worker_num': 1, 'host': '132.72.80.28', 'ppid': 31932, 'port': 34537, 'tb_pid': 0, 'tbport': 0} 2017-09-05 11:11:49,728 INFO (MainThread-31940) node: {'addr': '/tmp/pymp-9LWIjd/listener-sQ4Ja', 'task_index': 1, 'job_name': 'worker', 'authkey': '\xa2/\xe8\xb1\xa35E\xca\x9c\xda\x14\x1a\xea\r\x16\xb5', 'worker_num': 2, 'host': '132.72.80.28', 'ppid': 31933, 'port': 39637, 'tb_pid': 0, 'tb_port': 0} 2017-09-05 11:11:49,728 INFO (MainThread-31940) node: {'addr': '/tmp/pymp-qvfF3v/listener-KX7vTg', 'task_index': 2, 
'job_name': 'worker', 'authkey': '\xa4\xc4<t\xcf\x92O\xc4\xaa\x0e\xc7\x14\xbb\xa6\x14\xd2', 'worker_num': 3, 'host': '132.72.80.29', 'ppid': 103151, 'port': 42380, 'tb_pid': 0, 'tb_port': 0} 2017-09-05 11:11:49,939 INFO (MainThread-31940) Starting TensorFlow worker:0 on cluster node 1 on background process 17/09/05 11:11:49 INFO python.PythonRunner: Times: total = 2458, boot = 191, init = 36, finish = 2231 17/09/05 11:11:49 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 959 bytes result sent to driver 17/09/05 11:11:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4 17/09/05 11:11:49 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 4) 17/09/05 11:11:50 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3 17/09/05 11:11:50 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 20.9 KB, free 14.0 GB) 17/09/05 11:11:50 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 13 ms 17/09/05 11:11:50 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 109.9 KB, free 14.0 GB) 17/09/05 11:11:50 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/images/part-00000:0+9338236 17/09/05 11:11:50 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0 17/09/05 11:11:50 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB) 17/09/05 11:11:50 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 12 ms 17/09/05 11:11:50 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 379.1 KB, free 14.0 GB) 17/09/05 11:11:50 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/09/05 11:11:50 INFO Configuration.deprecation: mapred.task.id is deprecated. 
Instead, use mapreduce.task.attempt.id 17/09/05 11:11:50 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/09/05 11:11:50 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/09/05 11:11:50 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 2017-09-05 11:11:50,958 INFO (MainThread-32112) 1: ======== worker:0 ======== 2017-09-05 11:11:50,959 INFO (MainThread-32112) 1: Cluster spec: {'ps': ['132.72.80.29:35142'], 'worker': ['132.72.80.28:34537', '132.72.80.28:39637', '132.72.80.29:42380']} 2017-09-05 11:11:50,959 INFO (MainThread-32112) 1: Using CPU 2017-09-05 11:11:50.960559: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-05 11:11:50.960574: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-05 11:11:50.960594: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 2017-09-05 11:11:50.960598: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-05 11:11:50.960602: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
D0905 11:11:50.963862527 32112 env_linux.c:77] Warning: insecure environment read function 'getenv' used 2017-09-05 11:11:50.971490: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 132.72.80.29:35142} 2017-09-05 11:11:50.971506: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:34537, 1 -> 132.72.80.28:39637, 2 -> 132.72.80.29:42380} 2017-09-05 11:11:50.973186: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:34537 17/09/05 11:11:51 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/labels/part-00000:0+204800 17/09/05 11:11:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1 17/09/05 11:11:51 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB) 17/09/05 11:11:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 16 ms 17/09/05 11:11:51 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 379.1 KB, free 14.0 GB) tensorflow model path: hdfs://dt-nn1.dt.local:8020/user/shay/mnist_model 2017-09-05 11:11:51,237 INFO (MainThread-32367) Connected to TFSparkNode.mgr on 132.72.80.28, ppid=31932, state='running' 2017-09-05 11:11:51,241 INFO (MainThread-32367) mgr.state='running' 2017-09-05 11:11:51,241 INFO (MainThread-32367) Feeding partition <generator object load_stream at 0x7f0472cf9640> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f0472fc6790> Process Process-2: Traceback (most recent call last): File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap self.run() File 
"/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/multiprocessing/process.py", line 114, in run self._target(*self._args, **self._kwargs) File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/pyfiles/mnist_dist.py", line 121, in map_fun save_model_secs=10) File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in init self._summary_writer = _summary.FileWriter(self._logdir) File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in init filename_suffix) File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in init gfile.MakeDirs(self._logdir) File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status) File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/contextlib.py", line 24, in exit self.gen.next() File "/data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3837/container_1504184820317_3837_01_000003/STF/stf1/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) UnimplementedError: File system scheme hdfs not implemented 17/09/05 11:11:53 
INFO python.PythonRunner: Times: total = 2631, boot = -967, init = 1031, finish = 2567 17/09/05 11:11:53 INFO python.PythonRunner: Times: total = 54, boot = 4, init = 6, finish = 44

leewyang commented 7 years ago

@shaygeller how did you install your tensorflow? Note: @xuande used pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl

Also, I see you used: LIB_JVM=./usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server Note that this path ends up being relative to the YARN container's "working directory" hosting the Spark executor, so I don't think you want the leading period.
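To illustrate the point about the leading period (a minimal sketch; the JDK path is just the one from this thread):

```python
import os.path

# "./usr/..." is a relative path: the dynamic linker resolves it against the
# YARN container's working directory, which does not contain the real JDK.
relative = "./usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server"
absolute = "/usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server"

print(os.path.isabs(relative))  # False
print(os.path.isabs(absolute))  # True
```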

shaygeller commented 7 years ago

@leewyang Thanks for the answer, but it's still not working. I deleted the dot in LIB_JVM, but I couldn't install TensorFlow 1.2.1 from the URL you posted, because pip returned:

HTTP error 404 while getting https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl
Could not install requirement tensorflow==1.2.1 from https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl because of error 404 Client Error: Not Found for url: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl
Could not install requirement tensorflow==1.2.1 from https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl because of HTTP error 404 Client Error: Not Found for url: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl for URL https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl

So I followed the instructions on the TensorFlow website and used: pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.3.0-cp27-none-linux_x86_64.whl

Notice that it's version 1.3.0, because I couldn't find a link for version 1.2.1 like the one you gave me.
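As an aside on the 404: the "cp27mu" in the original wheel name is the CPython 2.7 wide-unicode (UCS-4) ABI tag, and such a wheel only matches interpreters built with UCS-4; a UCS-2 ("cp27m") interpreter needs a different wheel. A quick way to see which build your interpreter is, sketched from first principles (not TensorFlowOnSpark code):

```python
import sys

# On CPython 2.7, sys.maxunicode distinguishes the two unicode builds:
#   0x10FFFF -> wide/UCS-4 build   -> matches "cp27mu" wheels
#   0xFFFF   -> narrow/UCS-2 build -> matches "cp27m" wheels
# (Python 3 interpreters always report 0x10FFFF.)
def abi_unicode_tag():
    return "mu (UCS-4)" if sys.maxunicode == 0x10FFFF else "m (UCS-2)"

if __name__ == "__main__":
    print(abi_unicode_tag())
```

If the tag reported by the conda env's python does not match the wheel's tag, pip will refuse the wheel (or, if forced, the extension can misbehave at runtime).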

Now I get the following error:

17/09/06 11:53:18 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 166140@dt-dn2.dt.local
17/09/06 11:53:18 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/09/06 11:53:18 INFO spark.SecurityManager: Changing view acls to: yarn,shay
17/09/06 11:53:18 INFO spark.SecurityManager: Changing modify acls to: yarn,shay
17/09/06 11:53:18 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay)
17/09/06 11:53:19 INFO spark.SecurityManager: Changing view acls to: yarn,shay
17/09/06 11:53:19 INFO spark.SecurityManager: Changing modify acls to: yarn,shay
17/09/06 11:53:19 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay)
17/09/06 11:53:19 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/09/06 11:53:19 INFO Remoting: Starting remoting
17/09/06 11:53:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@dt-dn2.dt.local:34306]
17/09/06 11:53:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@dt-dn2.dt.local:34306]
17/09/06 11:53:19 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 34306.
17/09/06 11:53:19 INFO storage.DiskBlockManager: Created local directory at /data1/yarn/nm/usercache/shay/appcache/application_1504184820317_3914/blockmgr-998d3469-2d6a-4dab-9355-acef4243c87d
17/09/06 11:53:19 INFO storage.DiskBlockManager: Created local directory at /data2/yarn/nm/usercache/shay/appcache/application_1504184820317_3914/blockmgr-06509722-818e-41ac-889f-21ea1121b96d
17/09/06 11:53:19 INFO storage.DiskBlockManager: Created local directory at /data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3914/blockmgr-1e35cb87-bcfd-45ec-a6eb-73d84df9a9c3
17/09/06 11:53:19 INFO storage.DiskBlockManager: Created local directory at /data4/yarn/nm/usercache/shay/appcache/application_1504184820317_3914/blockmgr-35dbab27-77d5-4984-850f-9304ff7dd616
17/09/06 11:53:19 INFO storage.DiskBlockManager: Created local directory at /data5/yarn/nm/usercache/shay/appcache/application_1504184820317_3914/blockmgr-59168449-45d3-4bee-824a-94f0d46c8457
17/09/06 11:53:19 INFO storage.MemoryStore: MemoryStore started with capacity 14.0 GB
17/09/06 11:53:20 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@132.72.80.26:46684
17/09/06 11:53:20 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/09/06 11:53:20 INFO executor.Executor: Starting executor ID 4 on host dt-dn2.dt.local
17/09/06 11:53:20 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41620.
17/09/06 11:53:20 INFO netty.NettyBlockTransferService: Server created on 41620
17/09/06 11:53:20 INFO storage.BlockManager: external shuffle service port = 7337
17/09/06 11:53:20 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/09/06 11:53:20 INFO storage.BlockManagerMaster: Registered BlockManager
17/09/06 11:53:20 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/09/06 11:53:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/09/06 11:53:21 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/09/06 11:53:21 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/09/06 11:53:21 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 12.2 KB, free 14.0 GB)
17/09/06 11:53:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 139 ms
17/09/06 11:53:21 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 43.3 KB, free 14.0 GB)
2017-09-06 11:53:21,876 INFO (MainThread-166690) connected to server at ('132.72.80.26', 34019)
2017-09-06 11:53:21,878 INFO (MainThread-166690) TFSparkNode.reserve: {'authkey': 'o\x91-\x7f\xb5\x17N~\xaf\x8bB\x8bi\xb0\xd4\x8e', 'worker_num': 1, 'host': '132.72.80.27', 'tb_port': 0, 'addr': '/tmp/pymp-e91Hy3/listener-3CJgEz', 'ppid': 166568, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 42294}
2017-09-06 11:53:23,884 INFO (MainThread-166690) node: {'addr': ('132.72.80.27', 38042), 'task_index': 0, 'job_name': 'ps', 'authkey': '\xd8\xb0\xaf\xc5m]F~\xa3?`\xb6\xe3\x80\xfa\xf0', 'worker_num': 0, 'host': '132.72.80.27', 'ppid': 166566, 'port': 34737, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 11:53:23,885 INFO (MainThread-166690) node: {'addr': '/tmp/pymp-e91Hy3/listener-3CJgEz', 'task_index': 0, 'job_name': 'worker', 'authkey': 'o\x91-\x7f\xb5\x17N~\xaf\x8bB\x8bi\xb0\xd4\x8e', 'worker_num': 1, 'host': '132.72.80.27', 'ppid': 166568, 'port': 42294, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 11:53:23,885 INFO (MainThread-166690) node: {'addr': '/tmp/pymp-B24ddP/listener-xJdLlK', 'task_index': 1, 'job_name': 'worker', 'authkey': '\x93\xe7\xa4s4EH<\x8c\xfd~\x1f\x9c\xdd\x18\x1a', 'worker_num': 2, 'host': '132.72.80.28', 'ppid': 106615, 'port': 43292, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 11:53:23,885 INFO (MainThread-166690) node: {'addr': '/tmp/pymp-T9gysm/listener-tZnD7T', 'task_index': 2, 'job_name': 'worker', 'authkey': 'eJ\\x1freF\xda\xb9p\x8b\xa1{5KX', 'worker_num': 3, 'host': '132.72.80.28', 'ppid': 106617, 'port': 38676, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 11:53:24,137 INFO (MainThread-166690) Starting TensorFlow worker:0 on cluster node 1 on background process
17/09/06 11:53:24 INFO python.PythonRunner: Times: total = 2610, boot = 294, init = 43, finish = 2273
17/09/06 11:53:24 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 959 bytes result sent to driver
17/09/06 11:53:24 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 5
17/09/06 11:53:24 INFO executor.Executor: Running task 2.0 in stage 1.0 (TID 5)
17/09/06 11:53:24 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/09/06 11:53:24 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 20.9 KB, free 14.0 GB)
17/09/06 11:53:24 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 13 ms
17/09/06 11:53:24 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 109.9 KB, free 14.0 GB)
17/09/06 11:53:24 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/images/part-00002:0+11214784
17/09/06 11:53:24 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/09/06 11:53:24 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB)
17/09/06 11:53:24 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 13 ms
17/09/06 11:53:24 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 379.1 KB, free 14.0 GB)
2017-09-06 11:53:24,626 INFO (MainThread-166868) 1: ======== worker:0 ========
2017-09-06 11:53:24,627 INFO (MainThread-166868) 1: Cluster spec: {'ps': ['132.72.80.27:34737'], 'worker': ['132.72.80.27:42294', '132.72.80.28:43292', '132.72.80.28:38676']}
2017-09-06 11:53:24,627 INFO (MainThread-166868) 1: Using CPU
2017-09-06 11:53:24.627561: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 11:53:24.627574: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 11:53:24.627594: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 11:53:24.627598: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 11:53:24.627602: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 11:53:24.641708: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 132.72.80.27:34737}
2017-09-06 11:53:24.641737: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:42294, 1 -> 132.72.80.28:43292, 2 -> 132.72.80.28:38676}
2017-09-06 11:53:24.645610: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:42294
tensorflow model path: hdfs://dt-nn1.dt.local:8020/user/shay/mnist_model
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007ffa913b8d5b, pid=166868, tid=0x00007ffb6ef16740
#
# JRE version: (8.0_112-b15) (build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# j  java.lang.Object.<init>()V+0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3914/container_1504184820317_3914_01_000005/hs_err_pid166868.log
#
17/09/06 11:53:24 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/09/06 11:53:24 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/09/06 11:53:24 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/09/06 11:53:24 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/09/06 11:53:24 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
17/09/06 11:53:25 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/labels/part-00002:0+245760
17/09/06 11:53:25 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/09/06 11:53:25 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB)
17/09/06 11:53:25 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 14 ms
17/09/06 11:53:25 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 379.1 KB, free 14.0 GB)
2017-09-06 11:53:25,371 INFO (MainThread-167038) Connected to TFSparkNode.mgr on 132.72.80.27, ppid=166568, state='running'
2017-09-06 11:53:25,376 INFO (MainThread-167038) mgr.state='running'
2017-09-06 11:53:25,376 INFO (MainThread-167038) Feeding partition <generator object load_stream at 0x7ffad6092230> into input queue <multiprocessing.queues.JoinableQueue object at 0x7ffad6093310>
17/09/06 11:53:27 INFO python.PythonRunner: Times: total = 2675, boot = -875, init = 955, finish = 2595
17/09/06 11:53:27 INFO python.PythonRunner: Times: total = 100, boot = 3, init = 8, finish = 89

Is there a problem with TensorFlowOnSpark's support for TensorFlow 1.3.0? Is there any way to install version 1.2.1? (I couldn't find a link similar to the one you gave me.)

xuande commented 7 years ago

@shaygeller You can find tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl here: http://mirrors.163.com/pypi/simple/tensorflow/

shaygeller commented 7 years ago

@xuande Thanks for the response, but it's still not working. I redid everything from the beginning: I created a conda env with conda create -n stf2 python=2.7.13, activated it, and installed TensorFlow with: pip install --upgrade http://mirrors.163.com/pypi/simple/tensorflow/tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl

Then I zipped it, uploaded it to HDFS, and ran:

PYSPARK_PYTHON=./STF/stf2/bin/python ${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue ${QUEUE} \
  --num-executors 4 \
  --executor-memory 27G \
  --py-files tfspark.zip,examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./STF/stf2/bin/python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
  --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
  --archives hdfs:///user/${USER}/stf2.zip#STF \
  /home/shay/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --mode train \
  --output user/shay/model \
  --model hdfs://dt-nn1.dt.local:8020/user/shay/mnist_model

where my environment variables are:

LIB_JVM=/usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server
LIB_HDFS=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib64
HADOOP_HOME=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41

and I still get the same Java error as before: Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

This is the complete log:

17/09/06 13:11:27 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 126959@dt-dn3
17/09/06 13:11:27 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/09/06 13:11:28 INFO spark.SecurityManager: Changing view acls to: yarn,shay
17/09/06 13:11:28 INFO spark.SecurityManager: Changing modify acls to: yarn,shay
17/09/06 13:11:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay)
17/09/06 13:11:28 INFO spark.SecurityManager: Changing view acls to: yarn,shay
17/09/06 13:11:28 INFO spark.SecurityManager: Changing modify acls to: yarn,shay
17/09/06 13:11:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, shay); users with modify permissions: Set(yarn, shay)
17/09/06 13:11:28 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/09/06 13:11:29 INFO Remoting: Starting remoting
17/09/06 13:11:29 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@dt-dn3.dt.local:36427]
17/09/06 13:11:29 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutorActorSystem@dt-dn3.dt.local:36427]
17/09/06 13:11:29 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 36427.
17/09/06 13:11:29 INFO storage.DiskBlockManager: Created local directory at /data1/yarn/nm/usercache/shay/appcache/application_1504184820317_3919/blockmgr-0a6eea4e-73ea-4744-899c-a1dd9470cc9e
17/09/06 13:11:29 INFO storage.DiskBlockManager: Created local directory at /data2/yarn/nm/usercache/shay/appcache/application_1504184820317_3919/blockmgr-aebf1be2-8c24-4d1c-ba14-6dc6bb12fb26
17/09/06 13:11:29 INFO storage.DiskBlockManager: Created local directory at /data3/yarn/nm/usercache/shay/appcache/application_1504184820317_3919/blockmgr-7bdc4b6c-cd67-42b2-8557-ce3de5a21e19
17/09/06 13:11:29 INFO storage.DiskBlockManager: Created local directory at /data4/yarn/nm/usercache/shay/appcache/application_1504184820317_3919/blockmgr-9f73983c-2c6d-4183-98ff-e4b9e955d964
17/09/06 13:11:29 INFO storage.DiskBlockManager: Created local directory at /data5/yarn/nm/usercache/shay/appcache/application_1504184820317_3919/blockmgr-0f5043e1-13e2-47bd-af87-a24083a7c857
17/09/06 13:11:29 INFO storage.MemoryStore: MemoryStore started with capacity 14.0 GB
17/09/06 13:11:29 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@132.72.80.26:44068
17/09/06 13:11:29 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/09/06 13:11:29 INFO executor.Executor: Starting executor ID 1 on host dt-dn3.dt.local
17/09/06 13:11:29 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35587.
17/09/06 13:11:29 INFO netty.NettyBlockTransferService: Server created on 35587
17/09/06 13:11:29 INFO storage.BlockManager: external shuffle service port = 7337
17/09/06 13:11:29 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/09/06 13:11:29 INFO storage.BlockManagerMaster: Registered BlockManager
17/09/06 13:11:29 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/09/06 13:11:35 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/09/06 13:11:35 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/09/06 13:11:35 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/09/06 13:11:35 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 12.2 KB, free 14.0 GB)
17/09/06 13:11:35 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 149 ms
17/09/06 13:11:35 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 43.3 KB, free 14.0 GB)
2017-09-06 13:11:36,041 INFO (MainThread-127501) connected to server at ('132.72.80.26', 36823)
2017-09-06 13:11:36,043 INFO (MainThread-127501) TFSparkNode.reserve: {'authkey': '\xfb9\xb7\x84M.O\xfd\x9d\x11Q\x87o\x0bG\x85', 'worker_num': 1, 'host': '132.72.80.28', 'tb_port': 0, 'addr': '/tmp/pymp-PPMmLr/listener-OazwAL', 'ppid': 127387, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 41431}
2017-09-06 13:11:39,051 INFO (MainThread-127501) node: {'addr': ('132.72.80.28', 37987), 'task_index': 0, 'job_name': 'ps', 'authkey': '\x92Pb]\x11\x1eBa\xbb\xe3l\xed\xed\x1f(\x80', 'worker_num': 0, 'host': '132.72.80.28', 'ppid': 127388, 'port': 33621, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 13:11:39,051 INFO (MainThread-127501) node: {'addr': '/tmp/pymp-PPMmLr/listener-OazwAL', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xfb9\xb7\x84M.O\xfd\x9d\x11Q\x87o\x0bG\x85', 'worker_num': 1, 'host': '132.72.80.28', 'ppid': 127387, 'port': 41431, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 13:11:39,051 INFO (MainThread-127501) node: {'addr': '/tmp/pymp-UKaC4f/listener-qd_A96', 'task_index': 1, 'job_name': 'worker', 'authkey': '\x87`e\x1e\xa7\x87@\xbd\x9d\xff\xed\xf7\xbc\x12.V', 'worker_num': 2, 'host': '132.72.80.27', 'ppid': 185273, 'port': 44617, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 13:11:39,051 INFO (MainThread-127501) node: {'addr': '/tmp/pymp-b0Mjk9/listener-4OZtRv', 'task_index': 2, 'job_name': 'worker', 'authkey': '\xe7?B5\x93qG=\x906V\xae\xd1\xeb\xb2\xf4', 'worker_num': 3, 'host': '132.72.80.27', 'ppid': 185272, 'port': 42499, 'tb_pid': 0, 'tb_port': 0}
2017-09-06 13:11:39,268 INFO (MainThread-127501) Starting TensorFlow worker:0 on cluster node 1 on background process
17/09/06 13:11:39 INFO python.PythonRunner: Times: total = 3626, boot = 331, init = 53, finish = 3242
17/09/06 13:11:39 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 959 bytes result sent to driver
17/09/06 13:11:39 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 6
17/09/06 13:11:39 INFO executor.Executor: Running task 1.0 in stage 1.0 (TID 6)
17/09/06 13:11:39 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/09/06 13:11:39 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 20.9 KB, free 14.0 GB)
17/09/06 13:11:39 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 145 ms
17/09/06 13:11:39 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 109.9 KB, free 14.0 GB)
17/09/06 13:11:39 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/images/part-00001:0+11231804
17/09/06 13:11:39 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/09/06 13:11:39 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB)
17/09/06 13:11:39 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 14 ms
17/09/06 13:11:39 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 379.1 KB, free 14.0 GB)
17/09/06 13:11:40 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/09/06 13:11:40 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/09/06 13:11:40 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/09/06 13:11:40 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/09/06 13:11:40 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
2017-09-06 13:11:40,266 INFO (MainThread-127717) 1: ======== worker:0 ========
2017-09-06 13:11:40,266 INFO (MainThread-127717) 1: Cluster spec: {'ps': ['132.72.80.28:33621'], 'worker': ['132.72.80.28:41431', '132.72.80.27:44617', '132.72.80.27:42499']}
2017-09-06 13:11:40,266 INFO (MainThread-127717) 1: Using CPU
2017-09-06 13:11:40.277693: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 13:11:40.277722: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 13:11:40.277728: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 13:11:40.277732: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 13:11:40.277737: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 13:11:40.300715: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 132.72.80.28:33621}
2017-09-06 13:11:40.300745: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:41431, 1 -> 132.72.80.27:44617, 2 -> 132.72.80.27:42499}
2017-09-06 13:11:40.303880: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:41431
tensorflow model path: hdfs://dt-nn1.dt.local:8020/user/shay/mnist_model
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007ff49af6f03d, pid=127717, tid=140692059342656
#
# JRE version: (7.0_67-b01) (build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# j  java.lang.Object.<init>()V+0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data5/yarn/nm/usercache/shay/appcache/application_1504184820317_3919/container_1504184820317_3919_01_000002/hs_err_pid127717.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
17/09/06 13:11:40 INFO rdd.HadoopRDD: Input split: hdfs://dt-nn1.dt.local:8020/user/shay/mnist/csv/train/labels/part-00001:0+245760
17/09/06 13:11:40 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/09/06 13:11:40 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.4 KB, free 14.0 GB)
17/09/06 13:11:40 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 44 ms
17/09/06 13:11:40 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 379.1 KB, free 14.0 GB)
2017-09-06 13:11:40,751 INFO (MainThread-127913) Connected to TFSparkNode.mgr on 132.72.80.28, ppid=127387, state='running'
2017-09-06 13:11:40,755 INFO (MainThread-127913) mgr.state='running'
2017-09-06 13:11:40,755 INFO (MainThread-127913) Feeding partition <generator object load_stream at 0x7ff4d3500230> into input queue <multiprocessing.queues.JoinableQueue object at 0x7ff4d3501310>
17/09/06 13:11:43 INFO python.PythonRunner: Times: total = 2734, boot = -1118, init = 1196, finish = 2656
17/09/06 13:11:43 INFO python.PythonRunner: Times: total = 85, boot = 3, init = 10, finish = 72

leewyang commented 7 years ago

@shaygeller the fact that you're getting core dumps in Java seems to imply some fairly low-level issue, i.e. some incompatible lib or something. Does the hs_err_pid*.log show anything useful?
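When digging through an hs_err_pid*.log, the most useful lines are usually the signal line and the "Problematic frame". A small stdlib sketch for pulling those two fields out of the report text — the sample string below is invented for illustration, mimicking the reports pasted earlier:

```python
import re

def crash_summary(hs_err_text):
    """Extract the signal line and problematic frame from an hs_err report."""
    signal = re.search(r"#\s*(SIG\w+ .*)", hs_err_text)
    # The frame follows the "# Problematic frame:" header on the next line.
    frame = re.search(r"# Problematic frame:\s*\n#\s*(.*)", hs_err_text)
    return (signal.group(1).strip() if signal else None,
            frame.group(1).strip() if frame else None)

# Invented sample for illustration:
sample = """\
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007ffa913b8d5b, pid=166868, tid=0x00007ffb6ef16740
#
# Problematic frame:
# j  java.lang.Object.<init>()V+0
"""

if __name__ == "__main__":
    print(crash_summary(sample))
```

In the real log, the sections after these headers (dynamic libraries, environment variables) also show which libjvm.so and libhdfs.so the process actually loaded, which is often the incompatible-library culprit.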

shaygeller commented 7 years ago

@leewyang I was working on something else for a few days. I'll check it tomorrow. Thanks

leewyang commented 7 years ago

Closing due to inactivity. Feel free to reopen if needed.