Closed Genie-Liu closed 6 years ago
Has fixed it!
I run TensorFlowOnSpark on a Spark Standalone cluster (Multiple Host). Referening the "Getting Started TensorFlowOnSpark on Hadoop Cluster", I notice that Note: TensorFlow requires the paths to libcuda*.so, libjvm.so, and libhdfs.so libraries to be set in the spark.executorEnv.LD_LIBRARY_PATH
then after adding the path LD_LIBRARY_PATH
, another problem come: class Path not set.
then by adding this path parameter: spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)"
All the problem fixed!
Here's my command: ${SPARK_HOME}/bin/spark-submit \ --master ${MASTER} \ --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \ --conf spark.cores.max=${TOTAL_CORES} \ --conf spark.task.cpus=${CORES_PER_WORKER} \ --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH \ --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)" \ ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \ --cluster_size ${SPARK_WORKER_INSTANCES} \ --images examples/mnist/csv/train/images \ --labels examples/mnist/csv/train/labels \ --format csv \ --mode train \ --model mnist_model
@Genie-Liu glad you got it working! I've added a note about the spark.executorEnv.CLASSPATH
to the wiki to help others...
@leewyang I'm afraid that you also need to mention the path of libjvm.so and include it in
to spark.executorEnv.CLASSPATH
. I've tried only inlcuding libhdfs.so, but it's still hanging while feeding data. After I included libjvm.so, the problem can be solved.
@Genie-Liu thanks, it was actually already there. (Maybe it needs to be highlighted more).
@leewyang I know, but the page is in yarn mode. There's nothing related with libjvm.so in page GetStarted_Standalone. I have this problem in standalone mode. Maybe it should be also mentioned in page Standalone mode.
@Genie-Liu ah, missed that fact somehow. I've updated the wiki with a note. Thanks!
Here's my problem: it stucks while feeding training data.
my command: ${SPARK_HOME}/bin/spark-submit \ --master ${MASTER} \ --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \ --conf spark.cores.max=${TOTAL_CORES} \ --conf spark.task.cpus=${CORES_PER_WORKER} \ --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_HDFS \ ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \ --cluster_size ${SPARK_WORKER_INSTANCES} \ --images examples/mnist/csv/train/images \ --labels examples/mnist/csv/train/labels \ --format csv \ --mode train \ --model mnist_model
args: Namespace(batch_size=100, cluster_size=2, epochs=1, format='csv', images='examples/mnist/csv/train/images', labels='examples/mnist/csv/train/labels', mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False) 2018-06-28T11:20:15.374114 ===== Start zipping images and labels 2018-06-28 11:20:16,157 INFO (MainThread-40056) Reserving TFSparkNodes 2018-06-28 11:20:16,158 INFO (MainThread-40056) cluster_template: {'ps': range(0, 1), 'worker': range(1, 2)} 2018-06-28 11:20:16,162 INFO (MainThread-40056) listening for reservations at ('10.18.0.15', 60645) 2018-06-28 11:20:16,163 INFO (MainThread-40056) Starting TensorFlow on executors 2018-06-28 11:20:16,184 INFO (MainThread-40056) Waiting for TFSparkNodes to start 2018-06-28 11:20:16,184 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:17,186 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:18,187 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:19,188 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:20,190 INFO (MainThread-40056) all reservations completed 2018-06-28 11:20:20,190 INFO (MainThread-40056) All TFSparkNodes started 2018-06-28 11:20:20,190 INFO (MainThread-40056) {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:20:20,190 INFO (MainThread-40056) {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:20:20,191 INFO (MainThread-40056) Feeding training data
my cluster log:
first executor: Spark Executor Command: "/usr/local/jdk1.8.0_151/bin/java" "-cp" "/home/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars/:/home/hadoop/hadoop-2.7.2/etc/hadoop/:/home/hadoop/hadoop-2.7.2/share/hadoop/common/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/common/:/home/hadoop/hadoop-2.7.2/share/hadoop/hdfs/:/home/hadoop/hadoop-2.7.2/share/hadoop/hdfs/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/hdfs/:/home/hadoop/hadoop-2.7.2/share/hadoop/yarn/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/yarn/:/home/hadoop/hadoop-2.7.2/share/hadoop/mapreduce/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/mapreduce/:/home/hadoop/hadoop-2.7.2/contrib/capacity-scheduler/.jar" "-Xmx1024M" "-Dspark.driver.port=55958" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark01.td.com:55958" "--executor-id" "1" "--hostname" "10.18.0.19" "--cores" "1" "--app-id" "app-20180628112012-0018" "--worker-url" "spark://Worker@10.18.0.19:32817"
========================================
SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] /home/hadoop/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from
float
tonp.floating
is deprecated. In future, it will be treated asnp.float64 == np.dtype(float).type
. from ._conv import register_converters as _register_converters 2018-06-28 11:20:21,737 INFO (MainThread-23473) connected to server at ('10.18.0.15', 60645) 2018-06-28 11:20:21,738 INFO (MainThread-23473) TFSparkNode.reserve: {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:20:22,741 INFO (MainThread-23473) node: {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:20:22,741 INFO (MainThread-23473) node: {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:20:22,742 INFO (MainThread-23473) Starting TensorFlow worker:0 as worker on cluster node 1 on background process 2018-06-28 11:20:22,756 INFO (MainThread-23487) 1: ======== worker:0 ======== 2018-06-28 11:20:22,757 INFO (MainThread-23487) 1: Cluster spec: {'ps': ['10.18.0.26:38748'], 'worker': ['10.18.0.19:47963']} 2018-06-28 11:20:22,757 INFO (MainThread-23487) 1: Using CPU 2018-06-28 11:20:22.760534: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2018-06-28 11:20:22.771309: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.18.0.26:38748} 2018-06-28 11:20:22.771347: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:47963} 2018-06-28 11:20:22.779980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:47963 tensorflow model path: hdfs://10.18.0.15/user/hadoop/mnist_model 2018-06-28 11:20:24,207 INFO (MainThread-23605) Connected to TFSparkNode.mgr on 10.18.0.19, executor=1, state='running' 2018-06-28 11:20:24,216 INFO (MainThread-23605) mgr.state='running' 2018-06-28 11:20:24,216 INFO (MainThread-23605) Feeding partition <itertools.chain object at 0x7f0535ffb048> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f0536010ba8>second executor: Spark Executor Command: "/usr/local/jdk1.8.0_151/bin/java" "-cp" "/home/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars/:/home/hadoop/hadoop-2.7.2/etc/hadoop/:/etc/hadoop/conf/:/home/hadoop/hadoop-2.7.2/lib/:/home/hadoop/hadoop-2.7.2/.//:/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/libexec/../../hadoop-yarn/lib/:/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/libexec/../../hadoop-yarn/.//:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*" "-Xmx1024M" "-Dspark.driver.port=55958" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark01.td.com:55958" "--executor-id" "0" "--hostname" "10.18.0.26" "--cores" "1" "--app-id" "app-20180628112012-0018" "--worker-url" "spark://Worker@10.18.0.26:46056"
========================================
/home/hadoop/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from
float
tonp.floating
is deprecated. In future, it will be treated asnp.float64 == np.dtype(float).type
. from ._conv import register_converters as _register_converters 2018-06-28 11:19:51,397 INFO (MainThread-21082) connected to server at ('10.18.0.15', 60645) 2018-06-28 11:19:51,399 INFO (MainThread-21082) TFSparkNode.reserve: {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:19:53,403 INFO (MainThread-21082) node: {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:19:53,403 INFO (MainThread-21082) node: {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:19:53,404 INFO (MainThread-21082) Starting TensorFlow ps:0 as ps on cluster node 0 on background process 2018-06-28 11:19:58,542 INFO (MainThread-21098) 0: ======== ps:0 ======== 2018-06-28 11:19:58,543 INFO (MainThread-21098) 0: Cluster spec: {'ps': ['10.18.0.26:38748'], 'worker': ['10.18.0.19:47963']} 2018-06-28 11:19:58,543 INFO (MainThread-21098) 0: Using CPU 2018-06-28 11:19:58.545791: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2018-06-28 11:19:58.554651: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:38748} 2018-06-28 11:19:58.554694: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.18.0.19:47963} 2018-06-28 11:19:58.561795: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:38748