yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0
3.87k stars 937 forks source link

Running Standalone cluster stuck while Feeding training data #302

Closed Genie-Liu closed 6 years ago

Genie-Liu commented 6 years ago

Here's my problem: it stucks while feeding training data.

my command: ${SPARK_HOME}/bin/spark-submit \ --master ${MASTER} \ --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \ --conf spark.cores.max=${TOTAL_CORES} \ --conf spark.task.cpus=${CORES_PER_WORKER} \ --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_HDFS \ ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \ --cluster_size ${SPARK_WORKER_INSTANCES} \ --images examples/mnist/csv/train/images \ --labels examples/mnist/csv/train/labels \ --format csv \ --mode train \ --model mnist_model

args: Namespace(batch_size=100, cluster_size=2, epochs=1, format='csv', images='examples/mnist/csv/train/images', labels='examples/mnist/csv/train/labels', mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False) 2018-06-28T11:20:15.374114 ===== Start zipping images and labels 2018-06-28 11:20:16,157 INFO (MainThread-40056) Reserving TFSparkNodes 2018-06-28 11:20:16,158 INFO (MainThread-40056) cluster_template: {'ps': range(0, 1), 'worker': range(1, 2)} 2018-06-28 11:20:16,162 INFO (MainThread-40056) listening for reservations at ('10.18.0.15', 60645) 2018-06-28 11:20:16,163 INFO (MainThread-40056) Starting TensorFlow on executors 2018-06-28 11:20:16,184 INFO (MainThread-40056) Waiting for TFSparkNodes to start 2018-06-28 11:20:16,184 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:17,186 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:18,187 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:19,188 INFO (MainThread-40056) waiting for 2 reservations 2018-06-28 11:20:20,190 INFO (MainThread-40056) all reservations completed 2018-06-28 11:20:20,190 INFO (MainThread-40056) All TFSparkNodes started 2018-06-28 11:20:20,190 INFO (MainThread-40056) {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:20:20,190 INFO (MainThread-40056) {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:20:20,191 INFO (MainThread-40056) Feeding training data

my cluster log:

first executor: Spark Executor Command: "/usr/local/jdk1.8.0_151/bin/java" "-cp" "/home/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars/:/home/hadoop/hadoop-2.7.2/etc/hadoop/:/home/hadoop/hadoop-2.7.2/share/hadoop/common/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/common/:/home/hadoop/hadoop-2.7.2/share/hadoop/hdfs/:/home/hadoop/hadoop-2.7.2/share/hadoop/hdfs/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/hdfs/:/home/hadoop/hadoop-2.7.2/share/hadoop/yarn/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/yarn/:/home/hadoop/hadoop-2.7.2/share/hadoop/mapreduce/lib/:/home/hadoop/hadoop-2.7.2/share/hadoop/mapreduce/:/home/hadoop/hadoop-2.7.2/contrib/capacity-scheduler/.jar" "-Xmx1024M" "-Dspark.driver.port=55958" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark01.td.com:55958" "--executor-id" "1" "--hostname" "10.18.0.19" "--cores" "1" "--app-id" "app-20180628112012-0018" "--worker-url" "spark://Worker@10.18.0.19:32817"

========================================

SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] /home/hadoop/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters 2018-06-28 11:20:21,737 INFO (MainThread-23473) connected to server at ('10.18.0.15', 60645) 2018-06-28 11:20:21,738 INFO (MainThread-23473) TFSparkNode.reserve: {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:20:22,741 INFO (MainThread-23473) node: {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:20:22,741 INFO (MainThread-23473) node: {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:20:22,742 INFO (MainThread-23473) Starting TensorFlow worker:0 as worker on cluster node 1 on background process 2018-06-28 11:20:22,756 INFO (MainThread-23487) 1: ======== worker:0 ======== 2018-06-28 11:20:22,757 INFO (MainThread-23487) 1: Cluster spec: {'ps': ['10.18.0.26:38748'], 'worker': ['10.18.0.19:47963']} 2018-06-28 11:20:22,757 INFO (MainThread-23487) 1: Using CPU 2018-06-28 11:20:22.760534: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2018-06-28 11:20:22.771309: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.18.0.26:38748} 2018-06-28 11:20:22.771347: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:47963} 2018-06-28 11:20:22.779980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:47963 tensorflow model path: hdfs://10.18.0.15/user/hadoop/mnist_model 2018-06-28 11:20:24,207 INFO (MainThread-23605) Connected to TFSparkNode.mgr on 10.18.0.19, executor=1, state='running' 2018-06-28 11:20:24,216 INFO (MainThread-23605) mgr.state='running' 2018-06-28 11:20:24,216 INFO (MainThread-23605) Feeding partition <itertools.chain object at 0x7f0535ffb048> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f0536010ba8>

second executor: Spark Executor Command: "/usr/local/jdk1.8.0_151/bin/java" "-cp" "/home/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars/:/home/hadoop/hadoop-2.7.2/etc/hadoop/:/etc/hadoop/conf/:/home/hadoop/hadoop-2.7.2/lib/:/home/hadoop/hadoop-2.7.2/.//:/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/libexec/../../hadoop-yarn/lib/:/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/libexec/../../hadoop-yarn/.//:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*" "-Xmx1024M" "-Dspark.driver.port=55958" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark01.td.com:55958" "--executor-id" "0" "--hostname" "10.18.0.26" "--cores" "1" "--app-id" "app-20180628112012-0018" "--worker-url" "spark://Worker@10.18.0.26:46056"

========================================

/home/hadoop/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters 2018-06-28 11:19:51,397 INFO (MainThread-21082) connected to server at ('10.18.0.15', 60645) 2018-06-28 11:19:51,399 INFO (MainThread-21082) TFSparkNode.reserve: {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:19:53,403 INFO (MainThread-21082) node: {'executor_id': 0, 'host': '10.18.0.26', 'job_name': 'ps', 'task_index': 0, 'port': 38748, 'tb_pid': 0, 'tb_port': 0, 'addr': ('10.18.0.26', 43520), 'authkey': b'\xee\x9e\xc3NU4O\xa6\xb2\x0e\x85\xf5e\x87V\xe7'} 2018-06-28 11:19:53,403 INFO (MainThread-21082) node: {'executor_id': 1, 'host': '10.18.0.19', 'job_name': 'worker', 'task_index': 0, 'port': 47963, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-76cxpen5/listener-q_8y8zvw', 'authkey': b')\xae \t\x1c\xd5J\xbe\x90\xa9xy\xd8\xc6\x14f'} 2018-06-28 11:19:53,404 INFO (MainThread-21082) Starting TensorFlow ps:0 as ps on cluster node 0 on background process 2018-06-28 11:19:58,542 INFO (MainThread-21098) 0: ======== ps:0 ======== 2018-06-28 11:19:58,543 INFO (MainThread-21098) 0: Cluster spec: {'ps': ['10.18.0.26:38748'], 'worker': ['10.18.0.19:47963']} 2018-06-28 11:19:58,543 INFO (MainThread-21098) 0: Using CPU 2018-06-28 11:19:58.545791: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2018-06-28 11:19:58.554651: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:38748} 2018-06-28 11:19:58.554694: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.18.0.19:47963} 2018-06-28 11:19:58.561795: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:38748

Genie-Liu commented 6 years ago

Has fixed it!

I run TensorFlowOnSpark on a Spark Standalone cluster (Multiple Host). Referening the "Getting Started TensorFlowOnSpark on Hadoop Cluster", I notice that Note: TensorFlow requires the paths to libcuda*.so, libjvm.so, and libhdfs.so libraries to be set in the spark.executorEnv.LD_LIBRARY_PATH

then after adding the path LD_LIBRARY_PATH, another problem come: class Path not set. then by adding this path parameter: spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)" All the problem fixed!

Here's my command: ${SPARK_HOME}/bin/spark-submit \ --master ${MASTER} \ --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \ --conf spark.cores.max=${TOTAL_CORES} \ --conf spark.task.cpus=${CORES_PER_WORKER} \ --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \ --conf spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH \ --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)" \ ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \ --cluster_size ${SPARK_WORKER_INSTANCES} \ --images examples/mnist/csv/train/images \ --labels examples/mnist/csv/train/labels \ --format csv \ --mode train \ --model mnist_model

leewyang commented 6 years ago

@Genie-Liu glad you got it working! I've added a note about the spark.executorEnv.CLASSPATH to the wiki to help others...

Genie-Liu commented 6 years ago

@leewyang I'm afraid that you also need to mention the path of libjvm.so and include it in to spark.executorEnv.CLASSPATH. I've tried only inlcuding libhdfs.so, but it's still hanging while feeding data. After I included libjvm.so, the problem can be solved.

leewyang commented 6 years ago

@Genie-Liu thanks, it was actually already there. (Maybe it needs to be highlighted more).

Genie-Liu commented 6 years ago

@leewyang I know, but the page is in yarn mode. There's nothing related with libjvm.so in page GetStarted_Standalone. I have this problem in standalone mode. Maybe it should be also mentioned in page Standalone mode.

leewyang commented 6 years ago

@Genie-Liu ah, missed that fact somehow. I've updated the wiki with a note. Thanks!