yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

FailedPreconditionError: Environment variable HADOOP_HDFS_HOME not set #152

Closed: djygithub closed this issue 6 years ago

djygithub commented 6 years ago

Thanks for taking a look at this; it's probably a bonehead error on my part. I'm getting the above-mentioned error while training in a RHEL 7.3, HDP 2.6 environment using the TensorFlowOnSpark MNIST example. The plain TensorFlow MNIST example works just fine. Here's the code I used to invoke it:

```bash
export HADOOP_HOME=/usr/hdp/current/hadoop-client/
export HADOOP_PREFIX=/usr/hdp/current/hadoop-client/
export HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_COMMON_HOME=/usr/hdp/current/hadoop-client/
export HADOOP_MAPRED_HOME=/usr/hdp/current/hadoop-client/
export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec
export PYTHONPATH=/usr/bin/python
export HADOOP_PREFIX=/usr/hdp/current/hadoop-client
export LIB_HDFS=/usr/hdp/2.6.0.3-8/usr/lib:/usr/hdp/current/hadoop-client/lib/native:
export PYTHON_ROOT=/usr
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default
export LIB_JVM=$JAVA_HOME/lib/server
export SPARK_HOME=/usr/hdp/current/spark2-client
export CUDA_HOME="/root/Downloads/cuda"
export LIB_CUDA="/root/Downloads/cuda/lib64"
export LD_LIBRARY_PATH="/root/Downloads/cuda/lib64"

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --num-executors 4 \
  --executor-memory 4G \
  --py-files /home/TensorFlowOnSpark/tfspark.zip,/home/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --archives hdfs:///user/${USER}/python.zip#Python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \
  --driver-library-path=$LIB_CUDA \
  /home/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --steps 1 \
  --mode train \
  --model mnist_model
```

Following is a snip of the failing Executor's stderr:

```
FailedPreconditionError: Environment variable HADOOP_HDFS_HOME not set
```

Attached: [tfspark.three.b.withexports.txt](https://github.com/yahoo/TensorFlowOnSpark/files/1431266/tfspark.three.b.withexports.txt)

Thanks, please let me know if you need more information. Thanks again.

leewyang commented 6 years ago

If you're seeing that error in the executor logs, then the issue is that HADOOP_HDFS_HOME is not set on that node. So, you can either ensure the nodes have the proper env vars, or you can try using `--conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client`. Of course, this assumes that this path is correct across all of your executors.
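For example, the extra setting would be passed on the spark-submit command line roughly like this (a minimal sketch showing only the relevant option; the path is the HDP hadoop-hdfs-client location used in this thread, so substitute whatever is correct on your nodes):

```bash
# Propagate HADOOP_HDFS_HOME to every executor via Spark's executor environment.
# The path below is the HDP client location from this thread; adjust for your cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \
  ...   # remaining options unchanged
```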

djygithub commented 6 years ago

Thank you, I added the parameter and got further. Here's the new statement:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --num-executors 4 \
  --executor-memory 4G \
  --py-files /home/TensorFlowOnSpark/tfspark.zip,/home/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --archives hdfs:///user/${USER}/python.zip#Python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \
  --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \
  --driver-library-path=$LIB_CUDA \
  /home/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --steps 1 \
  --mode train \
  --model mnist_model
```

Got a failure concerning an uninitialized variable in one executor:

```
2017-10-31 11:09:11,184 INFO (MainThread-365061) 3: Using GPU: 0
2017-10-31 11:09:11.440554: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.207.55.138:42145}
2017-10-31 11:09:11.440580: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.207.55.138:37587, 1 -> 10.207.55.138:37917, 2 -> localhost:34862}
2017-10-31 11:09:11.441093: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:34862
tensorflow model path: hdfs://desktop-7lo7e26:8020/user/root/mnist_model
2017-10-31 11:09:16.464662: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session fef30be75e0ea89a with config:

INFO:tensorflow:Starting queue runners.
2017-10-31 11:10:05,649 INFO (MainThread-365061) Starting queue runners.
2017-10-31T11:10:05.649205 session ready
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.FailedPreconditionError'>, Attempting to use uninitialized value Variable
[[Node: Variable_S34 = _Send[T=DT_INT32, client_terminated=false, recv_device="/job:worker/replica:0/task:2/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-5033878749039652642, tensor_name="edge_243_Variable", _device="/job:ps/replica:0/task:0/cpu:0"]]
2017-10-31 11:10:05,675 INFO (MainThread-365061) Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.FailedPreconditionError'>, Attempting to use uninitialized value Variable
```

Thanks again for your prompt and courteous assistance.

leewyang commented 6 years ago
  1. Which version of TF are you using?
  2. Did you delete /user/root/mnist_model prior to re-running the job?
djygithub commented 6 years ago
  1. From `pip list`: tensorflow (1.2.0), tensorflow-gpu (1.2.0), tensorflowonspark (1.0.2)
  2. No, I don't believe it was allocated:

```
[root@DESKTOP-7LO7E26 ~]# hdfs dfs -ls /user/root
Found 4 items
drwxr-xr-x   - root root          0 2017-10-28 17:37 /user/root/.hiveJars
drwxr-xr-x   - root root          0 2017-10-31 11:42 /user/root/.sparkStaging
drwxr-xr-x   - root root          0 2017-10-29 07:55 /user/root/mnist
-rw-r--r--   1 root root  108565418 2017-10-29 07:19 /user/root/python.zip
[root@DESKTOP-7LO7E26 ~]#
```


leewyang commented 6 years ago

Hmm, this looks related... (and you did have a CUDA_ERROR_OUT_OF_MEMORY in your original logs).
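As background (this is a general TF 1.x option, not something prescribed in this thread): TensorFlow's default behavior is to grab nearly all GPU memory per process, so several processes sharing one small card easily collide. A minimal sketch of limiting that:

```python
# Minimal sketch, TF 1.x: keep a session from pre-allocating the whole GPU,
# so multiple processes sharing one card are less likely to hit CUDA_ERROR_OUT_OF_MEMORY.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                        # grow allocations on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.25   # or hard-cap the fraction

sess = tf.Session(config=config)
```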

djygithub commented 6 years ago

Yes indeed. Should this issue be closed as fixed, with the discussion continuing in the related issue? I thought they looked close. Thanks; here's some more log:

```
[root@DESKTOP-7LO7E26 ~]# nvidia-smi
Tue Oct 31 14:29:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 0000:01:00.0      On |                  N/A |
| 35%   27C    P8    35W /  75W |     53MiB /  1996MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3623    G   /usr/bin/X                                      51MiB |
+-----------------------------------------------------------------------------+
[root@DESKTOP-7LO7E26 ~]#
```

djygithub commented 6 years ago

This is how it looks when running TensorFlow-GPU directly; it also runs out of memory at the end:

```
[root@DESKTOP-7LO7E26 ~]# nvidia-smi
Tue Oct 31 14:31:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 0000:01:00.0      On |                  N/A |
| 35%   36C    P0    59W /  75W |   1920MiB /  1996MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3623    G   /usr/bin/X                                      45MiB |
|    0    440753    C   python                                        1871MiB |
+-----------------------------------------------------------------------------+
[root@DESKTOP-7LO7E26 ~]# nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,temperature.gpu,memory.total,memory.free,pstate --format=csv -l 1
timestamp, utilization.gpu [%], utilization.memory [%], temperature.gpu, memory.total [MiB], memory.free [MiB], pstate
2017/10/31 14:31:55.637, 88 %, 51 %, 37, 1996 MiB, 76 MiB, P0
2017/10/31 14:31:56.637, 87 %, 51 %, 38, 1996 MiB, 76 MiB, P0
^C[root@DESKTOP-7LO7E26 ~]#
```

```
[root@DESKTOP-7LO7E26 ~]# python mnist_deep6.py
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
2017-10-31 14:31:42.049822: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 14:31:42.049847: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 14:31:42.049851: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 14:31:42.049854: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 14:31:42.049857: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 14:31:42.151288: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-31 14:31:42.151591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.455
pciBusID 0000:01:00.0
Total memory: 1.95GiB
Free memory: 1.86GiB
2017-10-31 14:31:42.151604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-31 14:31:42.151607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y
2017-10-31 14:31:42.151614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
step 0, training accuracy 0.1
step 100, training accuracy 0.84
step 200, training accuracy 0.94
.
step 49800, training accuracy 1
step 49900, training accuracy 1
2017-10-31 14:39:20.133814: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 523.38MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-31 14:39:20.133858: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.82GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-31 14:39:20.133868: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 958.90MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-31 14:39:20.489422: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.63GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-31 14:39:20.489449: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.73GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
test accuracy 0.990857
[root@DESKTOP-7LO7E26 ~]#
```

leewyang commented 6 years ago

It looks like you're trying to run 4 executors on a single physical node with one GPU. This probably won't work all that well (hence the OOM errors you're seeing). Note: we generally assume one full GPU per executor. You can try going down to 1 worker (and no PS, since there's no need to communicate variables), and it should work.

djygithub commented 6 years ago

Thanks, I will give it a shot.

djygithub commented 6 years ago

I have been unable to find the parameters for "going down to 1 worker (and no PS, since there's no need to communicate variables)"; please help. I tried cutting the number of executors down to 1 and got an assertion at line 240 of TFCluster.py (`assert num_ps < num_executors`), but that is as far as I got. I upped the number of executors to 2 and received another OOM; however, the job is still running:

```
2017-11-01 08:54:11.058433: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 162.69M (170590208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
```

and

```
args: Namespace(batch_size=100, cluster_size=2, epochs=1, format='csv', images='mnist/csv/train/images', labels='mnist/csv/train/labels', mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1, tensorboard=False)
2017-11-01T08:54:01.215902 ===== Start zipping images and labels
```

Thanks again.

leewyang commented 6 years ago

Sorry, in that example, the num_ps is hard-coded to 1. You can just set it to 0.
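For reference, the edit being described looks roughly like this in the example driver script (an approximate sketch, not a verbatim diff; exact lines and surrounding code vary between TensorFlowOnSpark versions):

```python
# In examples/mnist/spark/mnist_spark.py (approximate): run workers only,
# with no parameter server, as suggested above for a single-GPU setup.
num_ps = 0   # was hard-coded to 1

cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size,
                        num_ps, args.tensorboard, TFCluster.InputMode.SPARK)
```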

djygithub commented 6 years ago

Thanks for the tip, I adjusted `num_ps`:

[screenshot from 2017-11-01 10-25-49]

Then started the jobs:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --num-executors 1 \
  --executor-memory 7G \
  --driver-cores 4 \
  --py-files /home/TensorFlowOnSpark/tfspark.zip,/home/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --archives hdfs:///user/${USER}/python.zip#Python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \
  --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \
  --driver-library-path=$LIB_CUDA \
  /home/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --steps 1 \
  --mode train \
  --model mnist_model
```

The first job completed for the first time:

[screenshot from 2017-11-01 10-18-57]

This is the current status; how long does this normally run? Thanks.

[screenshot from 2017-11-01 10-20-25]

djygithub commented 6 years ago

The job did not finish. I will try upgrading tensorflow, tensorflow-gpu, and tensorflowonspark and update the results. Thanks.

leewyang commented 6 years ago

Also try `--steps 1000`, and look at the "Executors" tab for the "stderr" logs to see what the node is doing...

djygithub commented 6 years ago

Thanks again for your help. I re-imaged the i7-7700 with CentOS 7:

```
[root@I77700 ~]# uname -a
Linux I77700 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
```

CUDA 8.0.61 (base plus patch) with NVIDIA driver 375.26-2, cuDNN 6.0 (April 27, 2017, for CUDA 8), tensorflow (1.3.0), tensorflow-gpu (1.3.0), tensorflow-tensorboard (0.1.8), tensorflowonspark (1.0.7), Python 2.7, HDP 2.6.3.

https://www.tensorflow.org/install/install_linux#the_url_of_the_tensorflow_python_package

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.3.0-cp27-none-linux_x86_64.whl

Latest attempt, only 3 exports, hoping that spark-submit will fill in the blanks:

```bash
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/targets/x86_64-linux/lib/
export CUDA_HOME=/root/Downloads/cuda
export LIB_CUDA=/usr/local/cuda-8.0/targets/x86_64-linux/lib/
```

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --num-executors 1 \
  --executor-memory 3G \
  --driver-cores 4 \
  --py-files /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --archives hdfs:///user/${USER}/Python.zip#Python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \
  --conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/current/hadoop-hdfs-client \
  --driver-library-path=$LIB_CUDA \
  /root/Downloads/TensorFlowOnSpark-master/examples/mnist/spark/mnist_spark.py \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --steps 1000 \
  --mode train \
  --model mnist_model
```

It's been running a short while; how long do 1,000 steps normally take? I'll let it cook over lunch. Thanks again.

```
17/11/02 10:50:15 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, I77700, partition 0,NODE_LOCAL, 2620 bytes)
17/11/02 10:50:15 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1681 ms on I77700 (1/1)
17/11/02 10:50:15 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/11/02 10:50:15 INFO DAGScheduler: ResultStage 0 (foreachPartition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:253) finished in 1.800 s
17/11/02 10:50:15 INFO DAGScheduler: Job 0 finished: foreachPartition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:253, took 1.838461 s
17/11/02 10:50:15 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on I77700:36534 (size: 6.7 KB, free: 1983.0 MB)
17/11/02 10:50:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on I77700:36534 (size: 29.8 KB, free: 1983.0 MB)
17/11/02 10:50:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on I77700:36534 (size: 29.8 KB, free: 1982.9 MB)
```

djygithub commented 6 years ago

Progress!

```
17/11/02 10:50:15 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on I77700:36534 (size: 6.7 KB, free: 1983.0 MB)
17/11/02 10:50:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on I77700:36534 (size: 29.8 KB, free: 1983.0 MB)
17/11/02 10:50:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on I77700:36534 (size: 29.8 KB, free: 1982.9 MB)
17/11/02 11:20:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 10.207.55.138:37183 in memory (size: 7.6 KB, free: 457.8 MB)
17/11/02 11:20:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on I77700:36534 in memory (size: 7.6 KB, free: 1982.9 MB)
17/11/02 11:20:11 INFO ContextCleaner: Cleaned accumulator 2
```

djygithub commented 6 years ago

Still running; here's a snapshot of activity from nvidia-smi:

```
2017/11/02 12:37:58.792, 0 %, 2 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:37:59.792, 0 %, 2 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:00.792, 0 %, 2 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:01.793, 0 %, 2 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:02.793, 0 %, 2 %, 24, 1995 MiB, 239 MiB, P8
```

```
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,temperature.gpu,memory.total,memory.free,pstate --format=csv -l 1
timestamp, utilization.gpu [%], utilization.memory [%], temperature.gpu, memory.total [MiB], memory.free [MiB], pstate
2017/11/02 12:38:15.713, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:16.714, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:17.714, 5 %, 5 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:18.715, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:19.715, 4 %, 5 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:20.715, 5 %, 5 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:21.715, 4 %, 5 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:22.716, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:23.716, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:24.716, 3 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:25.716, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:26.717, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:27.717, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:28.717, 4 %, 4 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:29.717, 2 %, 3 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:30.718, 1 %, 3 %, 24, 1995 MiB, 239 MiB, P8
2017/11/02 12:38:31.718, 1 %, 3 %, 24, 1995 MiB, 239 MiB, P8
^C[root@I77700 ~]#
```

and from Nmon

[screenshot from 2017-11-02 12-33-12]

and from Spark:

[screenshot from 2017-11-02 12-35-06]

Updated at 13:08

```
args: Namespace(batch_size=100, cluster_size=1, epochs=1, format='csv', images='mnist/csv/train/images', labels='mnist/csv/train/labels', mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False)
2017-11-02T10:50:13.572613 ===== Start zipping images and labels
2017-11-02 10:50:13,918 INFO (MainThread-3677) Reserving TFSparkNodes
2017-11-02 10:50:13,919 INFO (MainThread-3677) listening for reservations at ('10.207.55.138', 33864)
2017-11-02 10:50:13,919 INFO (MainThread-3677) Starting TensorFlow on executors
2017-11-02 10:50:13,924 INFO (MainThread-3677) Waiting for TFSparkNodes to start
2017-11-02 10:50:13,924 INFO (MainThread-3677) waiting for 1 reservations
2017-11-02 10:50:14,925 INFO (MainThread-3677) all reservations completed
2017-11-02 10:50:14,926 INFO (MainThread-3677) All TFSparkNodes started
2017-11-02 10:50:14,926 INFO (MainThread-3677) {'addr': '/tmp/pymp-7SggDH/listener-rKJMk7', 'task_index': 0, 'port': 33944, 'authkey': '\xe2\xe4\xf5\xfd\x17pOB\x99\xdaH\x8d\xe5\xe8>-', 'worker_num': 0, 'host': '10.207.55.138', 'ppid': 3883, 'job_name': 'worker', 'tb_pid': 0, 'tb_port': 0}
2017-11-02 10:50:14,926 INFO (MainThread-3677) Feeding training data
```

```
17/11/02 10:50:15 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1681 ms on I77700 (1/1)
17/11/02 10:50:15 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/11/02 10:50:15 INFO DAGScheduler: ResultStage 0 (foreachPartition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:253) finished in 1.800 s
17/11/02 10:50:15 INFO DAGScheduler: Job 0 finished: foreachPartition at /usr/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py:253, took 1.838461 s
17/11/02 10:50:15 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on I77700:36534 (size: 6.7 KB, free: 1983.0 MB)
17/11/02 10:50:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on I77700:36534 (size: 29.8 KB, free: 1983.0 MB)
17/11/02 10:50:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on I77700:36534 (size: 29.8 KB, free: 1982.9 MB)
17/11/02 11:20:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 10.207.55.138:37183 in memory (size: 7.6 KB, free: 457.8 MB)
17/11/02 11:20:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on I77700:36534 in memory (size: 7.6 KB, free: 1982.9 MB)
17/11/02 11:20:11 INFO ContextCleaner: Cleaned accumulator 2
```

leewyang commented 6 years ago

Something's still not quite right. That job shouldn't take more than a few minutes (even with CPU).

Have you tried the Spark Standalone instructions? You might want to try that (especially if you're just working on a single node).

Also look at the executor log (SparkUI -> Executors -> stderr) to see if there are any errors.
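For anyone following along, the Spark Standalone setup referenced above works roughly along these lines (a sketch based on the project's getting-started instructions; paths, worker counts, and memory are illustrative, not prescriptive):

```bash
# Start a local Spark Standalone master and worker, then submit against it.
# SPARK_HOME must point at an unpacked Spark distribution; values below are examples.
export MASTER=spark://$(hostname):7077
export CORES_PER_WORKER=1

${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-slave.sh -c ${CORES_PER_WORKER} -m 3G ${MASTER}

${SPARK_HOME}/bin/spark-submit \
  --master ${MASTER} \
  --py-files /path/to/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  /path/to/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --cluster_size 1 \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --mode train \
  --model mnist_model
```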

djygithub commented 6 years ago

Seems like a deadly embrace; is there a trace or some other tool I can use? The logs have not changed since the last post. I will look into standalone Spark; I was hoping to add another few pages to the HDFS/Spark/SSD Server Sizing Kit: Real Hardware, VMWare, & GCP Dataproc (http://davidjyoung.com/cmg/hdfssparksizing.pdf). Thanks, I'll get started.

djygithub commented 6 years ago

Erased Ambari/HDP, installed standalone Spark per the instructions, and installed TensorFlow and TensorFlowOnSpark:

tensorflow (1.4.0), tensorflow-tensorboard (0.4.0rc2), tensorflowonspark (1.0.7)

The CSV files loaded OK; I'm getting the following error during training:

```
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 09:15:12 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/11/04 09:15:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/04 09:15:12 INFO SecurityManager: Changing view acls to: root
17/11/04 09:15:12 INFO SecurityManager: Changing modify acls to: root
17/11/04 09:15:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/11/04 09:15:12 INFO SecurityManager: Changing view acls to: root
17/11/04 09:15:12 INFO SecurityManager: Changing modify acls to: root
17/11/04 09:15:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/11/04 09:15:13 INFO Slf4jLogger: Slf4jLogger started
17/11/04 09:15:13 INFO Remoting: Starting remoting
17/11/04 09:15:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@10.207.55.138:32781]
17/11/04 09:15:13 INFO Utils: Successfully started service 'sparkExecutorActorSystem' on port 32781.
17/11/04 09:15:13 INFO DiskBlockManager: Created local directory at /tmp/spark-629bc9c1-7b3b-4456-9d23-a5a20af02d44/executor-b86dc293-169e-439b-b355-393ef17fc4d6/blockmgr-299459cb-2cfd-46c3-a417-ea7cde4e5224
17/11/04 09:15:13 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
17/11/04 09:15:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.207.55.138:38663
17/11/04 09:15:13 INFO WorkerWatcher: Connecting to worker spark://Worker@10.207.55.138:40948
17/11/04 09:15:13 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
17/11/04 09:15:13 INFO Executor: Starting executor ID 0 on host I77700
17/11/04 09:15:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38765.
17/11/04 09:15:13 INFO NettyBlockTransferService: Server created on 38765
17/11/04 09:15:13 INFO BlockManagerMaster: Trying to register BlockManager
17/11/04 09:15:13 INFO BlockManagerMaster: Registered BlockManager
17/11/04 09:15:13 INFO CoarseGrainedExecutorBackend: Got assigned task 1
17/11/04 09:15:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/11/04 09:15:13 INFO Executor: Fetching http://10.207.55.138:35704/files/mnist_dist.py with timestamp 1509812111649
17/11/04 09:15:13 INFO Utils: Fetching http://10.207.55.138:35704/files/mnist_dist.py to /tmp/spark-629bc9c1-7b3b-4456-9d23-a5a20af02d44/executor-b86dc293-169e-439b-b355-393ef17fc4d6/spark-6d3bbdd4-d2a0-4ff2-9116-dd3c3ac4a295/fetchFileTemp3771729870112741147.tmp
17/11/04 09:15:13 INFO Utils: Copying /tmp/spark-629bc9c1-7b3b-4456-9d23-a5a20af02d44/executor-b86dc293-169e-439b-b355-393ef17fc4d6/spark-6d3bbdd4-d2a0-4ff2-9116-dd3c3ac4a295/-1455264511509812111649_cache to /root/Downloads/TensorFlowOnSpark-master/spark-1.6.0-bin-hadoop2.6/work/app-20171104091511-0002/0/./mnist_dist.py
17/11/04 09:15:13 INFO Executor: Fetching http://10.207.55.138:35704/files/mnist_spark.py with timestamp 1509812111645
17/11/04 09:15:13 INFO Utils: Fetching http://10.207.55.138:35704/files/mnist_spark.py to /tmp/spark-629bc9c1-7b3b-4456-9d23-a5a20af02d44/executor-b86dc293-169e-439b-b355-393ef17fc4d6/spark-6d3bbdd4-d2a0-4ff2-9116-dd3c3ac4a295/fetchFileTemp8911772127379555687.tmp
17/11/04 09:15:13 INFO Utils: Copying /tmp/spark-629bc9c1-7b3b-4456-9d23-a5a20af02d44/executor-b86dc293-169e-439b-b355-393ef17fc4d6/spark-6d3bbdd4-d2a0-4ff2-9116-dd3c3ac4a295/15581389421509812111645_cache to /root/Downloads/TensorFlowOnSpark-master/spark-1.6.0-bin-hadoop2.6/work/app-20171104091511-0002/0/./mnist_spark.py
17/11/04 09:15:13 INFO TorrentBroadcast: Started reading broadcast variable 2
17/11/04 09:15:13 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KB, free 7.3 KB)
17/11/04 09:15:13 INFO TorrentBroadcast: Reading broadcast variable 2 took 83 ms
17/11/04 09:15:13 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 10.5 KB, free 17.8 KB)
2017-11-04 09:15:14,070 INFO (MainThread-23106) connected to server at ('10.207.55.138', 36186)
2017-11-04 09:15:14,070 INFO (MainThread-23106) TFSparkNode.reserve: {'authkey': '\xb1h2\x85\x18VN\xf3\xbf\x17\xeb\x1a\xe5\x8c\xe0?', 'worker_num': 1, 'host': '10.207.55.138', 'tb_port': 0, 'addr': '/tmp/pymp-kS8mf2/listener-AuTUoe', 'ppid': 23084, 'task_index': 1, 'job_name': 'worker', 'tb_pid': 0, 'port': 40334}
2017-11-04 09:15:15,074 INFO (MainThread-23106) node: {'addr': '/tmp/pymp-gmhO_8/listener-LwKm1u', 'task_index': 0, 'job_name': 'worker', 'authkey': 'f\xd2\xbc^v(D>\x97\xcb\x01\x93\xcd\xf0X\x96', 'worker_num': 0, 'host': '10.207.55.138', 'ppid': 23086, 'port': 36640, 'tb_pid': 0, 'tb_port': 0}
2017-11-04 09:15:15,074 INFO (MainThread-23106) node: {'addr': '/tmp/pymp-kS8mf2/listener-AuTUoe', 'task_index': 1, 'job_name': 'worker', 'authkey': '\xb1h2\x85\x18VN\xf3\xbf\x17\xeb\x1a\xe5\x8c\xe0?', 'worker_num': 1, 'host': '10.207.55.138', 'ppid': 23084, 'port': 40334, 'tb_pid': 0, 'tb_port': 0}
2017-11-04 09:15:15,082 INFO (MainThread-23106) Starting TensorFlow worker:1 on cluster node 1 on background process
17/11/04 09:15:15 INFO PythonRunner: Times: total = 1286, boot = 240, init = 25, finish = 1021
17/11/04 09:15:15 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 992 bytes result sent to driver
17/11/04 09:15:15 INFO CoarseGrainedExecutorBackend: Got assigned task 2
17/11/04 09:15:15 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
17/11/04 09:15:15 INFO TorrentBroadcast: Started reading broadcast variable 3
17/11/04 09:15:15 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.5 KB, free 24.4 KB)
17/11/04 09:15:15 INFO TorrentBroadcast: Reading broadcast variable 3 took 8 ms
17/11/04 09:15:15 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 12.4 KB, free 36.8 KB)
17/11/04 09:15:15 INFO HadoopRDD: Input split: file:/root/Downloads/TensorFlowOnSpark-master/examples/mnist/csv/train/images/part-00000:0+9338236
17/11/04 09:15:15 INFO TorrentBroadcast: Started reading broadcast variable 0
17/11/04 09:15:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.3 KB, free 56.1 KB)
17/11/04 09:15:15 INFO TorrentBroadcast: Reading broadcast variable 0 took 8 ms
17/11/04 09:15:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 280.8 KB, free 336.9 KB)
17/11/04 09:15:15 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/11/04 09:15:15 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/11/04 09:15:15 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/11/04 09:15:15 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/11/04 09:15:15 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/11/04 09:15:15 INFO HadoopRDD: Input split: file:/root/Downloads/TensorFlowOnSpark-master/examples/mnist/csv/train/labels/part-00000:0+204800
17/11/04 09:15:15 INFO TorrentBroadcast: Started reading broadcast variable 1
17/11/04 09:15:15 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 19.3 KB, free 356.2 KB)
17/11/04 09:15:15 INFO TorrentBroadcast: Reading broadcast variable 1 took 6 ms
17/11/04 09:15:15 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 280.8 KB, free 637.0 KB)
2017-11-04 09:15:15,429 INFO (MainThread-23144) Connected to TFSparkNode.mgr on 10.207.55.138, ppid=23084, state='running'
2017-11-04 09:15:15,431 INFO (MainThread-23144) mgr.state='running'
2017-11-04 09:15:15,431 INFO (MainThread-23144) Feeding partition <generator object load_stream at 0x2895870> into input queue <multiprocessing.queues.JoinableQueue object at 0x28149d0>
17/11/04 09:15:16 INFO PythonRunner: Times: total = 1626, boot = -197, init = 200, finish = 1623
17/11/04 09:15:16 INFO PythonRunner: Times: total = 47, boot = 1, init = 3, finish = 43
2017-11-04 09:15:17,713 INFO (MainThread-23137) 1: ======== worker:1 ========
2017-11-04 09:15:17,713 INFO (MainThread-23137) 1: Cluster spec: {'worker': ['10.207.55.138:36640', '10.207.55.138:40334']}
2017-11-04 09:15:17,713 INFO (MainThread-23137) 1: Using CPU
2017-11-04 09:15:17.714540: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E1104 09:15:17.714859986 23137 ev_epoll1_linux.c:1051] grpc epoll fd: 12
2017-11-04 09:15:17.716540: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.207.55.138:36640, 1 -> localhost:40334}
2017-11-04 09:15:17.716890: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:40334
tensorflow model path: file:///root/Downloads/TensorFlowOnSpark-master/mnist_model
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "mnist_dist.py", line 121, in map_fun
    save_model_secs=10)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 336, in __init__
    self._verify_setup()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 885, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "hid_w" op: "VariableV2" attr { key: "container" value { s: "" } } attr { key: "dtype" value { type: DT_FLOAT } } attr { key: "shape" value { shape { dim { size: 784 } dim { size: 128 } } } } attr { key: "shared_name" value { s: "" } }
```

I'll reboot and try again, thanks.

Same results after reboot.
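A note on the ValueError above (background only, not a fix taken from this thread): tf.train.Supervisor's setup check requires every Variable to have an explicit device when running with replicas. In the distributed examples, variables normally get their device from tf.train.replica_device_setter, which pins them to "ps" tasks; with no ps job in the cluster spec, they can end up with no device assigned and trip this check. A minimal sketch of the usual placement pattern, with hypothetical host names:

```python
# Illustrative sketch only: variables pinned via replica_device_setter so each
# Variable has a device (normally a "ps" task) before tf.train.Supervisor
# verifies the setup. Host names below are hypothetical.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    hid_w = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1), name="hid_w")
```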

djygithub commented 6 years ago

Closing this issue as fixed since the original question was resolved, and opening a new issue. Thanks.