Closed DingPengfei closed 7 years ago
Per your description, it sounds like host 35 exhibits intermittent failures w.r.t. libhdfs.so.
FWIW, we had to add some code to expand the Hadoop classpath for Spark 2.x + JNI. So, it sounds like either: 1) that code is returning slightly different classpaths on host 35 (vs. other nodes), OR 2) there's something intermittently mangling the LD_LIBRARY_PATH on that node, OR 3) something else entirely.
For 1 and 2, you can add some debug logs inside your mnist_dist.map_fun() to see what each TF process is seeing as its environment. For 3, you can send me the full YARN logs at leewyang@gmail.com (preferably with the log statements from 1 and 2) and I can take a look. And last but not least, if it's just that one node causing problems, perhaps re-imaging it as a new node might help (if that's a possibility).
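As a rough illustration of the kind of debug logging meant for 1 and 2, here is a minimal sketch. The helper names (describe_environment, find_library, log_environment) and the list of variables are assumptions made for this example, not part of TensorFlowOnSpark, so adjust them to whatever your cluster actually sets.

```python
import os
import socket

# Hypothetical selection of variables worth comparing across nodes.
DEFAULT_KEYS = ("LD_LIBRARY_PATH", "CLASSPATH", "HADOOP_HOME", "JAVA_HOME")


def describe_environment(environ=None, keys=DEFAULT_KEYS):
    """Return printable lines describing what this process sees."""
    environ = os.environ if environ is None else environ
    lines = ["host={} pid={}".format(socket.gethostname(), os.getpid())]
    for key in keys:
        lines.append("{}={}".format(key, environ.get(key, "<unset>")))
    return lines


def find_library(name, search_path):
    """Return the directories of a path-style string that contain `name`."""
    dirs = [d for d in search_path.split(os.pathsep) if d]
    return [d for d in dirs if os.path.isfile(os.path.join(d, name))]


def log_environment(environ=None):
    """Print the environment summary and where libhdfs.so was found."""
    environ = os.environ if environ is None else environ
    for line in describe_environment(environ):
        print(line)
    hits = find_library("libhdfs.so", environ.get("LD_LIBRARY_PATH", ""))
    print("libhdfs.so found in: {}".format(hits if hits else "<nowhere on LD_LIBRARY_PATH>"))
```

Calling log_environment() as the first statement of map_fun should make each TF process print these lines to its executor output, so they show up in the YARN logs for both successful and failed runs.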
Hi~ leewyang, I did the 2 steps as you said, and I compared the debug logs of the environment between successful and failed runs using a text-compare tool. They are exactly the same. I have decided to rebuild that node as a new one. Thanks for your help~ btw, I have sent the logs to you.
I have another question: I have installed tensorflowonspark as a module in Python, in the Python.zip "distribution" that can be shipped to the Spark executors at runtime. Why do I have to both install TensorFlowOnSpark and send tfspark.zip to the Spark executors?
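A comparison like the one described above can also be scripted. The following is only a sketch: the function names are hypothetical, and it assumes each environment dump is available as a list of text lines.

```python
import difflib


def diff_env_dumps(ok_lines, failed_lines):
    """Return unified-diff lines between two environment dumps.

    An empty list means the two dumps are identical.
    """
    return list(difflib.unified_diff(
        ok_lines, failed_lines,
        fromfile="success", tofile="failure", lineterm=""))


def path_entries_only_in(first, second, sep=":"):
    """Return entries of path string `first` that are missing from `second`."""
    second_set = set(second.split(sep))
    return [entry for entry in first.split(sep)
            if entry and entry not in second_set]
```

The second helper ignores ordering, which can make it easier to spot a single missing directory in a long LD_LIBRARY_PATH or classpath.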
You only need to do one or the other... I've updated the instructions to try to clarify a bit. Thanks for the question/feedback.
@DingPengfei I have the same problem as you (#125). Did you solve this problem? And if you did, could you share the solution with me? Thanks.
My problem only happened on one particular node. I still haven't figured out why it happened. As leewyang suggested, I just rebuilt that node as a new one.
@DingPengfei Thanks for your reply. This was caused by the version of tensorflow. After reinstalling it, it runs normally (#125).
That's great! I will try.
@xuande What are the versions of tensorflow and python that work for you? Were you able to save the model to HDFS? Thanks
Hi~ I ran the TensorFlowOnSpark MNIST example on a YARN cluster.
CentOS 7.3, Spark 1.6.0 (CDH), Python 2.7.12, Hadoop 2.6.0-cdh5.11.1, tensorflow (1.2.1), tensorflowonspark (1.0.0)
I encountered a weird problem. I ran the MNIST example about 50 times, and half of the runs failed. After looking through all the logs from the failed runs, I found:
I checked the LD_LIBRARY_PATH on that "35" host.
Type 1:
Type 2: