Open stvoutsin opened 9 months ago
It looks like HDFS/Yarn were not started correctly in this deploy:
hdfs dfsadmin -report
yarn areport: Call From zeppelin/10.10.0.194 to master01:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Issue may be related to #1304
Two issues, both network connection related:
1) The failure to mount a CephFS share with the message "mount error: no mds server is up or the cluster is laggy".
2) A "connection refused" error trying to connect from one VM (zeppelin) to another (master) within our deployment.
The Linux user error is caused directly by (1): if CephFS failed to mount /home/Surbron, then /home/Surbron/.ssh won't exist, causing "No such file or directory".
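To confirm that chain of cause and effect on the affected VM, something like the following could be used. It is only a sketch: `check_home` is a hypothetical helper name, not part of our deployment code, and it assumes the home directory is expected to be its own mount point.

```python
import os


def check_home(home):
    """Report whether a home directory exists, is a mount point, and has .ssh."""
    ssh_dir = os.path.join(home, ".ssh")
    return {
        "exists": os.path.isdir(home),       # the directory itself is present
        "is_mount": os.path.ismount(home),   # something is mounted on it
        "has_ssh": os.path.isdir(ssh_dir),   # .ssh exists inside it
    }
```

Calling `check_home("/home/Surbron")` and seeing `is_mount` and `has_ssh` both false would be consistent with the missing `.ssh` directory being a side effect of the failed mount rather than a separate problem.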
The connection refused error suggests that the HDFS service wasn't running rather than a network issue interrupting the connection. So there is probably a different underlying cause somewhere.
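One way to check the "refused rather than interrupted" reading is to probe the port directly. A refused connection normally means the host answered but nothing was listening on that port, whereas a network problem tends to show up as a timeout or an unreachable error. This is a minimal sketch, and `probe_port` is a hypothetical helper name.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no service is listening on this port.
        return "refused"
    except socket.timeout:
        # No reply at all: more likely a network or firewall problem.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or host unreachable.
        return f"error: {exc}"
```

Running `probe_port("master01", 9000)` from the zeppelin VM and getting `"refused"` would support the idea that the HDFS namenode was not running, while `"timeout"` would point back at the network.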
We have seen similar CephFS mount failures, so we probably have enough to report them to Cambridge tech support. Can you provide details of date and time when the CephFS mount error occurred?
Is this a duplicate of #1268?
Both are CephFS related but seem to have different error messages.
Can you take a look at the output from the Ansible mount task in /tmp/test-users.json and see if there is more detail about the error?
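If the JSON file is large, a small script can pull out just the failed results. This is a sketch under assumptions: I have not checked the layout of /tmp/test-users.json, so it walks the whole structure and only assumes that failed results are objects with `"failed": true` and a `msg` or `stderr` field, as Ansible task results usually are. `find_failures` is a hypothetical helper name.

```python
import json


def find_failures(node, path="$"):
    """Yield (path, message) for every object marked failed in nested JSON."""
    if isinstance(node, dict):
        if node.get("failed") is True:
            # Assumption: the error text lives in "msg" or "stderr".
            yield path, node.get("msg") or node.get("stderr") or "(no message)"
        for key, value in node.items():
            yield from find_failures(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_failures(value, f"{path}[{index}]")


def failures_in_file(filename):
    """Load a JSON file and return a list of its failed results."""
    with open(filename) as handle:
        return list(find_failures(json.load(handle)))
```

`failures_in_file("/tmp/test-users.json")` would then return the location and message of each failed result, which should be enough to paste into a support ticket.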
Ceph share failure:
HDFS Error
Linux account error