User test account import is failing

stvoutsin commented 9 months ago

Ceph share failure:

    "msg": "Error mounting /home/Surbron: mount error: no mds server is up or the cluster is laggy\n"
    ...
    "debug": {
        "script": "create-ceph-share.sh",
        "result": "FAIL",
        "messages": ["PASS: Share [iris-gaia-red-home-Surbron] created [cf338282-ab08-415e-a868-109066e0513f][creating]","PASS: Share [iris-gaia-red-home-Surbron][cf338282-ab08-415e-a868-109066e0513f] status [available]","PASS: Share [iris-gaia-red-home-Surbron][cf338282-ab08-415e-a868-109066e0513f] [ro] access created","PASS: Share [iris-gaia-red-home-Surbron][cf338282-ab08-415e-a868-109066e0513f] [rw] access created","FAIL: Ansible mount playbook failed"]
        }
    }

HDFS Error

    "hdfsspace": 
    mkdir: Call From zeppelin/10.10.0.194 to master01:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    chown: Call From zeppelin/10.10.0.194 to master01:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    {
    "path":  "/albert/Surbron",
    "owner": "Surbron",
    "group": "supergroup",
    "debug": {
        "script": "create-hdfs-space.sh",
        "result": "FAIL",
        "messages": ["FAIL: hdfs mkdir [/albert/Surbron] failed","FAIL: hdfs chown [/albert/Surbron] failed"]
        }
    }

Linux account error

    linuxuser: 
    touch: cannot touch '/home/Surbron/.ssh/authorized_keys': No such file or directory
    grep: /home/Surbron/.ssh/authorized_keys: No such file or directory
    /opt/aglais/bin/create-linux-user.sh: line 134: [: -eq: unary operator expected
    sed: can't read /home/Surbron/.ssh/authorized_keys: No such file or directory
    sed: can't read /home/Surbron/.ssh/authorized_keys: No such file or directory
    ls: cannot access '/home/Surbron': No such file or directory
    ...
    "debug": {
        "script": "create-linux-user.sh",
        "result": "FAIL",
        "messages": ["PASS: adduser [Surbron] done","FAIL: mkdir [/home/Surbron/.ssh] failed","mkdir: cannot create directory ‘/home/Surbron/.ssh’: No such file or directory","PASS: updated public keys for [Surbron] (sed)","FAIL: chown [/home/Surbron] failed","chown: cannot access '/home/Surbron': No such file or directory","FAIL: chmod [/home/Surbron] failed","chmod: cannot access '/home/Surbron': No such file or directory","FAIL: chown [/home/Surbron/.ssh] failed","chown: cannot access '/home/Surbron/.ssh': No such file or directory","FAIL: chmod [/home/Surbron/.ssh] failed","chmod: cannot access '/home/Surbron/.ssh': No such file or directory"]
        }
    }

stvoutsin commented 9 months ago

It looks like HDFS/YARN was not started correctly in this deployment:

    hdfs dfsadmin -report

    yarn areport: Call From zeppelin/10.10.0.194 to master01:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

stvoutsin commented 9 months ago

Issue may be related to #1304

Zarquan commented 9 months ago

There are two issues here, both network connection related: 1) the failure to mount a CephFS share, with the message "mount error: no mds server is up or the cluster is laggy", and 2) a "connection refused" error when trying to connect from one VM (zeppelin) to another (master01) within our deployment.

The Linux user error is caused directly by (1): if CephFS failed to mount /home/Surbron, then /home/Surbron/.ssh won't exist, causing the "No such file or directory" errors.
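The `[: -eq: unary operator expected` message at line 134 is consistent with an unquoted, empty variable inside a `[ ... ]` test, which is what you would get if the preceding `grep` produced no output because the file was missing. I have not checked the script itself, so the following is only a sketch of that failure mode. The function `count_key` and the example key are hypothetical and are not taken from create-linux-user.sh.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the real create-linux-user.sh.
# Count how many lines of a key file contain a given public key.
count_key() {
    local keyfile="$1"
    local pubkey="$2"
    local count
    # If the file is missing, grep prints nothing to stdout and count is empty.
    # If the file exists but has no match, grep prints 0 and exits non-zero.
    count=$(grep -c -F -- "${pubkey}" "${keyfile}" 2>/dev/null) || true
    # Fall back to 0 so callers always receive a number.
    echo "${count:-0}"
}

# Fragile form: with an empty count this expands to `[ -eq 0 ]`,
# which is reported as "[: -eq: unary operator expected".
#   if [ $count -eq 0 ]; then ...

# Safer form: quote the variable and give it a default value.
keyfile="/home/Surbron/.ssh/authorized_keys"
count=$(count_key "${keyfile}" "ssh-rsa EXAMPLE")
if [ "${count:-0}" -eq 0 ]; then
    echo "key not present in ${keyfile}"
fi
```

Quoting would not fix the missing home directory, but it would stop the script reporting a confusing secondary error when the mount has already failed.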

The "connection refused" error suggests that the HDFS service wasn't running, rather than a network issue interrupting the connection, so it probably has a different underlying cause.
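One way to tell a stopped service from a network problem is to try a plain TCP connection to the port and see whether it is rejected quickly or times out. The sketch below assumes bash (for its built-in `/dev/tcp` redirection) and the coreutils `timeout` command. The function name `check_port` and the 5 second limit are my own choices, not part of our deployment scripts.

```shell
#!/bin/bash
# Sketch: classify the result of a TCP connection attempt.
# check_port is a hypothetical helper, not part of the deployment.
check_port() {
    local host="$1"
    local port="$2"
    local rc=0
    # Open the connection in a child bash so the descriptor closes on exit.
    # timeout(1) gives up after 5 seconds and then exits with status 124.
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null || rc=$?
    case "${rc}" in
        0)   echo "open" ;;
        124) echo "timeout" ;;
        *)   echo "refused-or-unreachable" ;;
    esac
}
```

Running `check_port master01 9000` from the zeppelin node should help here: a quick "refused-or-unreachable" points at nothing listening on the port (or the name not resolving), while "timeout" points more towards the network.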

We have seen similar CephFS mount failures, so we probably have enough to report them to Cambridge tech support. Can you provide the date and time when the CephFS mount error occurred?

Zarquan commented 9 months ago

Is this a duplicate of #1268? Both are CephFS related, but they seem to have different error messages. Can you take a look at the output from the Ansible mount task in /tmp/test-users.json and see if there is more detail about the error?
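As a quick way to pull the relevant lines out of that file, something like the sketch below might help. It uses plain `grep` because only fragments of the JSON are shown in this issue, so the exact structure is an assumption; a JSON-aware tool would be more robust. The function name `list_failures` is hypothetical.

```shell
#!/bin/bash
# Sketch: list failure messages recorded in a test-users JSON report.
# list_failures is a hypothetical helper; it pattern-matches text
# rather than parsing the JSON.
list_failures() {
    local report="$1"
    # Print each "FAIL: ..." message string and each Ansible "msg" field.
    grep -o -E 'FAIL: [^"]*|"msg": "[^"]*"' "${report}"
}
```

For example, `list_failures /tmp/test-users.json` should print the "FAIL: Ansible mount playbook failed" entry along with any "msg" fields from the Ansible output.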

wfau/gaia-dmp#1305