yuvipanda / acs-jupyterhub

Set up a JupyterHub with ACS
BSD 3-Clause "New" or "Revised" License
1 star 2 forks

Error when trying to create a test cluster #10

Open ablekh opened 6 years ago

ablekh commented 6 years ago

Running the deploy.py script results in the following error. Any thoughts on why this might happen? I made some changes to our cluster configuration: updated the K8s major version to 1.8, renamed some attribute values, switched to the US East 2 region, and downgraded the K8s VM size from Standard_E4_v3 to Standard_E2_v3.

./deploy.py -s <SUBSCRIPTION_ID> -n sbdh-jh-v2 -d 1 -D 128

Traceback (most recent call last):
  File "./deploy.py", line 72, in <module>
    r = sp.check_output(cmd)
  File "/usr/lib64/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/lib64/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['acs-engine', 'generate', 'sbdh-jh-v2/cluster.json']' returned non-zero exit status 1.
ablekh commented 6 years ago

@yuvipanda @ryanlovett Any advice/help on the above?

ryanlovett commented 6 years ago

Do you have the acs-engine error? If not, can you manually run acs-engine generate sbdh-jh-v2/cluster.json ?

ablekh commented 6 years ago

@ryanlovett Thanks for replying. The way I read the output above, an acs-engine error did occur (hence the exit status 1?) ... Let me run the command and see what happens ... One moment ...

ablekh commented 6 years ago

@ryanlovett Here is the output from running the acs-engine command you suggested:

INFO[0000] Error returned by LoadContainerService: Unknown JSON tag nodeStatusUpdateFrequency
FATA[0000] error validating generateCmd: error parsing the api model: Unknown JSON tag nodeStatusUpdateFrequency
ablekh commented 6 years ago

OK, I figured it out. The issue is known (as reported here: https://github.com/Azure/acs-engine/issues/2029) and a workaround has been proposed here: https://github.com/Azure/acs-engine/issues/2062. I'm going to give the workaround a try and report back here.
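For anyone else who hits this: as far as I can tell from Azure/acs-engine#2062, newer acs-engine versions stopped accepting nodeStatusUpdateFrequency as its own tag and instead take kubelet flags via a kubeletConfig map, so the cluster.json change is roughly of this shape (illustrative only; the exact placement and value should be checked against the linked issue):

```json
{
  "properties": {
    "orchestratorProfile": {
      "kubernetesConfig": {
        "kubeletConfig": {
          "--node-status-update-frequency": "10s"
        }
      }
    }
  }
}
```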

ablekh commented 6 years ago

UPDATE: The workaround works fine: the error above is gone and the deployment seems to be in progress (knocking on wood ...). BTW, could you clarify the intent of deploying four managed disks by default?

ryanlovett commented 6 years ago

Glad you sorted it out, and thanks for pointing to the workaround!

This repo was our way of preparing Kubernetes on Azure for a particular kind of JupyterHub deployment. For that we decided to use NFS user storage on four mirrored disks. This repo will eventually be obviated by AKS (if it hasn't been already! -- I haven't tried it recently).

ablekh commented 6 years ago

My pleasure! Thank you for your helpful advice and clarifications! I noticed one problem in the current codebase and will send a pull request soon. I plan to use AKS for the third version of our deployment (and will try it both with and without acs-engine). I'm wondering whether it might still be a good idea to use an acs-engine-driven AKS deployment for better customizability ...

P.S. Are you aware of the issue of slow disk attachment on Azure-based K8s (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/628)? Yuvi recommended looking at Azure Files - I took a look and plan to explore other options further - do you have any experience with that? I'm leaning toward NFS, though.

ryanlovett commented 6 years ago

Yes, Azure File attachment was much faster in limited testing. You should check whether the IOPS and other parameters are sufficient for the size of your cluster though.

At the time I tested, there was a permission problem where the attached volume wasn't readable by jovyan. It was due to be fixed when Azure was to pull in a newer version of k8s. For one deployment, @yuvipanda came up with a workaround to this where he mounted the volume with the necessary mount options somewhere in extended spawner logic. I don't recall where that code is though.

ablekh commented 6 years ago

@ryanlovett Much appreciate your clarifications. Will review the information and share my progress on this.

ablekh commented 6 years ago

Oops ... Just got the following error when deploying the test cluster. Any thoughts?

Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
Warning: Permanently added '<HUB_FQDN>' (ECDSA) to the list of known hosts.
setup.bash                                                                              100%  553    84.2KB/s   00:00
.ansible.cfg                                                                            100%   37     5.4KB/s   00:00
Warning: Permanently added '<HUB_FQDN>' (ECDSA) to the list of known hosts.
Cloning into 'k8s-nfs-ansible'...
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
id_rsa                                                                                  100% 1679   274.7KB/s   00:00
id_rsa.pub                                                                              100%  406    59.9KB/s   00:00
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
gpg: keyring `/tmp/tmp8e29rx4a/secring.gpg' created
gpg: keyring `/tmp/tmp8e29rx4a/pubring.gpg' created
gpg: requesting key 7BB9C367 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmp8e29rx4a/trustdb.gpg: trustdb created
gpg: key 7BB9C367: public key "Launchpad PPA for Ansible, Inc." imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
 [ERROR]:

PLAY [nfs_servers] *************************************************************

TASK [Gathering Facts] *********************************************************
fatal: [nfsserver]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'nfsserver,10.240.0.65' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey).\r\n", "unreachable": true}
        to retry, use: --limit @/home/sbdhhub/k8s-nfs-ansible/playbook.retry

PLAY RECAP *********************************************************************
nfsserver                  : ok=0    changed=0    unreachable=1    failed=0

Traceback (most recent call last):
  File "./deploy.py", line 157, in <module>
    sp.check_call(cmd)
  File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', 'sbdh-jh-v2/id_rsa', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-o', 'User=<HUB_USER>', '<HUB_FQDN>', 'sudo bash bootstrap/setup.bash sbdh-jh-v2']' returned non-zero exit status 1.
ryanlovett commented 6 years ago

Looks like nfsserver isn't accepting the ssh key. Can you:

ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN>

And then run ssh nfsserver ?

IIRC, I ran into a problem at one point where the "nfsserver" name stopped getting propagated into the cluster, so I could ssh if I specified the IP but not "nfsserver". When that happened I completed setup.bash manually. However, since the output above indicates that nfsserver was resolvable, I don't think that's the problem you're seeing.

ablekh commented 6 years ago

Thank you, Ryan. Will give it a try and let you know what happens ...

ablekh commented 6 years ago

UPDATE: Both commands were successful. Below is the redacted output.

Since Ansible is complaining about nfsserver being "unreachable", it seems there might be a significant delay before the nfsserver host becomes visible/reachable to Ansible. So I thought it might be worth a try to add some wait time somewhere (where?) ... What do you think?

$ ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN>
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.13.0-1012-azure x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

21 packages can be updated.
0 updates are security updates.

*** System restart required ***
<HUB_USER>@k8s-master-34182922-0:~$ ssh nfsserver
Welcome to Ubuntu 17.10 (GNU/Linux 4.13.0-39-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

1 package can be updated.
0 updates are security updates.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

<HUB_USER>@nfsserver:~$
ryanlovett commented 6 years ago

Seems like a timing issue -- as if nfsserver wasn't quite ready. To recover you should probably be able to run ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN> sudo bash bootstrap/setup.bash sbdh-jh-v2 which picks up where deploy left off.

ablekh commented 6 years ago

Thank you, will try. I'm glad we agree that it might be a timing issue. In that light, my suggestion of introducing a wait/retry seems to make sense, right? If so, where would the relevant code need to go?

ablekh commented 6 years ago

Just tried your suggestion for deployment recovery. Unfortunately, the longer command you suggested fails, while the plain ssh command works fine every time. Hmm, this is odd ...

ryanlovett commented 6 years ago

It's failing in the middle of running the Ansible playbook, so right before https://github.com/yuvipanda/acs-jupyterhub/blob/master/bootstrap/setup.bash#L14 is one possibility.

You might first try adding the following to https://github.com/yuvipanda/acs-jupyterhub/blob/master/bootstrap/.ansible.cfg:

[ssh_connection]
retries=3

ablekh commented 6 years ago

Thanks! Does your second suggestion (updating .ansible.cfg) require starting the deployment over, or just re-issuing the ssh ... sudo bash bootstrap/setup.bash ... command? I have just tried the latter and it failed.

ablekh commented 6 years ago

UPDATE: After some investigation, I have solved the previous issue. It was caused by the original value for the cluster admin user being specified not only in this repo (where I had changed it in our fork), but also in the matching Ansible repo, k8s-nfs-ansible. Once I realized that, I forked that repo too and updated the value to match our main (acs-engine) repo. Then I removed the resource group and the _output directory and restarted the deployment from scratch. It got further this time; however, the next issue is the following (I assume it is the one marked fatal). I have checked the playbook but could not find anything obviously incorrect. Please let me know your thoughts. P.S. My config uses 2 managed disks.

PLAY [nfs_servers] *************************************************************

TASK [Gathering Facts] *********************************************************
ok: [nfsserver]

TASK [install zfsutils-linux] **************************************************
[DEPRECATION WARNING]: State 'installed' is deprecated. Using state 'present'
instead.. This feature will be removed in version 2.9. Deprecation warnings can
 be disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [nfsserver]

TASK [install nfs server packages] *********************************************
changed: [nfsserver]

TASK [add mount path to /etc/exports] ******************************************
changed: [nfsserver]

TASK [create zfs filesystem] ***************************************************
changed: [nfsserver]

TASK [gather zfs facts] ********************************************************
ok: [nfsserver]

TASK [check if zfs dataset is mounted] *****************************************
skipping: [nfsserver] => (item={u'setuid': u'on', u'relatime': u'off', u'referenced': u'96K', u'logicalused': u'40K', u'zoned': u'off', u'primarycache': u'all', u'logbias': u'latency', u'creation': u'Tue May  1  7:33 2018', u'sync': u'standard', u'snapdev': u'hidden', u'copies': u'1', u'sharenfs': u'off', u'acltype': u'off', u'sharesmb': u'off', u'reservation': u'none', u'mountpoint': u'/export/pool0/homes', u'casesensitivity': u'sensitive', u'utf8only': u'off', u'usedbysnapshots': u'0', u'compressratio': u'1.00x', u'rootcontext': u'none', u'atime': u'on', u'compression': u'off', u'overlay': u'off', u'xattr': u'on', u'dedup': u'off', u'snapshot_limit': u'none', u'aclinherit': u'restricted', u'defcontext': u'none', u'readonly': u'off', u'version': u'5', u'written': u'96K', u'normalization': u'none', u'filesystem_limit': u'none', u'type': u'filesystem', u'secondarycache': u'all', u'logicalreferenced': u'40K', u'available': u'123G', u'used': u'96K', u'exec': u'on', u'refquota': u'none', u'refcompressratio': u'1.00x', u'quota': u'none', u'snapshot_count': u'none', u'fscontext': u'none', u'vscan': u'off', u'canmount': u'on', u'usedbyrefreservation': u'0', u'mounted': u'yes', u'recordsize': u'128K', u'usedbychildren': u'0', u'usedbydataset': u'96K', u'name': u'pool0/homes', u'mlslabel': u'none', u'redundant_metadata': u'all', u'filesystem_count': u'none', u'devices': u'on', u'nbmand': u'off', u'refreservation': u'none', u'context': u'none', u'checksum': u'on', u'snapdir': u'hidden'})

TASK [create mount path directory] *********************************************
ok: [nfsserver]

TASK [set permissions on /export] **********************************************
changed: [nfsserver]

TASK [export filesystem] *******************************************************
changed: [nfsserver]

PLAY [nfs_clients] *************************************************************

TASK [Gathering Facts] *********************************************************
ok: [k8s-master-34182922-0]
ok: [k8s-pool1-34182922-0]

TASK [install nfs client packages] *********************************************
ok: [k8s-pool1-34182922-0]
changed: [k8s-master-34182922-0]

TASK [create mount path] *******************************************************
changed: [k8s-pool1-34182922-0]
changed: [k8s-master-34182922-0]

TASK [mount nfs filesystem] ****************************************************
changed: [k8s-pool1-34182922-0]
fatal: [k8s-master-34182922-0]: FAILED! => {"changed": false, "msg": "Error mounting /mnt/homes: mount.nfs: access denied by server while mounting nfsserver:/export/pool0/homes\n"}

TASK [make /data] **************************************************************
changed: [k8s-pool1-34182922-0]

TASK [make symlink to /data/homes] *********************************************
changed: [k8s-pool1-34182922-0]

TASK [blacklist rpc gss kernel modules] ****************************************
changed: [k8s-pool1-34182922-0] => (item=rpcsec_gss_krb5)
changed: [k8s-pool1-34182922-0] => (item=auth_rpcgss)
        to retry, use: --limit @/home/<HUB_USER>/k8s-nfs-ansible/playbook.retry

PLAY RECAP *********************************************************************
k8s-master-34182922-0      : ok=3    changed=2    unreachable=0    failed=1
k8s-pool1-34182922-0       : ok=7    changed=5    unreachable=0    failed=0
nfsserver                  : ok=9    changed=5    unreachable=0    failed=0

Traceback (most recent call last):
  File "./deploy.py", line 157, in <module>
    sp.check_call(cmd)
  File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', 'sbdh-jh-v2/id_rsa', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-o', 'User=<HUB_USER>', '<HUB_FQDN>', 'sudo bash bootstrap/setup.bash sbdh-jh-v2']' returned non-zero exit status 1.
ryanlovett commented 6 years ago

Hmm, can you check /etc/exports on the NFS server? It should contain sufficient authorization for the NFS clients to mount volumes.

ablekh commented 6 years ago

Here is the listing for that file:

<HUB_USER>@nfsserver:~$ ls -l /etc | grep exports
-rw-r--r-- 1 root root     486 May  1 07:33 exports
yuvipanda commented 6 years ago

I'm extremely happy that @ryanlovett is able to help, @ablekh :) I don't consider the contents of this repository to be anything that will continue to work for anyone, so I'm pleasantly surprised it does :)

yuvipanda commented 6 years ago

And I've put up a notice in the README saying so.

ablekh commented 6 years ago

@yuvipanda Thank you for your kind words. Yes, @ryanlovett has been extremely helpful (still waiting to hear from him on the NFS issue). I appreciate his kind help as well as your work on this repo (I figured that, instead of reinventing the wheel, I'd rather be closer to the top by standing on the shoulders of giants :-).

Using your repo is part of version 2 of our deployment. As I said above, I plan to use AKS in version 3 (and will try it with and without acs-engine). On other fronts, I will be working on satisfying our requirements for different authentication mechanisms, persistence options, and Jupyter kernels (currently working on Octave), as well as adding a WrapSpawner menu (this will help reduce image sizes), and more.

ablekh commented 6 years ago

UPDATE: After an intensive and extensive investigation, I have figured out the underlying problem behind this (latest) NFS access issue. Basically, for some reason (I will look into it, but perhaps one of you knows why), the subnet eligible for NFS access (in /etc/exports) ended up not matching the subnet of the client host's actual IP address. This can easily be seen by comparing the relevant line of /etc/exports with the verbose output of a test mount command (or with dmesg).

<HUB_USER>@nfsserver:~$ cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
#
/export/pool0/homes 10.240.0.0/16(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
<HUB_USER>@k8s-master-34182922-0:~$ sudo mount -t nfs -v nfsserver:/export/pool0/homes /tmp/nfstest
mount.nfs: timeout set for Wed May  2 05:09:06 2018
mount.nfs: trying text-based options 'vers=4,addr=10.240.0.65,clientaddr=10.255.255.5'
mount.nfs: mount(2): Permission denied
mount.nfs: access denied by server while mounting nfsserver:/export/pool0/homes

There is a clear discrepancy between the relevant subnets: the export allows 10.240.0.0/16, while the client connects from 10.255.255.5.

Therefore, an immediate fix (which I have applied and successfully tested) is to update /etc/exports with the correct subnet value and re-export the shared directory. However, for an automated solution, we need to figure out what causes the subnet mismatch. Any help with this would be much appreciated.
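To illustrate the manual fix (a sketch; `exports_entry_for_client` is a hypothetical helper): given the client address that actually shows up in the `mount -v` output, the corrected /etc/exports line can be derived with the standard ipaddress module. After writing the line back, `sudo exportfs -ra` on the NFS server re-exports the share:

```python
import ipaddress


def exports_entry_for_client(path, client_ip, prefix=16,
                             opts="rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000"):
    """Hypothetical helper: build an /etc/exports entry whose subnet
    actually contains the observed client address."""
    # strict=False masks off the host bits, yielding the enclosing network
    net = ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False)
    return f"{path} {net}({opts})"
```

For the client address above, `exports_entry_for_client("/export/pool0/homes", "10.255.255.5")` yields an entry for 10.255.0.0/16; the prefix length and mount options here are copied from the existing exports line and would need to match the actual setup.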

ablekh commented 6 years ago

Finding the code that sets the incorrect (non-matching) subnet value was easier than I thought: it is set as a variable in the Ansible playbook of the k8s-nfs-ansible repo. Most likely, the best way to fix it (for reliable automation) is to pass the relevant value obtained from the Azure CLI to the Ansible playbook (within the deploy.py script).
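A sketch of that idea (the parsing half is shown here; `first_subnet_prefix` and the Ansible variable name below are hypothetical, not names from the actual repos):

```python
import json


def first_subnet_prefix(az_vnet_json):
    """Hypothetical helper: parse the JSON emitted by
    `az network vnet list -g <GROUP> -o json` and return the
    first subnet's address prefix."""
    vnets = json.loads(az_vnet_json)
    return vnets[0]["subnets"][0]["addressPrefix"]
```

deploy.py could then forward the value to the playbook, e.g. `ansible-playbook playbook.yml -e nfs_client_subnet=10.240.0.0/16` (the variable name is a placeholder for whatever the playbook actually uses).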

ablekh commented 6 years ago

Hmm ... the command az network vnet list -g sbdh_jh_v1 reports the "incorrect" subnet value:

...
    "subnets": [
      {
        "additionalProperties": {},
        "addressPrefix": "10.240.0.0/16",
...
ablekh commented 6 years ago

@ryanlovett @yuvipanda I would appreciate it if you could take a quick look at my three most recent comments above and share your thoughts. I don't have much expertise in how Kubernetes assigns subnet prefixes, etc.

ryanlovett commented 6 years ago

Hey @ablekh , I'm not sure why acs isn't using the correct subnet like it used to. This looks like an issue with Azure or acs-engine, not with k8s on Azure. I suppose one could set the export to a /8 subnet so that it encompasses both the intended and the actual NFS client ranges. That assumes there are no risks in doing so.
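For reference, that /8 variant would make the /etc/exports line look roughly like this (a sketch; the mount options are copied from the entry shown earlier in this thread, and note that widening the range also widens which hosts are allowed to mount the share):

```
/export/pool0/homes 10.0.0.0/8(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
```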

ablekh commented 6 years ago

Hey @ryanlovett , I appreciate your quick feedback on my request (not sure whether @yuvipanda has anything to add to your comment, but he is always welcome to). I will explore this topic a bit further.