ablekh opened this issue 6 years ago
@yuvipanda @ryanlovett Any advice/help on the above?
Do you have the acs-engine error? If not, can you manually run `acs-engine generate sbdh-jh-v2/cluster.json`?
@ryanlovett Thanks for replying. The way I'm reading the output above is that an acs-engine error occurred (exit status 1?) ... Let me run the command and see what happens ... One moment ...
@ryanlovett Here is the output from running the acs-engine command you suggested:
INFO[0000] Error returned by LoadContainerService: Unknown JSON tag nodeStatusUpdateFrequency
FATA[0000] error validating generateCmd: error parsing the api model: Unknown JSON tag nodeStatusUpdateFrequency
OK, I figured it out. The issue is known (as reported here: https://github.com/Azure/acs-engine/issues/2029) and a workaround has been proposed here: https://github.com/Azure/acs-engine/issues/2062. I'm going to give the workaround a try and report back here.
UPDATE: The workaround works fine - the error above is gone and the deployment seems to be in progress (knocking on wood ...). BTW, could you clarify for me the intent of deploying four managed disks by default?
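For reference, here is my reading of the linked workaround as a rough sketch; the exact JSON path of the offending key is my assumption -- adjust to wherever it appears in your API model:

```bash
# Strip the no-longer-recognized key from the API model, then regenerate.
# NOTE: the JSON path below is an assumption about where the key lives.
jq 'del(.properties.orchestratorProfile.kubernetesConfig.nodeStatusUpdateFrequency)' \
  sbdh-jh-v2/cluster.json > cluster.tmp.json \
  && mv cluster.tmp.json sbdh-jh-v2/cluster.json
acs-engine generate sbdh-jh-v2/cluster.json
```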
Glad you sorted it out, and thanks for pointing to the workaround!
This repo was our way of preparing Kubernetes on Azure for a particular kind of JupyterHub deployment. For that we decided to use NFS user storage on mirrored disks (4). This repo will eventually be obviated by AKS (if it hasn't already! -- I haven't tried it recently).
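If it helps to picture the layout: four disks in a mirrored ZFS setup would typically mean two 2-way mirror vdevs striped together, roughly like this (device names are placeholders, and I'd have to double-check whether the playbook builds two 2-way mirrors or one 4-way mirror):

```bash
# Sketch of a 4-disk mirrored ZFS pool (two mirror vdevs, striped).
# /dev/sdc../dev/sdf are placeholders for the attached managed disks.
sudo zpool create pool0 \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf
# The playbook output later in this thread shows a pool0/homes dataset
# mounted under /export:
sudo zfs create -o mountpoint=/export/pool0/homes pool0/homes
```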
My pleasure! Thank you for your helpful advice and clarifications! I noticed one problem within the current codebase - will send a pull request soon. I plan to use AKS for our third version of the deployment (will try with `acs-engine` and without). I'm wondering whether it might still be a good idea to use an `acs-engine`-driven AKS deployment for better customizability ...
P.S. Are you aware of the issue of slow disk attachment on Azure-based K8s (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/628)? Yuvi recommended looking at Azure Files; I took a look and plan to explore other options further. Do you have any experience with that? I'm leaning toward NFS, though.
Yes, Azure File attachment was much faster in limited testing. You should check whether the IOPS and other parameters are sufficient for the size of your cluster though.
At the time I tested, there was a permission problem where the attached volume wasn't readable by jovyan. It was due to be fixed when Azure was to pull in a newer version of k8s. For one deployment, @yuvipanda came up with a workaround to this where he mounted the volume with the necessary mount options somewhere in extended spawner logic. I don't recall where that code is though.
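The gist, if I remember right, was setting uid/gid mount options on the Azure File volume so jovyan (uid 1000) can read it. A sketch of what that might look like as a PersistentVolume -- all the names, the secret, and the exact options here are from memory/assumed, not the actual code:

```bash
# Sketch: an azureFile PersistentVolume whose mount options make the share
# readable by jovyan (uid/gid 1000). All names below are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: azurefile-home
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteMany"]
  mountOptions:            # CIFS options; this is the crux of the workaround
    - uid=1000
    - gid=1000
    - dir_mode=0755
    - file_mode=0755
  azureFile:
    secretName: azure-file-secret   # secret holding the storage account key
    shareName: homes
    readOnly: false
EOF
```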
@ryanlovett Much appreciate your clarifications. Will review the information and share my progress on this.
Oops ... Just got the following error when deploying the test cluster. Any thoughts?
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
Warning: Permanently added '<HUB_FQDN>' (ECDSA) to the list of known hosts.
setup.bash 100% 553 84.2KB/s 00:00
.ansible.cfg 100% 37 5.4KB/s 00:00
Warning: Permanently added '<HUB_FQDN>' (ECDSA) to the list of known hosts.
Cloning into 'k8s-nfs-ansible'...
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
id_rsa 100% 1679 274.7KB/s 00:00
id_rsa.pub 100% 406 59.9KB/s 00:00
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
gpg: keyring `/tmp/tmp8e29rx4a/secring.gpg' created
gpg: keyring `/tmp/tmp8e29rx4a/pubring.gpg' created
gpg: requesting key 7BB9C367 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmp8e29rx4a/trustdb.gpg: trustdb created
gpg: key 7BB9C367: public key "Launchpad PPA for Ansible, Inc." imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
[ERROR]:
PLAY [nfs_servers] *************************************************************
TASK [Gathering Facts] *********************************************************
fatal: [nfsserver]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'nfsserver,10.240.0.65' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey).\r\n", "unreachable": true}
to retry, use: --limit @/home/sbdhhub/k8s-nfs-ansible/playbook.retry
PLAY RECAP *********************************************************************
nfsserver : ok=0 changed=0 unreachable=1 failed=0
Traceback (most recent call last):
File "./deploy.py", line 157, in <module>
sp.check_call(cmd)
File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', 'sbdh-jh-v2/id_rsa', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-o', 'User=<HUB_USER>', '<HUB_FQDN>', 'sudo bash bootstrap/setup.bash sbdh-jh-v2']' returned non-zero exit status 1.
Looks like nfsserver isn't accepting the ssh key. Can you:
ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN>
and then run `ssh nfsserver`?
IIRC, I ran into a problem at one point where the "nfsserver" name stopped getting propagated into the cluster, so I could ssh if I specified the IP but not "nfsserver". When that happened, I completed setup.bash manually. However, since the output above indicates that nfsserver was resolvable, I don't think that's the problem you're seeing.
Thank you, Ryan. Will give it a try and let you know what happens ...
UPDATE: Both commands were successful. Below is the redacted output.
Since Ansible is complaining that `nfsserver` is "unreachable", it seems there might be a significant enough delay before the `nfsserver` host becomes visible/reachable to Ansible. Thus, I thought it might be worth a try to add some wait time somewhere (where?) ... What do you think?
$ ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN>
Warning: Permanently added '<HUB_FQDN>,52.225.134.11' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.13.0-1012-azure x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
21 packages can be updated.
0 updates are security updates.
*** System restart required ***
<HUB_USER>@k8s-master-34182922-0:~$ ssh nfsserver
Welcome to Ubuntu 17.10 (GNU/Linux 4.13.0-39-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
1 package can be updated.
0 updates are security updates.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
<HUB_USER>@nfsserver:~$
Seems like a timing issue -- as if nfsserver wasn't quite ready. To recover you should probably be able to run
ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN> sudo bash bootstrap/setup.bash sbdh-jh-v2
which picks up where deploy left off.
Thank you, will try. I'm glad we agree that it might be a timing issue. In this light, it seems that my suggestion of introducing a wait/timeout makes sense, right? If so, where would we need to put the relevant code?
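For instance, something along these lines, run before deploy.py kicks off setup.bash (just a sketch -- the placement and the reachability check are my guesses):

```bash
# Hypothetical wait loop: keep probing until the master can reach nfsserver
# over ssh, before deploy.py proceeds to run bootstrap/setup.bash.
for i in $(seq 1 30); do
  if ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null \
      -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN> \
      'ssh -o BatchMode=yes nfsserver true' 2>/dev/null; then
    echo "nfsserver is reachable"
    break
  fi
  echo "waiting for nfsserver ($i/30) ..."
  sleep 10
done
```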
Just tried your suggestion for deployment recovery. Unfortunately, the longer command that you have suggested fails, while the original first one (just ssh) works fine every time. Hmm, this is odd ...
It's failing in the middle of running the Ansible playbook, so right before https://github.com/yuvipanda/acs-jupyterhub/blob/master/bootstrap/setup.bash#L14 is one possibility.
You might first try adding the following to https://github.com/yuvipanda/acs-jupyterhub/blob/master/bootstrap/.ansible.cfg:
[ssh_connection]
retries=3
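If you want to try that without redeploying, appending it on the master should work (the remote path is an assumption based on the file-transfer log above):

```bash
# Append the ssh retry setting to the Ansible config already on the master.
# ~/.ansible.cfg is an assumed path, per the scp output earlier in the thread.
ssh -i sbdh-jh-v2/id_rsa -o UserKnownHostsFile=/dev/null \
    -o StrictHostKeyChecking=no -o User=<HUB_USER> <HUB_FQDN> \
    'printf "\n[ssh_connection]\nretries=3\n" >> ~/.ansible.cfg'
```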
Thanks! Does your second suggestion (updating `.ansible.cfg`) require starting the deployment over, or just re-issuing the `ssh ... sudo bash bootstrap/setup.bash ...` command? Because I have just tried the latter and it failed.
UPDATE: After some investigation, I have solved the previous issue, which was due to the original value for the cluster admin user being specified not only in this repo (where I had changed it in our fork), but also in the matching Ansible repo `k8s-nfs-ansible`. Once I realized that, I forked that repo too and updated the value to match our main (acs-engine) repo. Then I removed the resource group and the `_output` directory and restarted the deployment process from scratch. It went further ahead; however, the next issue is the following (I guess, the one emphasized by the fatal label). I have checked the playbook but could not find anything obviously incorrect. Please let me know your thoughts. P.S. My config is 2 managed disks.
PLAY [nfs_servers] *************************************************************
TASK [Gathering Facts] *********************************************************
ok: [nfsserver]
TASK [install zfsutils-linux] **************************************************
[DEPRECATION WARNING]: State 'installed' is deprecated. Using state 'present'
instead.. This feature will be removed in version 2.9. Deprecation warnings can
be disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [nfsserver]
TASK [install nfs server packages] *********************************************
changed: [nfsserver]
TASK [add mount path to /etc/exports] ******************************************
changed: [nfsserver]
TASK [create zfs filesystem] ***************************************************
changed: [nfsserver]
TASK [gather zfs facts] ********************************************************
ok: [nfsserver]
TASK [check if zfs dataset is mounted] *****************************************
skipping: [nfsserver] => (item={u'setuid': u'on', u'relatime': u'off', u'referenced': u'96K', u'logicalused': u'40K', u'zoned': u'off', u'primarycache': u'all', u'logbias': u'latency', u'creation': u'Tue May 1 7:33 2018', u'sync': u'standard', u'snapdev': u'hidden', u'copies': u'1', u'sharenfs': u'off', u'acltype': u'off', u'sharesmb': u'off', u'reservation': u'none', u'mountpoint': u'/export/pool0/homes', u'casesensitivity': u'sensitive', u'utf8only': u'off', u'usedbysnapshots': u'0', u'compressratio': u'1.00x', u'rootcontext': u'none', u'atime': u'on', u'compression': u'off', u'overlay': u'off', u'xattr': u'on', u'dedup': u'off', u'snapshot_limit': u'none', u'aclinherit': u'restricted', u'defcontext': u'none', u'readonly': u'off', u'version': u'5', u'written': u'96K', u'normalization': u'none', u'filesystem_limit': u'none', u'type': u'filesystem', u'secondarycache': u'all', u'logicalreferenced': u'40K', u'available': u'123G', u'used': u'96K', u'exec': u'on', u'refquota': u'none', u'refcompressratio': u'1.00x', u'quota': u'none', u'snapshot_count': u'none', u'fscontext': u'none', u'vscan': u'off', u'canmount': u'on', u'usedbyrefreservation': u'0', u'mounted': u'yes', u'recordsize': u'128K', u'usedbychildren': u'0', u'usedbydataset': u'96K', u'name': u'pool0/homes', u'mlslabel': u'none', u'redundant_metadata': u'all', u'filesystem_count': u'none', u'devices': u'on', u'nbmand': u'off', u'refreservation': u'none', u'context': u'none', u'checksum': u'on', u'snapdir': u'hidden'})
TASK [create mount path directory] *********************************************
ok: [nfsserver]
TASK [set permissions on /export] **********************************************
changed: [nfsserver]
TASK [export filesystem] *******************************************************
changed: [nfsserver]
PLAY [nfs_clients] *************************************************************
TASK [Gathering Facts] *********************************************************
ok: [k8s-master-34182922-0]
ok: [k8s-pool1-34182922-0]
TASK [install nfs client packages] *********************************************
ok: [k8s-pool1-34182922-0]
changed: [k8s-master-34182922-0]
TASK [create mount path] *******************************************************
changed: [k8s-pool1-34182922-0]
changed: [k8s-master-34182922-0]
TASK [mount nfs filesystem] ****************************************************
changed: [k8s-pool1-34182922-0]
fatal: [k8s-master-34182922-0]: FAILED! => {"changed": false, "msg": "Error mounting /mnt/homes: mount.nfs: access denied by server while mounting nfsserver:/export/pool0/homes\n"}
TASK [make /data] **************************************************************
changed: [k8s-pool1-34182922-0]
TASK [make symlink to /data/homes] *********************************************
changed: [k8s-pool1-34182922-0]
TASK [blacklist rpc gss kernel modules] ****************************************
changed: [k8s-pool1-34182922-0] => (item=rpcsec_gss_krb5)
changed: [k8s-pool1-34182922-0] => (item=auth_rpcgss)
to retry, use: --limit @/home/<HUB_USER>/k8s-nfs-ansible/playbook.retry
PLAY RECAP *********************************************************************
k8s-master-34182922-0 : ok=3 changed=2 unreachable=0 failed=1
k8s-pool1-34182922-0 : ok=7 changed=5 unreachable=0 failed=0
nfsserver : ok=9 changed=5 unreachable=0 failed=0
Traceback (most recent call last):
File "./deploy.py", line 157, in <module>
sp.check_call(cmd)
File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', 'sbdh-jh-v2/id_rsa', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-o', 'User=<HUB_USER>', '<HUB_FQDN>', 'sudo bash bootstrap/setup.bash sbdh-jh-v2']' returned non-zero exit status 1.
Hmm, can you check /etc/exports on the NFS server? It should contain sufficient authorization for the NFS clients to mount volumes.
Here is the info on that directory:
<HUB_USER>@nfsserver:~$ ls -l /etc | grep exports
-rw-r--r-- 1 root root 486 May 1 07:33 exports
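(That only lists the file, of course; the commands below would show the actual contents and the active exports.)

```bash
# Inspect the exports file and what the NFS server currently advertises.
cat /etc/exports
sudo exportfs -v          # active exports with their effective options
showmount -e localhost    # the export list as a client would see it
```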
I'm extremely happy that @ryanlovett is able to help, @ablekh :) I don't consider the contents of this repository to be anything that would continue to work for anyone, so I am pleasantly surprised it does :)
And I've put up a notice in the README saying so.
@yuvipanda Thank you for your kind words. Yes, @ryanlovett has been extremely helpful (still waiting to hear from him on the NFS issue). I appreciate his kind help as well as your work on this repo (I figured that, instead of reinventing the wheel, I'd rather be closer to the top by standing on the shoulders of giants :-).
Using your repo is part of version 2 of our deployment. As I said above, I plan to use AKS in version 3 (will try with and without `acs-engine`). On other fronts, I will be working on satisfying our requirements in terms of supporting different authentication mechanisms, persistence options, and Jupyter kernels (currently working on Octave), as well as adding a WrapSpawner menu (this will help with reducing the size of images).
UPDATE: After an intensive and extensive investigation, I have figured out the underlying problem behind this (latest) NFS access issue. Basically, for some reason (I will take a look at that, but perhaps one of you knows the reason), the value for the subnet eligible for NFS access (in `/etc/exports`) ended up not matching the subnet of the client host's real IP address. This can easily be seen by comparing the relevant output from `/etc/exports` with the verbose output from the test mount command (or from `dmesg`).
<HUB_USER>@nfsserver:~$ cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes gss/krb5i(rw,sync,no_subtree_check)
#
/export/pool0/homes 10.240.0.0/16(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
<HUB_USER>@k8s-master-34182922-0:~$ sudo mount -t nfs -v nfsserver:/export/pool0/homes /tmp/nfstest
mount.nfs: timeout set for Wed May 2 05:09:06 2018
mount.nfs: trying text-based options 'vers=4,addr=10.240.0.65,clientaddr=10.255.255.5'
mount.nfs: mount(2): Permission denied
mount.nfs: access denied by server while mounting nfsserver:/export/pool0/homes
There is a clear discrepancy between relevant subnet values: 10.240.0 versus 10.255.255.
Therefore, an immediate fix (which I have applied and successfully tested) is to update the `/etc/exports` file with the correct subnet value and re-export the shared directory. However, for an automated solution, we need to figure out what causes the subnet mismatch. Any help with this will be much appreciated.
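Concretely, the manual fix amounted to something like this on nfsserver (the corrected subnet is inferred from `clientaddr=10.255.255.5` in the mount output above -- adjust to your actual client range):

```bash
# Replace the stale subnet in /etc/exports with the clients' actual range,
# then re-export. 10.255.255.0/24 is inferred from clientaddr=10.255.255.5.
sudo sed -i 's#10.240.0.0/16#10.255.255.0/24#' /etc/exports
sudo exportfs -ra
showmount -e localhost   # verify the updated export
```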
Finding the code that sets the incorrect (non-matching) subnet value was easier than I thought. It is set as a variable in the Ansible playbook of the `k8s-nfs-ansible` repo. Most likely, the best way to fix it (for reliable automation) is to pass the relevant value received from the Azure CLI to the Ansible playbook (within the `deploy.py` script).
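Roughly what I have in mind (the `nfs_client_subnet` variable name is hypothetical -- it would need to match whatever variable the playbook actually defines):

```bash
# Query the vnet's real address prefix via the Azure CLI and pass it to the
# playbook, instead of hard-coding the subnet in k8s-nfs-ansible.
SUBNET=$(az network vnet list -g sbdh_jh_v1 \
  --query '[0].subnets[0].addressPrefix' -o tsv)
ansible-playbook playbook.yml --extra-vars "nfs_client_subnet=${SUBNET}"
```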
Hmm ... the command `az network vnet list -g sbdh_jh_v1` shows the "incorrect" subnet value:
...
"subnets": [
{
"additionalProperties": {},
"addressPrefix": "10.240.0.0/16",
...
@ryanlovett @yuvipanda I would appreciate it if you could take a quick look at my three most recent comments above and share your thoughts. I don't have much expertise in how Kubernetes assigns subnet prefixes, etc.
Hey @ablekh, I'm not sure why ACS isn't using the correct subnet like it used to. This looks like an issue with Azure or acs-engine and not k8s on Azure. I suppose one could set the export to a /8 subnet so that it encompasses both the intended and actual NFS client ranges. That assumes there are no risks in doing so.
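i.e. something along these lines, keeping the other export options as generated (a sketch -- note that it overwrites /etc/exports):

```bash
# Hypothetical broadened export: a /8 covers both 10.240.x.x and 10.255.x.x.
echo '/export/pool0/homes 10.0.0.0/8(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)' \
  | sudo tee /etc/exports
sudo exportfs -ra
```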
Hey @ryanlovett , I appreciate your quick feedback to my request (not sure whether @yuvipanda has something to add to your comment, but he is always welcome to). I will explore this topic a bit further.
Running the `deploy.py` script results in the following error. Any thoughts on why this could happen? I have made some changes in our cluster configuration: updated the K8s major version to 1.8, renamed some attribute values, switched to the US East 2 region, and downgraded the K8s VM size from `Standard_E4_v3` to `Standard_E2_v3`.
./deploy.py -s <SUBSCRIPTION_ID> -n sbdh-jh-v2 -d 1 -D 128