xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Use NFS hard mount instead of soft mount to avoid RO VMs (or offer option)? #334

Open stormi opened 4 years ago

stormi commented 4 years ago

See proposal and testimony from user on forum: https://xcp-ng.org/forum/post/21940

We may also consider changing the default timeout options.

olivierlambert commented 4 years ago

I think it might be interesting to ask the Citrix storage guys. We should create an XSO to get their opinion and maybe the reasons behind their current choice.

ghost commented 4 years ago

Perhaps I can suggest always using a unique fsid= export option for each exported path on the NFS server. This ought to be documented in the docs and wiki :)

ezaton commented 4 years ago

The thing is that if NFS is served by a cluster (for example, Pacemaker), a failover event will work flawlessly if NFS is mounted with the 'hard' option on the XenServer. Otherwise, VMs will experience a (short) disk loss and the Linux ones will, by default, end up with a read-only filesystem. The simple workaround is to edit /opt/xensource/sm/nfs.py and change the line:

options = "soft,proto=%s,vers=%s" % ( to: options = "hard,proto=%s,vers=%s" % (

This is an ugly workaround, but it allows VMs to live, which is more important than the beauty of the hack.
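A minimal sketch of applying that edit from the command line (assuming the string matches your sm version; note that any update of the sm package will overwrite the file and the change has to be re-applied):

    # Back up and patch the hard-coded 'soft' in the SM NFS driver (sketch only)
    cp /opt/xensource/sm/nfs.py /opt/xensource/sm/nfs.py.bak
    sed -i 's/"soft,proto=/"hard,proto=/' /opt/xensource/sm/nfs.py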

ghost commented 4 years ago

I believe it is possible to add custom NFS mount options when adding a new SR through XOA. Have you tested this?

ezaton commented 4 years ago

Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.

olivierlambert commented 4 years ago

Yes, that's why it would require an XAPI modification. That's doable :)

I think we should keep the default behavior but allow an override: this will let the people who want to test it do so.

In theory, we should:

That should be it. @ezaton do you want to contribute?

ezaton commented 4 years ago

I am not sure I have the Python know-how, but I will make an effort during the next few days. This is a major thing I have been carrying with me since XS version 6.1 or so. Those were my early NFS cluster days. Nowadays I have many NFS clusters in many locations. So - yeah, I want to contribute. I will see if I can actually do it.

Thanks!

olivierlambert commented 4 years ago

Okay, so IIRC you might indeed check how the NFS version is passed down to the driver (from XAPI to the NFS Python file). It's a good starting point for understanding how it works; then you can do the same for the hard/soft mount setting :)

edit: @Wescoeur knows a lot about SMAPIv1, so he might assist you on this (if you have questions).

ghost commented 4 years ago

Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.

I thought subsequent mount options override previous ones. That is how we can add nfsvers=4.1, for example, isn't it? I haven't tried, but it might be worth trying.

ezaton commented 4 years ago

This is a quote from 'man 5 nfs':

    soft / hard    Determines the recovery behavior of the NFS client after an NFS request
                   times out. If neither option is specified (or if the hard option is
                   specified), NFS requests are retried indefinitely. If the soft option is
                   specified, then the NFS client fails an NFS request after retrans
                   retransmissions have been sent, causing the NFS client to return an error
                   to the calling application.

                   NB: A so-called "soft" timeout can cause silent data corruption in certain
                   cases. As such, use the soft option only when client responsiveness is more
                   important than data integrity. Using NFS over TCP or increasing the value
                   of the retrans option may mitigate some of the risks of using the soft
                   option.

Look at the note. I believe that hard should be the default - at least for regular SRs. The ISO SR is another matter. I have just forked the code. I will see if I can modify it without exceeding my talent :-)

nagilum99 commented 4 years ago

Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.

Maybe increasing that value could be a less intrusive option, and one that could be supplied without being ignored?

ezaton commented 4 years ago

These are meant to mitigate (some of) the problems caused by soft mount, instead of just mounting 'hard'. Look - when it's your virtual machine there, you do not want a momentary network disruption to kill your VMs. The safety of your virtual machines is the key requirement. Soft mount just doesn't provide it.

ezaton commented 4 years ago

I have edited nfs.py and NFSSR.py and created a pull request here: https://github.com/xapi-project/sm/pull/485

stormi commented 4 years ago

Thanks. I think you need to add context and explain why hard would be better than soft and what tests you did to have a chance of getting it merged.

ezaton commented 4 years ago

I will add all these details in the pull request.

ghost commented 4 years ago

Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.

I just tried in XOA to create a new SR with the "hard" mount option. Seems to stick when looking at the output from mount.


# mount
example.com:/media/nfs_ssd/3ec42c2f-552c-222f-3d46-4f98613fe2e1 on /run/sr-mount/3ec42c2f-552c-222f-3d46-4f98613fe2e1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.1.10,local_lock=none,addr=192.168.1.2)
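For reference, a roughly equivalent CLI invocation might look like the sketch below. It assumes the NFS SR driver accepts the optional device-config:options key for extra mount options (server, path and label here are just examples):

    # Sketch: create an NFS SR with 'hard' appended to the mount options.
    # Whether device-config:options and nfsversion are honoured depends on the sm version in use.
    xe sr-create name-label="NFS VM storage (hard)" type=nfs shared=true content-type=user \
        device-config:server=example.com \
        device-config:serverpath=/media/nfs_ssd \
        device-config:nfsversion=4.1 \
        device-config:options=hard
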
olivierlambert commented 4 years ago

@Gatak if that's the case, it's even easier :D

Can you double check it's the correct hard behavior?

ezaton commented 4 years ago

This is a change of behaviour from what I remember; however, I have just tested it, and it is true. It is consistent across reboots and across detach/reattach, so my patch is (partially) redundant. However, I believe that 'hard' should be the default for VM NFS SRs.
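For anyone repeating this check, the effective options can be verified on the host, for example with:

    # Confirm whether the SR is mounted soft or hard on this host
    mount | grep /run/sr-mount
    # Or list the NFS parameters (hard/soft, timeo, retrans) per mount
    nfsstat -m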

ghost commented 4 years ago

I believe that 'hard' should be the default for VM NFS SRs.

Yes, based on the documentation provided it does seem the safest option.

olivierlambert commented 4 years ago

Yes, but you can't decide to do this change for everyone without a consensus. We'll talk more with Citrix team to understand their original choice.

What we can do in XO: expose a menu that selects "hard" by default. This will encourage hard by default without changing it in the platform directly.

Does this sound reasonable for you?

ghost commented 4 years ago

Yes, but you can't decide to do this change for everyone without a consensus. We'll talk more with Citrix team to understand their original choice.

Sounds good. Many use soft because you could not abort/unmount a hard-mounted NFS share. But that may be an old truth.

What we can do in XO: expose a menu that selects "hard" by default. This will encourage hard by default without changing it in the platform directly.

I think it is important to mention that the NFS export should use the fsid* option to create a stable export filesystem ID. Otherwise the ID might change on reboot, which will prevent a share from being re-connected.

* https://linux.die.net/man/5/exports
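
A sketch of what that could look like in /etc/exports on the NFS server (path, network and fsid value are examples; fsid=0 is reserved for the NFSv4 pseudo-root):

    # /etc/exports -- pin a stable, explicit fsid per exported path so the export
    # identity survives server reboots and cluster failover
    /media/nfs_ssd  192.168.1.0/24(rw,sync,no_subtree_check,fsid=101)

After editing, exportfs -rv re-exports the shares with the new options.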

olivierlambert commented 4 years ago

What about NFS HA? (regarding fsid)

ezaton commented 4 years ago

What about NFS HA? (regarding fsid)

NFS HA maintains the fsid. If you set up an NFS cluster, you handle the fsid yourself, or else it doesn't work very well. For stand-alone systems, the fsid is derived from the device id, but not for clusters.

nackstein commented 4 years ago

I wrote some considerations on the forum thread about this issue and am reporting the most important one here. It seems that nfs.py already supports user options, and those get appended to the defaults. The mount command keeps the last option, so if the default is soft and the user appends hard: soft,hard = hard. The same goes for timeo and retrans. The Linux VM that goes read-only is probably due to a default in Ubuntu: there is an option in the ext2/3/4 superblock governing the behavior when errors are encountered. RHEL, on the other side, does not remount read-only and will continue (retry) to perform I/O on the disk. It remains to be verified whether the error is propagated to userspace or stays at the filesystem level inside the VM.

Using hard as the default is risky in my opinion. I have to say that on servers I usually set hard,intr, in order to protect poorly written application software from receiving I/O errors while, with the intr option, still being able to kill a process if I need to umount the filesystem. I say it's risky because if you use a lot of different NFS storage and only one server goes down for a long period, you get a semi-frozen dom0. It remains to be verified what happens to XAPI and normal operation: whether you are able to ignore the one broken NFS SR and continue working, or whether the whole XAPI (or another daemon running on dom0) gets stuck listing mount points or accessing the one broken SR. I think nobody wants to reboot a host because the NFS SR for the ISO files is down. For short downtimes, raising the NFS mount options retrans (default 3) or timeo (default 100) could be enough. The ideal solution is to have the single VM retrying on a soft mount without going read-only, so it's easy to manually recover the filesystem without rebooting the host for a stale NFS mount point. It seems that Windows has a nice default behavior, and RHEL should too; the problem could be limited to Ubuntu or other distros (to be verified).

ezaton commented 4 years ago

The Linux VM that goes read-only is probably due to a default in Ubuntu: there is an option in the ext2/3/4 superblock governing the behavior when errors are encountered. RHEL, on the other side, does not remount read-only and will continue (retry) to perform I/O on the disk. It remains to be verified whether the error is propagated to userspace or stays at the filesystem level inside the VM.

This is incorrect. All Linux servers I have had the pleasure of working with - RHEL 5/6/7, CentOS, Oracle Linux, Ubuntu and some more - mount by default with errors=remount-ro behaviour. You have to explicitly change this behaviour for your Linux not to fail(!) when NFS performs a failover with a soft mount.

XAPI and SM-related tasks are handled independently per SR - check the logs. I agree that the ISO SR should remain soft (this can still crash VMs, but it is less of a problem because the ISO is read-only to begin with), so my patch (and the proposed change to the GUI) is to use the 'hard' mount option for VM data disks and 'soft' for ISO SRs.
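
For reference, the error behaviour of each VM filesystem can be checked and changed from inside the guest (the device name is just an example):

    # Show what ext2/3/4 does when the block layer reports an I/O error
    tune2fs -l /dev/xvda2 | grep -i "errors behavior"
    # Change it if desired (continue | remount-ro | panic)
    tune2fs -e remount-ro /dev/xvda2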

ghost commented 4 years ago

on servers I usually set hard,intr, in order to protect poorly written application software from receiving I/O errors while, with the intr option, still being able to kill a process if I need to umount the filesystem.

According to https://linux.die.net/man/5/nfs the intr mount option is deprecated. However, it should still be possible to kill a process. In this case it must be one of the Xen services reading from the stale NFS share. I'm not sure how feasible it is to kill. Is it tapdisk?

I did one test yesterday with a Windows Server VM on a hard-mounted NFS server that I took offline for ~30 minutes. The VM froze and I got NFS timeouts in the XCP-ng server's dmesg, but once I started the NFS server again the freeze stopped and things went back to normal.

This did not previously work when I had the soft mount option and had not specified the fsid export option. Then XCP-ng would not reconnect and would wait forever with a stale mount.

nackstein commented 4 years ago

I made a test with Ubuntu Server 19.10, installed with default settings and without LVM. The filesystem is mounted with the continue behavior by default (as I also see on a RHEL 7):

    root@ubuntu01:~# tune2fs -l /dev/xvda2 | grep -i behav
    Errors behavior:          Continue

I tested with a script that updates a file every second on the VM. The test consists in running exportfs -uav on the NFS server to take down the share, and exportfs -rv to bring it online again. With the default SR options soft,timeo=100,retrans=3, the VM does not detect a problem for about 1 minute (I didn't measure the time precisely). After 5 minutes of downtime, the root filesystem gets remounted read-only. On the XCP host I see that the df command blocks for about 10-20 seconds and then returns its output. Once the NFS share comes back, it is almost instantly mounted again.

I repeated the test with retrans=360; I expected the client not to receive an error for a long time, but I was wrong. After about 5 minutes the root filesystem of the VM got remounted read-only.

I investigated the timeout parameter of the disk, normally in /sys/block/sd*/device/timeout, but it seems that the Xen disk does not export this parameter. I was confident that, with no timeout, a default infinite wait was implemented, but now I think I was wrong.

I still have to understand what really happens: whether the VM gets the I/O error from dom0 and then remounts read-only sooner than I expected (timeo=100 and retrans=360 should retry for about 1 hour), or whether the timeout is internal to the VM's kernel and, once exceeded, the filesystem is remounted read-only. The first case means that for some reason the NFS parameters are not enforced, while the second case means that even with a hard mount you should see the problem. So right now I am missing something.

nackstein commented 4 years ago

Some more tests. It turns out that one possible problem was how I conducted the test. I used unexport/export, and this seems to trigger error reporting to userspace even before the timeout expires. I tried with timeo=3000,retrans=10, but after about 50 seconds the VM remounted read-only, and an ls command on the XCP host returned an error after a few seconds instead of waiting. That was with unexport/export.

I then tried null routing as suggested on the forum: ip route add <xcp host ip/32> via 127.0.0.1 dev lo to block all traffic between the NFS server and the XCP host, and then ip route del to roll back. Now, after 5 minutes, the VM does not get an error with timeo=3000,retrans=10, and commands on the host like df block; the NFS mount honors the configured timeout.

I'm going to retest with timeo=100,retrans=360 to be sure it works and to verify how the tcp timeouts interact.

I think this tells us two things: 1) the xvda disk does not have timeouts; 2) in case of IP failover on the NFS server, it should be safer to create the exports first and then configure the IP, rather than vice versa. This lets the share appear from the first moment the VIP is reachable again and avoids errors being propagated to userspace.
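
For reference, the two ways I simulated the outage (the IP below is just a placeholder for the XCP-ng host address):

    # Method 1 -- unexport/re-export on the NFS server; errors tend to reach clients quickly
    exportfs -uav        # unexport all shares
    exportfs -rv         # re-export them
    # Method 2 -- null-route the client so packets are silently dropped and the
    # configured timeo/retrans values actually apply (run on the NFS server)
    ip route add 192.0.2.10/32 via 127.0.0.1 dev lo
    ip route del 192.0.2.10/32   # roll back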

stormi commented 4 years ago

Just a quick word to say that this discussion is very interesting, whatever what the outcome will be. I'm following it closely.

ezaton commented 4 years ago

in case of IP failover on the NFS server, it should be safer to create the exports first and then configure the IP, rather than vice versa. This lets the share appear from the first moment the VIP is reachable again and avoids errors being propagated to userspace.

This is because of the 'soft' mount. With hard mounts, the system would keep attempting the mount even when the share is not yet presented on the destination IP.

ezaton commented 4 years ago

Which, by the way, is the "normal" way HA clusters function - first IP address, then disk, and then exports.

nackstein commented 4 years ago

I made some more tests and tried to better understand the soft mount. I had misinterpreted the meaning of the timeo option. With a soft mount, the maximum timeout before a "minor timeout" triggers is 600 seconds, and this value is not configurable. timeo is the time between retries, but once 600 seconds have passed, the "minor timeout" notifies userspace of the error and writes a kernel message. Once all the configured retries have failed, the "major timeout" passes the error to userspace and gives up retrying. I was able to recover the VM after 8 minutes, but at 10 minutes, no matter what values of timeo and retrans, the error is propagated to userspace.

The algorithm should approximately be like this:

    #!/bin/bash

    TIMEO=100
    RETRANS=3
    TIMEOUT=600

    while true; do
        echo "retrying NFS request and wait $((TIMEO/10))s"
        sleep $((TIMEO/10))
        TIMEOUT=$((TIMEOUT - (TIMEO/10)))
        RETRANS=$((RETRANS - 1))
        if [ $RETRANS -eq 0 ]; then
            echo "hard timeout: giving up retrying"
            break
        fi
        if [ $TIMEOUT -le 0 ]; then
            echo "soft timeout: server not responding"
            TIMEOUT=600
        fi
    done

What I still don't get is why the filesystem mounted with errors=continue gets remounted read-only...

ghost commented 4 years ago

Did you actually mount with -o errors=continue? I wonder how XFS, Btrfs, ZFS and the like behave.

The 10-minute deadline on soft failure seems like quite a severe problem. It should work fine for short network blips, but not for anything longer, like a full server reboot (which could take longer if some manual fixing is needed).

ezaton commented 4 years ago

So let me get this straight - you are trying to perform a test where you customise the mount options and the VM settings, all to avoid 'hard' in the NFS mount options?

Hard mount means that no IO gets lost on the way to the storage. It means that if your VM "thinks" it has written something, it will either get written, or the VM will be left diskless (and thus crash consistent). It means that any IO meant for the storage is accounted for, and no lost IO operations can corrupt, say, the database residing on your virtual machine, because the DB author did not plan for IO errors reaching the application layer. And this is especially true when you're dealing with HA solutions, where hard is the recommended way of mounting. More than that - Oracle DB supports NFS mounts, except that they have to be defined 'hard' and 'nointr' (because you do not want to lose IO operations to the disk and "think" they might have made it); VMware uses 'hard' mounts for NFS; and even XCP-ng 8.0 (I have just checked, but I bet other versions as well) uses 'queue_if_no_path' for multipathing on LVM-based SRs, which means the IO is frozen if all paths to the LUN are lost.

All this, and you are still looking for ways to twist everything else - soft mount options, timeouts, how clusters perform failover (bringing the IP up after the NFS share, for example) - just to avoid the 'hard' mount option, which is the only(!) consistent choice here.

I support the idea of allowing selection in the GUI, where 'hard' is the default (for VM SRs only; I support 'soft' NFS mounts for ISO SRs), but it does not solve the problem for people who upgraded from previous versions of XCP-ng or XS. They remain with 'soft' mount options. I am a great supporter of XS (and XCP-ng) in production environments, and I support many such environments - some of them extremely large - and for those using NFS SRs, it has happened more than once that I was required to reboot production Linux machines (RHEL 5/6/7, Oracle Linux 6/7 mainly) because their OS switched to RO mode due to IO errors the VMs should never have seen.

It is bad practice to set errors=continue for Linux filesystems - it might lead to data corruption - so your point would be proven: you would have a soft mount, but every time your NFS cluster fails over, you would have a substantial chance of data corruption on some of your Linux VMs. Think about it. Is it worth it?

nackstein commented 4 years ago

Did you actually mount with -o errors=continue? I wonder how XFS, Btrfs, ZFS and the like behave.

The 10-minute deadline on soft failure seems like quite a severe problem. It should work fine for short network blips, but not for anything longer, like a full server reboot (which could take longer if some manual fixing is needed).

Yes, I have errors=continue:

    root@ubuntu01:~# tune2fs -l /dev/xvda2 | grep -i beha
    Errors behavior:          Continue

I tried to verify whether the read-only problem was not on the filesystem but on the underlying block device, but this is the output of blockdev once the filesystem is read-only:

    root@ubuntu01:~# blockdev --report
    RO    RA   SSZ   BSZ   StartSec           Size   Device
    ro    256  512   1024         0       93454336   /dev/loop0
    ro    256  512   1024         0       57352192   /dev/loop1
    rw    256  512   4096         0    10737418240   /dev/xvda
    rw    256  512   4096      2048        1048576   /dev/xvda1
    rw    256  512   4096      4096    10734272512   /dev/xvda2
    rw    256  2048  2048         0        7751680   /dev/sr0

ghost commented 4 years ago

IMHO this is an important question. The XenServer default is soft, and to make an informed decision to change it we need to document and understand the differences and consequences of each option in different scenarios. Ideally, we should document best practices on the wiki.

ezaton commented 4 years ago

This is exactly the point. There were reasons years ago for the 'soft' directive, and they probably had to do with the inability to handle a stale NFS mount when the server is gone. That is normal (and expected, and even, if I may say so, desirable) behaviour, because it keeps data integrity guaranteed. However, back then the entire SM mechanism would probably have halted - probably because of the 'df' command or one of its family members, which tend to hang when a stale NFS mount exists. Being unable to handle command timeouts and logically understand that there is a problem with a specific SR - that is a problem.

I think XS (and, as a by-product, XCP-ng) has come a long way since. VM disk integrity is vital; short IO errors should not reach the VM - either it's working, or it's dead. There should never, in my opinion, be a middle ground where the VM gets an IO error because the host had to deal with whatever happened (be it a network glitch, an NFS cluster failover, an NFS server reboot - whatever). VMs should be entirely agnostic of the under-layer. VMs are affected by hardware performance (storage, network, CPU and even memory limitations), but they should not experience IO errors (much like a VM would not experience a network cable disconnection: it will "feel" network packets getting lost, but not an unplugging event).

As I have mentioned before, for my systems using NFS I used to define 'hard' in nfs.py, because otherwise, after any network or NFS event, I would have had to reboot half my Linux machines (those which got IO errors and switched their OS mount to read-only). While this can be averted today by using the 'hard' directive in the custom mount options (I tested it in the past and it didn't work, so I am amazed it does now, but I can't complain), it is not the default, and it could reduce the level of trust in XS/XCP-ng because of disk reliability and behaviour during NFS failover. The uninformed (those who look for a virtualization solution but are not proficient with Linux, NFS behaviour and the like) would see this (IO errors for VMs, disks turning read-only, and possible data corruption) as a problem and blame it on the virtualization platform. I can't really blame them for that.

nackstein commented 4 years ago

So let me get this straight - you are trying to perform a test where you customise the mount options and the VM settings, all to avoid 'hard' in the NFS mount options?

I had a lot of trouble in the past with the hard mount option, and while that was mostly due to buggy software that may just be a thing of the past, I want to consider that Citrix chose the soft option, and I suppose for a reason. Even with all that trouble I always chose hard, because I didn't know how the software stack would react to a filesystem error. In this case I wanted to run some tests to understand whether it's possible to have a robust solution with the soft option. I didn't claim to know what the correct solution is; I just pointed out possible problems to consider and add to the equation. Like you, my top priority is data integrity, but I won't forget about manageability. So I want to verify whether it's possible to have both. Is the soft option wrong? Fine, I accept it. Is testing worth my time? Absolutely.

Back to the interesting things: I tried the hard mount option (all other defaults, just "hard" added in the user mount options) and, after about 3 minutes, the Ubuntu Server 19.10 VM still gets the read-only filesystem problem if I test with unexport/export.

With the same configuration, but testing by changing routes instead, there is no problem and the VM continues working once the NFS share is reachable again.

I think the timeout for the missing share is explained in the man page under the retry=n option: "Note that this only affects how many retries are made and doesn't affect the delay caused by each retry. For UDP each retry takes the time determined by the timeo and retrans options, which by default will be about 7 seconds. For TCP the default is 3 minutes, but system TCP connection timeouts will sometimes limit the timeout of each retransmission to around 2 minutes."

This means there is a corner case in which even the hard mount option can return an I/O error to userspace, and in our case the VM filesystem goes read-only. I didn't find any way to change this 3-minute default, so even with the hard mount option, in case of an NFS cluster switch it's advisable to export the share before assigning the VIP. I don't know the behavior of appliances like NetApp; I will check what the Serviceguard+NFS integration does, as it's the only enterprise software I can check. On Linux clusters there are plenty of options and configuration possibilities.

StreborStrebor commented 4 years ago

I really hope you manage to fix this, also for current soft mounted NFS SRs.

For me, Debian VMs going RO (Debian by default uses Continue for the errors behaviour) and requiring a manual fsck after an outage of the NFS server or network is a very vulnerable point of XCP-ng (as it was before on XenServer; I never got round to figuring out whether there might be a solution to this behaviour). These outages hardly ever occur, but when they do - like they did last week for me - they cause havoc.

I would much rather have my running VMs frozen with data intact during an outage, than crashed VMs after an outage, requiring manual fsck and hoping I don't need to restore backups.

So, if hard mounted NFS SRs (and soft mounted ISO SRs) are the best option (now, maybe not in the past) then please give users the option!

olivierlambert commented 4 years ago

@StreborStrebor feel free to use the hard option directly in XO at SR creation. We'll probably add a selector to assist users with the choice. Then please report whether it actually solves your issue :)

StreborStrebor commented 4 years ago

OK, I'll do the following:

  1. Add a new SR on a spare NAS here today, using the NFS hard mount option
  2. Move a test VM with (simulated) IO to the hard mounted SR
  3. Mess around with the NAS (reboot it, disconnect the network) and see if the Debian VM survives
  4. Send you the results of this test here
olivierlambert commented 4 years ago

Perfect! This feedback will be helpful for everyone :+1:

StreborStrebor commented 4 years ago

Ok guys,

Problem: When NFS storage is unavailable (server or network interruptions) then VM filesystems go RO and that crashes stuff, resulting in manual fsck after reboot on each VM.

Possible fix: Based on this discussion, there is a possibility that changing the NFS mount from soft to hard will solve the behaviour and simply freeze a VM until the filesystem is available again.

Test: Disrupt NFS storage on a running Debian VM, see if it survives an outage of an hour.

TL;DR: Hard mounted NFS (4.1, I did not test other versions, yet) Debian VMs on XCP-ng 8.0 hosts don’t crash during long NFS storage outage and the filesystem stays fine (no fsck required)

The full story:

I set up an NFS 4.1 share on a modern Synology NAS running the latest Synology OS. I mounted the storage using NFS 4.1 on my XCP-ng 8.0 pool (with all yum updates) using XenCenter (Windows software), and in the options field I entered: hard. All other values are the XCP-ng defaults. I checked/confirmed that the NFS storage was mounted hard using the xe sr-list command:

    name-label ( RW): NFS VM storage on NAS02 (nfs4.1, hard)

I installed a new Debian 10 VM on the hard-mounted storage. I made sure with apt update that it was up to date and had the Xen tools installed. By default it had the error behavior Continue (checked with tune2fs -l /dev/xvda2).

I wrote a simple bash script that every 5 seconds (12 times) echoes the current date, uptime and load average to the top of a simple html file on the filesystem. In cron, every minute, I call the bash script on the server. So every 5 seconds the date, uptime and load average are written to this html file on the filesystem of the test VM. I installed Apache on the VM, which serves the html file with all the timestamps. On another local Debian VM (on a different storage, but in the same XCP pool) I configured a cron job that every minute does a wget, requests the html page from the test VM and saves the page to a temp directory. I tested that all of the above works.
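
A minimal sketch of that kind of probe (paths and names here are just illustrative, not my exact script):

    #!/bin/bash
    # /usr/local/bin/nfs-probe.sh -- prepend "date - uptime/load" to an html file
    # every 5 seconds, 12 times per run (called from cron once a minute)
    OUT=/var/www/html/probe.html
    touch "$OUT"
    for i in $(seq 12); do
        echo "$(date) - $(uptime)" | cat - "$OUT" > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"
        sleep 5
    done
    # crontab on the test VM:      * * * * * /usr/local/bin/nfs-probe.sh
    # crontab on the observer VM:  * * * * * wget -q -O /tmp/probe.html http://testvm/probe.html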

The test: I opened an SSH shell connection to the Debian test server. At 12:37:42 I disabled NFS on the NAS (emulating an unavailable NFS server / network outage).

Tests during the NFS outage, on the VM:
During the NFS downtime I saw in the shell that the uptime load levels were high and kept climbing.
Conclusion: The VM does not freeze when the NFS storage is unavailable. Only its filesystem freezes; processes and memory are still alive.

Tests during the NFS outage, on the XCP pool master host:
During the NFS downtime I every now and then ran the shell command xe sr-list on the pool master XCP server (this did not give any problems and the output was quick).
During the NFS downtime I ran the shell command df -h on the pool master XCP server (this command stalled, but could easily be killed with CTRL+C).
During the NFS downtime, other VMs running on another storage ran without any problems or hiccups.
Conclusion: xe storage commands don't seem to freeze. df commands do stall, but can be killed without a problem. Other VMs on other storage but in the same pool run fine during the NFS outage.

After 60 minutes I enabled NFS on the NAS again. At that time the load average on the VM was:
13:38:27 up 14:40, 1 user, load average: 69.04, 64.98, 54.21
Within 2 minutes the VM was alive again and the disk was not mounted RO. The filesystem seemed to be fine.
Load average after 2 minutes: 13:40:27 up 14:42, 1 user, load average: 30.67, 56.06, 52.54
Load average after 3 minutes: 13:41:26 up 14:43, 1 user, load average: 12.24, 46.63, 49.51
Load average after 4 minutes: 13:42:26 up 14:44, 1 user, load average: 4.50, 38.14, 46.41
Load average after 5 minutes: 13:43:28 up 14:45, 1 user, load average: 1.52, 30.68, 43.27
Load average after 6 minutes: 13:44:26 up 14:46, 1 user, load average: 0.60, 25.52, 40.78
Also, running tune2fs -l /dev/xvda1 on the test VM shows that the filesystem is still clean after this 1 hour+ NFS outage.

Yesterday I did a similar test, but then for 5 hours on the same VM, and it survived that outage too.

See the output of the html file attached to this post (some excerpts below). The VM starts writing timestamps and load averages to the html file, but it does look like the VM is trying to catch up on all the stuck/blocked/frozen writes to the html file. This is visible in the multiple non-unique timestamps (when running normally, timestamps should be unique and have 5-second gaps). The timestamps use the bash $(date) function, and it looks like this date value is being rendered at the moment the file is actually written.

So the big question is: I wonder whether MySQL and other software has a filesystem timeout built in, to protect against weird side effects of delayed/frozen reads/writes. This needs more investigation.

The final test is a reboot of the test VM (reboot via XenCenter). This goes fine, no disk/file problems at all.

So my main conclusion is: Hard mounted NFS (4.1, I did not test other versions, yet) Debian VMs on XCP-ng 8.0 hosts don’t crash during long NFS storage outage and the filesystem stays fine (no fsck required)

Excerpts from the log file:

Cut here, Running fine again, low load: Thu Mar 12 13:47:11 CET 2020 - 13:47:11 up 14:49, 1 user, load average: 0.04, 14.68, 34.13

Cut here, Running fine again, high load: Thu Mar 12 13:40:46 CET 2020 - 13:40:46 up 14:42, 1 user, load average: 23.88, 53.32, 51.69 Thu Mar 12 13:40:41 CET 2020 - 13:40:41 up 14:42, 1 user, load average: 25.96, 54.22, 51.97 Thu Mar 12 13:40:36 CET 2020 - 13:40:36 up 14:42, 1 user, load average: 28.22, 55.13, 52.25 Thu Mar 12 13:40:35 CET 2020 - 13:40:35 up 14:42, 1 user, load average: 28.22, 55.13, 52.25

Cut here, When NFS came back up and the VM starting writing again, high load and duplicate timestamps: Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:49 CET 2020 - 13:39:49 up 14:41, 1 user, load average: 59.81, 64.10, 54.85 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:44 CET 2020 - 13:39:44 up 
14:41, 1 user, load average: 65.02, 65.18, 55.15 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45 Thu Mar 12 13:39:39 CET 2020 - 13:39:39 up 14:41, 1 user, load average: 70.68, 66.28, 55.45

Cut here, Just before I disrupted NFS connection: Thu Mar 12 12:37:46 CET 2020 - 12:37:46 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:41 CET 2020 - 12:37:41 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:36 CET 2020 - 12:37:36 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:31 CET 2020 - 12:37:31 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:26 CET 2020 - 12:37:26 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:21 CET 2020 - 12:37:21 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:16 CET 2020 - 12:37:16 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:11 CET 2020 - 12:37:11 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:06 CET 2020 - 12:37:06 up 13:39, 1 user, load average: 0.00, 0.00, 0.00 Thu Mar 12 12:37:01 CET 2020 - 12:37:01 up 13:38, 1 user, load average: 0.00, 0.00, 0.00

vm-test-date-load-file.txt

ezaton commented 4 years ago

This is a very thorough test, and it's great. In order to understand the "edge" cases, a few more minor actions might be in order:

  1. Snapshot/Clone/Migrate VMs residing on other SRs. This will ensure that every SR handler stands alone and has no dependencies on other SRs.
  2. Attempt the same with the "faulty" SR. I believe that Xapi will have stale tasks.
  3. A 'df -h' command on a stale NFS mount should probably hang, unless the NFS mount has the 'intr' directive or you did not wait long enough for the NFS mount to go stale.
  4. I would have attempted action #1 both before and after action #2. This tests whether the SM process gets hung by the NFS commands and cannot spawn additional commands (for other SRs), or not. This is an important test, and it reflects on the whole XCP SM mechanism if it fails there. This can be a huge one - because the SM mechanism should(!) handle each SR independently. If it cannot do so, that is a major flaw.
ezaton commented 4 years ago

Oh, and about applications getting the disk timeouts - no, they hardly ever do. Most applications will either hold an operation timeout (task X should complete within a defined time frame) or respond to IO errors reported by the underlying filesystem, but for stale disks (say, very slow disks - very, very slow), they just wait. I have had a lot of experience with this and Oracle RDBMS, which supports NFS and requires the directives hard,nointr and some more (less relevant in our case). The combination of both directives means that for a stale NFS share you just wait until the end of the world, or until your NFS share becomes available. The application cannot 'break' and the depending processes just hang there waiting (and maybe fail afterwards, but that is when the disk is available again, and they can crash responsibly).

StreborStrebor commented 4 years ago

@ezaton

Good ideas!

I'll run these suggested tests as soon as I find a moment.

Edit: It's been a very busy day here, and I haven't got round to it yet. But I hope to find some time this weekend or on Monday. I have worked out all the tests that need to be done.

StreborStrebor commented 4 years ago

Hi guys,

I have good news! I ran the tests (see the test and results below) and all went well.

Conclusion: In a pool with multiple SRs, hard mounted on multiple NFS storage devices, when one of the hard-mounted SRs goes down or becomes unavailable, there is no damage or downtime to running VMs and VM disks that are on another SR still available to the pool - not even when commands like df, pvs and lvs are run on any of the pool's host servers. Those storage-related commands do hang on the console, but they do not freeze the pool, hosts or VMs.

The test environment:

Pool with two XCP-ng 8.0 servers (Xen30 pool master and Xen31 pool member). All start/stop/copy/move/migrate/clone/snapshot operations were done from XCP-ng Center 20.03.00.

Main VM storage: Nas03 - Synology NAS, with NFS 3 soft mount
Test VM storage: Nas02 - Synology NAS, with NFS 4.1 hard mount (the NFS storage we're going to disrupt again)
Extra VM storage: Nas01 - Synology NAS, with NFS 3 hard mount

TestVM1: on Nas02: Debian 10 VM (this VM will lose its storage)
TestVM2: on Nas03: Debian 9 VM (Fun0001)
TestVM3: on Nas03: Windows 10 VM (Lukes VM)

The tests:

1: Disable SR (Nas02, hard NFS4.1) for Debian TestVM1 again

2A: Snapshot running VMs (2 and 3) (in the same pool, on the same host: Xen30) with storage on another SR (Nas03)
2B: Power off and clone (copy) VMs (2 and 3) (in the same pool, on the same host: Xen30) with storage on another SR (Nas03 to Nas01)
2C: Delete the VM (2 and 3) clones from step 2B
2D: Power on the original VMs (that were cloned)
2E: Migrate (memory/processes migration) running VMs (2 and 3) (in the same pool, on the same host) with storage on another SR (Nas03) to another host in the pool (and migrate back again)
2F: Migrate (storage migration) running VMs (2 and 3) (in the same pool, on the same host) with storage on another SR (Nas03) to another SR (Nas01) in the pool (and migrate back again)
2G: Delete the VM (2 and 3) snapshots from step 2A

3A: On the Xen30 pool master, run command: xe sr-list
3B: On the Xen30 pool master, run command: xe cd-list
3C: On the Xen30 pool master, run command: xe network-list
3D: On the Xen30 pool master, run command: xe vm-list
3E: On the Xen30 pool master, run command: df -h
3F: On the Xen30 pool master, run command: pvs
3G: On the Xen30 pool master, run command: lvs
3H: On the Xen30 pool master, run command: list_domains
3I: On the Xen30 pool master, run command: iostat -d 2 6
3J: On the Xen30 pool master, run command: more /etc/mtab

4A: Snapshot running VMs (2 and 3) (in the same pool, on the same host: Xen30) with storage on another SR (Nas03)
4B: Power off and clone (copy) VMs (2 and 3) (in the same pool, on the same host: Xen30) with storage on another SR (Nas03 to Nas01)
4C: Delete the VM (2 and 3) clones from step 4B
4D: Power on the original VMs (that were cloned)
4E: Migrate (memory/processes migration) running VMs (2 and 3) (in the same pool, on the same host) with storage on another SR (Nas03) to another host in the pool (and migrate back again)
4F: Migrate (storage migration) running VMs (2 and 3) (in the same pool, on the same host) with storage on another SR (Nas03) to another SR (Nas01) in the pool (and migrate back again)
4G: Delete the VM (2 and 3) snapshots from step 4A

5: Enable SR (Nas02, hard NFS4.1) for Debian TestVM1 again

The test results: (test number, time completed, remarks)

@1 17:01 NAS2 disable OK

@2A 17:02 OK - Snapshot OK
@2B 17:15 OK - Power off and clone OK
@2C 17:16 OK - Delete clones OK
@2D 17:17 OK - Power on originals again OK
@2E 17:20 OK - Migrate (memory/processes migration) running VMs OK
@2F 17:45 OK - Migrate (storage migration) running VMs OK
@2G 17:47 OK - Delete VM (2 and 3) snapshots OK

@3A 18:09 OK xe sr-list
@3B 18:09 OK xe cd-list
@3C 18:09 OK xe network-list
@3D 18:10 OK xe vm-list
@3E 18:13 df -h NOT OK: df -h just sat there waiting. CTRL+C (after 120 seconds) stopped it without any problem and immediately returned the prompt. Read and write operations to another VM in the pool on NAS03 went fine.
@3F 18:15 pvs NOT OK: pvs just sat there waiting. CTRL+C (after 120 seconds) stopped it without any problem and immediately returned the prompt. Read and write operations to another VM in the pool on NAS03 went fine.
@3G 18:16 lvs NOT OK: lvs just sat there waiting. CTRL+C (after 120 seconds) stopped it without any problem and immediately returned the prompt. Read and write operations to another VM in the pool on NAS03 went fine.
@3H 18:17 OK list_domains
@3I 18:17 OK iostat -d 2 6 (iostat does need CTRL+C to stop, but no further problem)
@3J 18:18 OK more /etc/mtab ("more /etc/mtab | grep nas02" gave an empty result)

@4A 18:21 OK - Snapshot OK
@4B 18:29 Power off and clone: VM2 OK, VM3 NOT OK, but that is because by accident I tried to copy it to NAS02, which is down. Cancelling the copy action is not having any effect, so I will wait till NAS02 is back online.
@4C 18:29 Delete clone OK
@4D 18:30 Power on originals again: VM2 OK, VM3 not OK, no options at all. I'll wait a little longer…
@4E 18:32 Migrate (memory/processes migration) running VMs: VM2 OK
@4F 19:14 Migrate (storage migration) running VMs: VM2 OK

I took a break and continued at 21:35. I saw that the disk copying of VM3 to NAS02 had now failed in the XenCenter and the VM was available to restart again:

@4D 21:35 OK Powered on VM3 as well
@4F 21:43 OK Migrated VM3 as well
@4G 21:44 Delete VM (2 and 3) snapshots OK

@5 After enabling the SR on NAS02 again, VM1 came back to life (also, commands like df, pvs and lvs no longer freeze on the pool's host servers):

Mon Mar 16 21:59:36 CET 2020 - 21:59:36 up 4 days, 23:01, 0 users, load average: 0.07, 44.75, 211.54 ..cut.. Mon Mar 16 21:48:02 CET 2020 - 21:48:02 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 ..cut.. Mon Mar 16 21:48:01 CET 2020 - 21:48:01 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:01 CET 2020 - 21:48:01 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:01 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:01 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:01 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 Mon Mar 16 21:48:00 CET 2020 - 21:48:00 up 4 days, 22:49, 0 users, load average: 462.42, 458.35, 447.22 ...nas down between these timestamps... Mon Mar 16 17:01:26 CET 2020 - 17:01:26 up 4 days, 18:03, 1 user, load average: 0.03, 0.02, 0.00 Mon Mar 16 17:01:21 CET 2020 - 17:01:21 up 4 days, 18:03, 1 user, load average: 0.03, 0.02, 0.00 Mon Mar 16 17:01:16 CET 2020 - 17:01:16 up 4 days, 18:03, 1 user, load average: 0.04, 0.02, 0.00 Mon Mar 16 17:01:11 CET 2020 - 17:01:11 up 4 days, 18:03, 1 user, load average: 0.04, 0.02, 0.00 Mon Mar 16 17:01:06 CET 2020 - 17:01:06 up 4 days, 18:03, 1 user, load average: 0.04, 0.02, 0.00

StreborStrebor commented 4 years ago

So...

It would be great if it became possible to remount existing soft-mounted NFS SRs with the hard option, keeping all VMs nicely attached to their VM disks (but now on a hard-mounted NFS SR).

Now the only option - with no downtime at all - seems to be to:

  1. Move all VM disks to another SR (SR2)
  2. Forget the original SR1
  3. Reattach the original SR1 with option hard
  4. Move the disks back to SR1 (now hard mounted)

The problem here is that disk migration has never been the fastest of operations. It takes many hours to migrate 1 TB worth of disks (and many hours to migrate them back again).

I can't really think of a simple way to:

  1. Stop all VMs
  2. Unmount the soft-mounted NFS SR
  3. Remount the NFS SR hard-mounted
  4. Keep all VMs attached to their disks (but maybe, if reattaching to the existing SR UUID at step 3, this would actually work - see the sketch below)
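A rough, untested sketch of what steps 3 and 4 might look like with the standard xe PBD commands (UUIDs, server and path are placeholders; try this on a test SR first):

    # With the VMs on the SR shut down: recreate each host's PBD with the extra option
    xe pbd-list sr-uuid=<sr-uuid>          # note host-uuid and device-config of each PBD
    xe pbd-unplug uuid=<pbd-uuid>
    xe pbd-destroy uuid=<pbd-uuid>
    xe pbd-create sr-uuid=<sr-uuid> host-uuid=<host-uuid> \
        device-config:server=<nfs-server> device-config:serverpath=<path> \
        device-config:options=hard
    xe pbd-plug uuid=<new-pbd-uuid>
    # The SR (and its VDIs) keep their UUIDs, so the VM disks stay attached.
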
StreborStrebor commented 4 years ago

One test still comes to mind though:

A (rotating pool) host upgrade to a new minor XCP-ng version (say 8.0 to 8.1): will a hard-mounted NFS SR stay hard-mounted after the upgrade?

That's one test I can't afford to do at the moment..

ezaton commented 4 years ago

I would add a single test to the flow suggested by @StreborStrebor: perform large deletions of snapshots, to bring the slow coalesce process into the party. As I have mentioned before, I have been using NFS for years now - really, since around XS version 5.x or so. The soft mount has always caused problems, and starting at a certain point in time I have been manually modifying the nfs.py code to hard mount. This way, even previously soft-mounted shares get remounted 'hard' when the node reboots, and no data migration is needed (the current cluster I have holds about 20 TB of VDIs - no way in hell I can move them around easily). The suggested change to the code, which enforces hard mount by default for all VM SRs, would do just this: any VM SR would be mounted 'hard'. I firmly believe that 'hard' should be the default, and the only option, for NFS VM SRs. Otherwise the user is in danger of data corruption, which is the worst possible outcome for a virtualization platform user.