xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Live migrating HVM Linux (and others?) with storage migration from any older release to 7.6 works but VM dead #111

Closed by oallart 5 years ago

oallart commented 5 years ago

Situation:

Any OS (tested with multiple; CentOS 7.6 for reference) running in a VM on a pre-7.6 XCP-ng server using local storage works fine. When live migrating the running VM to a newer XCP-ng 7.6 host, everything seems to work as usual, but the VM is dead on arrival: no console (white screen), yet the VM is marked as running. The migration was done in XCP-ng Center.

More detail:

It looks like something changed in 7.6. We haven't found any documentation on this so far, nor any similar report. For the time being we have been forced to revert to 7.5.0-2. Happy to provide more detail on request.

nicodemus commented 5 years ago

I had the same issue going from XenServer 6.5 to XCP-NG 7.6. All of the PV guests migrated fine, but every single HVM VM was dead after migration. Some had 6.5 tools installed, some had 7.5, some had none at all. All VMs were Linux, various distros and various versions. The VMs showed as running, but the console was non-responsive and the guests were soft-locked. Restarting them was the only way to get them back.

stormi commented 5 years ago

This is interesting. I had the same issue yesterday while doing some tests related to #90 (I built a version of XAPI that would allow storage motion during a rolling pool upgrade) and was wondering if it was because of my nested virtualization setup. Turns out it wasn't.

I'll try to reproduce in XenServer.

stormi commented 5 years ago

I have reproduced in a nested XenServer VM. Unless I failed my test, this proves that the same issue exists in XenServer. The migration was done in Xen Orchestra, so that's not specific to XCP-ng Center or XenCenter.

stormi commented 5 years ago

I have reported the issue to XenServer's team: https://bugs.xenserver.org/browse/XSO-924

nicodemus commented 5 years ago

Forgot to add: my setup uses shared NFS storage, as opposed to the submitter's local storage.

oallart commented 5 years ago

Thanks @stormi

stormi commented 5 years ago

@nicodemus was the migration within a pool being upgraded, or did you create a separate pool for XCP-ng and migrate the VMs from the XS pool to the XCP-ng pool?

nicodemus commented 5 years ago


It was a pool of two servers being upgraded. I evacuated one node and upgraded it from XenServer 6.5 to XCP-NG 7.6. The issue occurred when migrating the VMs off the remaining XS 6.5 box to the XCP-ng 7.6 box. Every HVM guest migrated 'successfully' but was dead and had to be hard reset. All PV guests migrated without a hitch.

stormi commented 5 years ago

I've made more tests, here are the results:

For the CentOS 7 VM, migrations were made from XS 6.5, XCP-ng 7.4 and XCP-ng 7.5 towards either XS 7.6 or XCP-ng 7.6: same result (dead VM) in every case. XS 7.6 to XCP-ng 7.6 works fine.

For CentOS 6.6 PV and Windows 7, I only tested migration from XS 6.5 to XS 7.5 and then from XS 7.5 to XS 7.6.

It does not matter whether the VM is migrated from or to local storage or shared storage. What matters is that the VDI has to be migrated, which was the case in all my tests since those were cross-pool migrations.

olivierlambert commented 5 years ago

So in short, the bug is triggered with Xen Storage Motion while moving to an XCP-ng (or XS) 7.6 host, for some HVM guests, correct? (Regardless of cross- or intra-pool?)

stormi commented 5 years ago

Yes, though intra-pool will require a modified XAPI, because otherwise it won't allow you to migrate with Xen Storage Motion during a pool upgrade (#90).

And "some HVM guests" seems to be all Linux HVM guests for now, and possibly some others.

oallart commented 5 years ago

XS has reported the reason for the issue and a workaround:

Until this is fixed, to work around this either install the VM from one of the other HVM Linux templates (e.g. CentOS 7) or if the VM already exists, set the device id (xe vm-param-set uuid=... platform:device_id=0001) and reboot the VM before migrating it to XS 7.6.

I'm not happy with this part:

reboot the VM before migrating

Some people have also said that the issue affects them even with platform:device_id=0001 set.

Indeed, most of our systems are created with "other media" and the platform device ID is not set. Digging into that platform device ID yields https://xenbits.xen.org/docs/4.6-testing/misc/pci-device-reservations.txt, which indicates it is a PCI device-ID reservation mechanism that has been around for a while. My question is: why is this affecting us now?
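For reference, the workaround from the XS ticket boils down to something like the following; this is only a sketch, with <vm-uuid> as a placeholder, and per the reports above it may not help in every case:

xe vm-param-get uuid=<vm-uuid> param-name=platform        # check the current platform flags
xe vm-param-set uuid=<vm-uuid> platform:device_id=0001    # set the device ID as suggested in XSO-924
# then reboot the VM before migrating it to the 7.6 host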

stormi commented 5 years ago

@oallart were your own VMs installed with the "Other install media" template?

oallart commented 5 years ago

@stormi absolutely, and all CentOS 7.x in my tests. We typically PXE-boot our VMs to start with.

stormi commented 5 years ago

It will be fixed in the next releases of XenServer and XCP-ng.

The patch seems to be https://github.com/xapi-project/xenopsd/commit/67e12a1d0d8f141285dfe208fc5d5ca67b6ce6fa

stormi commented 5 years ago

I have built a version of xenopsd with a backport of the patch that should fix this issue for XCP-ng 7.6. I'm not sure whether I will push it to everyone after the tests, but at least it's available to anyone who finds this bug report, and I'm nevertheless highly interested in test results from anyone who still has a setup that allows testing it.

Testing an update candidate basically means installing it, either reproducing the migration scenario if you can or simply continuing with normal operations, and reporting back.

The patch fixes the live migration from older releases of XS or XCP-ng towards XCP-ng 7.6, for VMs that don't have platform:device_id set, that is mostly VMs created with the "other installation media" template.
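If you want to check which of your VMs fall into that category before testing, a rough check from dom0 could look like this (a sketch, not from the thread; it relies on xe vm-param-get returning an error when the device_id key is absent from the platform map):

for uuid in $(xe vm-list is-control-domain=false params=uuid --minimal | tr ',' ' '); do
  if ! xe vm-param-get uuid="$uuid" param-name=platform param-key=device_id >/dev/null 2>&1; then
    echo "no device_id set: $(xe vm-param-get uuid="$uuid" param-name=name-label)"
  fi
done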

To install it:

yum update xenopsd xenopsd-xc xenopsd-xenlight --enablerepo='xcp-ng-updates_testing'

To reinstall the previous version:

yum downgrade xenopsd xenopsd-xc xenopsd-xenlight
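Not spelled out above, but since xenopsd is part of the toolstack, the updated packages presumably only take effect after a toolstack restart (the pool-upgrade caveat later in this thread points in the same direction). A quick way to verify and apply, as an untested sketch:

rpm -q xenopsd xenopsd-xc xenopsd-xenlight    # confirm which versions are installed
xe-toolstack-restart                          # restart the toolstack so the updated xenopsd is picked up
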
Ultra2D commented 5 years ago

Tested this using a VM based on template "Debian Wheezy 7.0 (64-bit)" without device_id set.

A clone of that VM could be migrated to XCP-ng 7.6 with the patch installed on the pool master. Migrating another clone to XCP-ng 7.6 without the patch on the pool master results in a stuck VM, so the patch works!

stormi commented 5 years ago

Thanks!

olivierlambert commented 5 years ago

Yay!!

stormi commented 5 years ago

A point of vigilance regarding this update (and the upcoming XCP-ng 8.0, which includes a similar fix by default): might it make the first in-pool (homogeneous pool) migration of a VM without device_id set fail?

(see https://xcp-ng.org/forum/post/9742)

stormi commented 5 years ago

Note: I'm still interested in feedback from the community. Having migration issues is not a requirement for testing the update; installing it and continuing with normal operations is also a way to check that there is no regression. I won't consider pushing it to everyone unless there is enough testing I can rely on.

stormi commented 5 years ago

After finding the time to test it myself, I finally pushed the update. It could have been released months sooner if at least one more person from the community had found the time to test, but I guess we're all busy people :)

stormi commented 5 years ago

I have withdrawn the update. It does work and fixes the issue, but it causes a bug during the pool upgrade (unless one updates all hosts and restarts the toolstack before migrating the VMs, prior to any reboot).