Closed oallart closed 5 years ago
I had the same issue going from XenServer 6.5 to XCP-ng 7.6. All of the PV guests migrated fine, but every single HVM VM was dead after migration. Some had 6.5 tools installed, some had 7.5, some had none at all. All VMs were Linux, various distros and versions. The VMs showed as running, but the console was non-responsive and the guests were soft-locked. Restarting each VM was the only way to get them back.
This is interesting. I had the same issue yesterday while doing some tests related to #90 (I built a version of XAPI that would allow storage motion during rolling pool upgrade) and was wondering if that was because of my nested virtualization setup. Turns out it wasn't.
I'll try to reproduce in XenServer.
I have reproduced in a nested XenServer VM. Unless I failed my test, this proves that the same issue exists in XenServer. The migration was done in Xen Orchestra, so that's not specific to XCP-ng Center or XenCenter.
I have reported the issue to XenServer's team: https://bugs.xenserver.org/browse/XSO-924
Forgot to add, my setup uses shared NFS storage vs the submitter's local storage.
Thanks @stormi
@nicodemus was the migration within a pool being upgraded, or did you create a separate pool for XCP-ng and migrate the VMs from the XS pool to the XCP-ng pool?
It was a pool of two servers being upgraded. I evacuated one node and upgraded it from XenServer 6.5 to XCP-ng 7.6. The issue appeared when trying to migrate the VMs off the remaining XS 6.5 box to the XCP-ng 7.6 box. Every HVM guest migrated 'successfully' but was dead and had to be hard reset. All PV guests migrated without a hitch.
I've made more tests, here are the results:
For the CentOS 7 VM, migrations were made from XS 6.5, XCP-ng 7.4 and XCP-ng 7.5 towards either XS 7.6 or XCP-ng 7.6. Same results in every case. XS 7.6 to XCP-ng 7.6 works fine.
For CentOS 6.6 PV and Windows 7, I only tested migration from XS 6.5 to XS 7.5 and then from XS 7.5 to XS 7.6.
It does not matter whether the VM is migrated from or to local storage or shared storage. What matters is that the VDI has to be migrated, which was the case in all my tests since those were cross-pool migrations.
So in short, the bug is triggered by Xen Storage Motion when moving to an XCP-ng (or XS) 7.6 host, for some HVM guests, correct? (Regardless of cross- or intra-pool?)
Yes, though intra pool will require a modified XAPI because otherwise it won't allow you to migrate with Xen Storage Motion during a pool upgrade (#90).
And "some HVM guests" seems to mean all Linux HVM guests for now, and possibly some others.
XS has reported the reason for the issue and a workaround:
Until this is fixed, to work around this either install the VM from one of the other HVM Linux templates (e.g. CentOS 7) or if the VM already exists, set the device id (xe vm-param-set uuid=... platform:device_id=0001) and reboot the VM before migrating it to XS 7.6.
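The quoted workaround can be sketched as a short script. This is only an illustration: `VM_UUID` is a placeholder you must replace with the UUID from `xe vm-list`, and with `DRY_RUN=1` (the default here) the commands are only printed so the sequence can be reviewed without a XenServer/XCP-ng host.

```shell
# Sketch of the quoted workaround, to be run before migrating to 7.6.
# VM_UUID is a placeholder; DRY_RUN=1 (default) only prints each command.
VM_UUID="${VM_UUID:-<your-vm-uuid>}"
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run xe vm-param-set uuid="$VM_UUID" platform:device_id=0001
run xe vm-reboot uuid="$VM_UUID"
# ...then migrate the VM to the 7.6 host as usual.
```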
I'm not happy with this part: "reboot the VM before migrating". Some people have also said that the issue affects them even with platform:device_id=0001 set.
Indeed, most of our systems are created with "other media" and platform device ID is not set.
Digging into that platform device ID yields https://xenbits.xen.org/docs/4.6-testing/misc/pci-device-reservations.txt which indicates it is a PCI mechanism that has been around for a while. My question is: why is this affecting us now?
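To see whether a given VM is affected, the current value can be read on a host and checked. The helper below is hypothetical (not part of any xe tooling); the commented `xe vm-param-get` call shows how the map value would be read on a real host.

```shell
# Hypothetical helper: given a VM's current platform:device_id value
# ("" when unset), report whether the device_id workaround is needed.
needs_device_id_fix() {
  if [ -z "$1" ]; then echo yes; else echo no; fi
}

# On a real host the value can be read with a map-parameter lookup:
#   xe vm-param-get uuid=<vm-uuid> param-name=platform param-key=device_id
needs_device_id_fix ""      # -> yes (typical for "Other install media" VMs)
needs_device_id_fix "0001"  # -> no
```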
@oallart were your own VMs installed with the "Other install media" template?
@stormi absolutely, and all centos 7.x in my tests. We typically pxe boot our VMs to start with.
It will be fixed in the next releases of XenServer and XCP-ng.
The patch seems to be https://github.com/xapi-project/xenopsd/commit/67e12a1d0d8f141285dfe208fc5d5ca67b6ce6fa
I have built a version with a backport of the patch that should fix this issue for XCP-ng 7.6. I'm not sure I will push it to everyone after the tests, but at least it's available to anyone who finds this bug report, and I'm nevertheless highly interested in testing results from anyone who still has a setup allowing them to test it.
Testing an update candidate basically means installing it and restarting the toolstack:
xe-toolstack-restart
That should be enough. The patch fixes live migration from older releases of XS or XCP-ng towards XCP-ng 7.6, for VMs that don't have platform:device_id set, that is, mostly VMs created with the "Other install media" template.
To install it:
yum update xenopsd xenopsd-xc xenopsd-xenlight --enablerepo='xcp-ng-updates_testing'
To reinstall the previous version:
yum downgrade xenopsd xenopsd-xc xenopsd-xenlight
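The steps above can be combined into one test cycle. This is a sketch, not an official procedure: it assumes it runs as root on an XCP-ng 7.6 host, and with `DRY_RUN=1` (the default here) it only prints each step so the sequence can be reviewed offline.

```shell
# Sketch of one full test cycle for the update candidate on a 7.6 host.
# Assumption: run as root; DRY_RUN=1 (default) only prints each step.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run yum update xenopsd xenopsd-xc xenopsd-xenlight --enablerepo=xcp-ng-updates_testing
run xe-toolstack-restart
run rpm -q xenopsd   # confirm which build is now installed
# If anything misbehaves, roll back to the previous version:
run yum downgrade xenopsd xenopsd-xc xenopsd-xenlight
run xe-toolstack-restart
```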
Tested this using a VM based on template "Debian Wheezy 7.0 (64-bit)" without device_id set.
A clone of that VM could be migrated to XCP-ng 7.6 with the patch installed on the pool master. Migrating another clone to XCP-ng without the patch on the pool master results in a stuck VM, so it works!
Thanks!
Yay!!
A point of vigilance regarding this update (and the upcoming XCP-ng 8.0, which includes a similar fix by default): might it make the first in-pool (homogeneous pool) migration of a VM without device_id set fail?
Note: I'm still interested in feedback from the community (having migration issues is not a requirement for testing the update; installing it and continuing with normal operations is also a way to verify there is no regression). I won't consider pushing it to everyone unless there is enough testing to rely on.
After finding the time to test it myself, I finally pushed the update. It could have been released months sooner if at least one more person from the community had found the time to test, but I guess we're all busy people :)
I have withdrawn the update. It does work and fixes the issue, but it causes a bug during pool upgrades (unless one updates all hosts and restarts the toolstack before migrating any VMs, prior to any reboot).
Situation:
Any OS (tested with multiple; CentOS 7.6 for reference) running in a VM on a pre-7.6 XCP-ng server using local storage works fine. When live-migrating the running VM to a newer XCP-ng 7.6 host, everything seems to work as usual but the VM is dead on arrival: no console (white screen), yet the VM is marked as running. Migration was done in XCP-ng Center.
More detail:
It looks like something changed in 7.6. We haven't found any literature on the issue so far, or any similar issue reported. For the time being we have been forced to revert to 7.5.0-2. Happy to provide more detail on request.