I would recommend starting here: https://github.com/xcp-ng/xcp/wiki/Troubleshooting. dmesg can be helpful to see whether it's kernel/driver related (or something else).
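For example, a quick first pass over the kernel ring buffer from dom0 could look like this (just a generic sketch, not XCP-ng-specific; adjust the patterns to what you are chasing):

# show kernel messages with readable timestamps, keeping only lines that look like problems
dmesg -T | grep -iE 'error|fail|warn|bnx|eth|scsi'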
When the issue is as vague as this, the forum might be a better place to start the investigation :+1: (more people there, this place is more for reporting identified issues).
I am happy to see that I am not the only one; I also had this issue with versions 7.4, 7.6 and now 8.0. On my side the issue started when a user generated 10 VMs from a Windows 10 template I created (he used the self-service feature). Then the first node became unavailable: I can't ping it, I can't SSH into it, and in XOA I got this error: connect EHOSTUNREACH 10.16.2.100:443
The two nodes are PowerEdge R630s with 40 CPUs, 192 GB of RAM each and two 1.6 TB SSDs in RAID 0 (this is a lab, it is OK if things break); all the firmware and BIOS are up to date. When I go to the server console it is lagging, and when rebooting it stays stuck at the splash screen.
I don't understand what is going on...
Only dmesg or other logs will be useful; otherwise I don't know how we could help…
Hi Olivier,
Please find attached the logs folder (with all the logs): log.zip
I am really scratching my head to find out why this is happening.
This issue was also happening on an Alienware X51 R2 (i7 4790K + 16 GB of RAM + 512 GB Samsung 850 Pro SSD).
The only thing the lab servers and the home lab have in common is that I changed the local SR file system to ext4 using your drivers.
Could this cause that kind of issue?
I was able to replicate it: when I start 10 VMs together, the connection is lost. I have no idea why...
My local storage SR is LVM type; I did not change it to ext4. But I too have 7 VMs. I switched from Open vSwitch to Linux Bridge.
This is the log folder, but it is quite big so I shared it from Drive: https://drive.google.com/file/d/1VT716EIXPT0FXWIfnBOXE8AS4cfXPgFv/view
So whenever I have more than 8 to 10 VMs running, I lose the connection with the host and I have to hard reboot it. Any idea what it could be? Regarding the networking, I am giving IP addresses to the VMs through a Linux router/firewall appliance (IPFire).
Can you:
Disabling any power-saving feature and the C-states solved the issue.
I can start all the VMs and the server works just fine...
Do you know why these features are causing issues with XCP-ng?
Is this a bug, or more of a requirement to disable them?...
Can you tell me which BIOS features you disabled exactly, and whether booting 10 VMs at once now works?
So in the BIOS of the first node:
Processor Settings: X2APIC Mode set to Enabled
In System Profile: set the profile to Custom
CPU Power Management set to Performance
C1E set to Disabled
C States set to Disabled
Uncore Frequency set to Maximum
Energy Efficient Policy set to Performance
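For what it's worth, C-state usage can also be inspected (and capped) from dom0 with Xen's xenpm tool; a quick sketch, assuming xenpm is available in the XCP-ng dom0 (it normally ships with Xen), with the BIOS settings remaining the real fix:

# show which C-states each physical CPU is actually entering
xenpm get-cpuidle-states
# optionally cap the deepest C-state at runtime (C1) before touching the BIOS
xenpm set-max-cstate 1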
I can start all the VMs at once as a stress test and it works just fine; before, the host would just stop responding.
Funnily enough, the second node was already configured like this and was not having any issues.
Tonight I could try disabling the power management. I should say that with 7.6 I was having the problem after about one month. With 8.0 updated with the latest patches, the problem happens within almost a day.
Seems like a different problem from the one @dv-longinus has.
I spoke too fast. It was fine for 2 days, then it started again. I had to restart the first node this morning. I really need your help to try to figure out what is going on. I can't find anything special in the logs...
I know it sounds a bit weird, but have you tried running a memtest on this machine?
I was running Proxmox on these hosts with more than 20 running VMs per host; if it was a memory issue I would have seen it with Proxmox. Do you still want me to do a memtest?
Maybe XCP-ng accesses memory a bit differently (i.e. in other "zones"/ranges than Proxmox). We don't have any other report of crashes like that. Especially since you are using a very common server (the Dell R630 is widespread in the XCP-ng world), it seems strange that your problem isn't more common if it were 100% software-related.
There's also the BIOS/firmware, but you said they are already up to date. I'm pinging @Fohdeesha for feedback on Dell issues.
Also, if you can configure a serial output it might be helpful to catch some messages (@rushikeshjadhav might be able to help on this)
Interesting, I will run a memtest on this node. Yes, all the firmware and BIOSes are up to date on both nodes. I will report back once the memtest is done.
The only Xen-related Dell issues I'm aware of were some IOMMU issues back on the 11th gen (e.g. R710). They were remedied in the 12th gen and above, so I know the R630 is good; I know one colo provider running XCP-ng on them without issue, so I'm not sure what could be going on. I would run the Lifecycle Controller to ensure all the firmware is actually up to date (people miss a lot of components doing it manually), and run memtest at least overnight.
I double checked all the firmware, and updated the NIC firmware, everything else was already up to date.
I am now running a memory test using the memory diagnostics in the Lifecycle Controller, then I will go for memtest.
After the Lifecycle memory test and two passes of memtest, no memory errors are shown.
What could cause the server to be unreachable when I launch too many VMs, or when too many VMs are running?
I will try to see if I can reproduce the issue on the second node.
Frankly, I don't know, because I have never encountered that issue before. If you can reproduce it, I'd like to know at which level of parallel VM boots you start to experience the issue.
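For illustration, one way to script such a parallel boot test from dom0 with xe; a sketch that simply starts every halted non-dom0 VM at once, so adjust the filter to the actual test VMs:

# start all halted VMs in parallel and wait for the xe calls to return
for uuid in $(xe vm-list is-control-domain=false power-state=halted --minimal | tr ',' ' '); do
    xe vm-start uuid="$uuid" &
done
wait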
So on the first node, when I launch around 10 VMs, the node becomes unreachable. The VMs are created from sysprepped images of different Windows templates, and they are fast clones of the template.
I was able to reproduce the issue on the second node with around the same amount of VMs running.
Okay, in a way that's good news. If you can reproduce the issue, it means we could find the problem :)
tail -f your logs; is anything specific displayed before it becomes unresponsive?
Very interesting: I tried to create two VMs (Ubuntu 18.04 template and Windows 10 template) as full copies.
The host went down.
Possibly a storage issue; I am in a RAID 0 of two 1.6 TB SAS SSDs using the Dell PERC H730P Mini.
Local storage, right? If the storage is blocking everything, it might indeed be the cause.
Yes, it is local storage. I remember having the same kind of issue on my home server, an Alienware X51 R2. I am trying to generate a lot of VMs to see if anything happens, using tail -f.
iowait could also be interesting, to see if your storage is blocking everything.
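A simple way to watch iowait from dom0 while reproducing, as a sketch using vmstat (usually available by default; the "wa" column is the iowait percentage):

# print CPU/IO stats every 5 seconds; watch the "wa" (iowait) and "b" (blocked processes) columns
vmstat 5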
Storage from node 2:
IOwait is really high before 9:47; if that's when you had the issue, it can explain it.
Are there any drivers that I need to install for the RAID controller? I didn't have this issue with Proxmox 5 or 6; even the IOwait was very low or non-existent.
Anything in the logs? dmesg, daemon.log, kern.log, etc.
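For reference, these all live under /var/log in dom0, so something like this can follow them while reproducing (assuming the default XCP-ng log locations):

# follow the main dom0 logs at once while starting the VMs
tail -f /var/log/kern.log /var/log/daemon.log /var/log/xensource.log /var/log/SMlog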
In daemon.log:
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /control/feature-balloon <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/uncooperative <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/memory-offset <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-min <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/target <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-max <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/initial-reservation <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /data/updated <- None
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.32] watch /memory/uncooperative <-
Oct 22 10:24:30 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/uncooperative <-
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.40] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.49] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.60] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.69] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed[1308]: [1426.78] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed[1308]: [1426.89] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed[1308]: [1426.97] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.10] watch /control/feature-balloon <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /control/feature-balloon <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/uncooperative <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/uncooperative <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/memory-offset <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/memory-offset <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/dynamic-min <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-min <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/target <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/target <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/dynamic-max <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-max <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/initial-reservation <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/initial-reservation <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /data/updated <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /data/updated <- None
Oct 22 10:24:33 Poweredge1 tapback[9637]: backend.c:1114 domain removed, exit
Oct 22 10:24:41 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
You can find a lot of errors like that.
At this point you can see that I am manually restarting the node.
Something is happening, but I don't understand what is causing it. The issue is present on both hosts, and only with XenServer/XCP-ng. Is there any driver or configuration I need on the host side for my setup?
If you look in the kern.log, it seems that the network interface goes completely crazy and throws a lot of errors...
I did a test after looking at the logs.
I removed the NICs from the VMs and started and stopped them several times in a row.
It seems that the issue is related to the network.
The VMs connect to a VM router/firewall appliance which is running IPFire (similar to pfSense). I will try to dig into this.
Broadcom NIC by chance?
bnx2x
:cry:
It's indeed probably the issue here. @stormi do we have any more recent bnx2x driver?
A bit of googling took me here: https://bugs.xenserver.org/secure/attachment/11507/dmesg.txt
So it's probably reported to Citrix somehow. I bet on a Broadcom driver issue/bug for this module version.
Changing the VM NIC from the emulated Realtek model to Intel e1000 seems to stabilize the server. I have changed all the VM and appliance NICs to e1000, and started and restarted all the VMs a few times. Everything is working fine for now. I will be doing a bit more testing to see if it is completely solved.
So it means your VMs aren't using PV drivers but emulated hardware, right?
Yes, they are HVM; is it better to move to PV?
I was talking about PV drivers. A Windows VM can't be fully PV by design (this would require a modified Windows kernel).
If you change the emulated NIC (RTL to e1000), this is not visible in a VM with PV drivers, because it doesn't use any of those but directly a Xen PV driver for the network. This is visible in your Device Manager. Can you double-check there whether you see a Xen device?
If you already have PV drivers installed, then it means that during boot time the VM is using the emulated NIC (just before Windows loads the PV driver), and this causes an issue in the bnx module. A bit weird, but everything is possible with Broadcom :/
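For completeness, the toolstack's view of the PV drivers can also be checked from dom0; a sketch where the UUID is a placeholder and the exact parameter name may vary slightly between releases:

# replace <vm-uuid> with the VM's UUID from `xe vm-list`; an empty value means the guest agent isn't reporting
xe vm-param-get uuid=<vm-uuid> param-name=PV-drivers-version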
Yes the PV drivers are installed on the Windows image.
It seems that a bug report was already filed for XenServer: https://bugs.xenserver.org/browse/XSO-929 It also seems to be going nowhere... If you have more recent drivers for my NIC, that would be great!
I have changed all the network adapters to e1000 and so far no issues, even after stopping and starting VMs several times. I will be conducting more tests. Are there any new drivers available for my NIC cards?
So it might be a bug at the VM boot stage. During the initial boot, before the OS kernel loads the drivers, the VM is started with emulated hardware. So it might be the combination of the emulated RTL NIC + Broadcom chips during the boot phase that triggers your problem.
Note that after the kernel has loaded, the emulated hardware is no longer used.
Our bnx2x driver has version 1.714.24 according to modinfo, which is newer than the default version that comes with the kernel (1.712.30-0), so we could easily switch to that version for testing. We don't have any other, more recent driver available for bnx2x.
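To see what a given host is actually running, something like this from dom0 can help (eth0 is just a placeholder for a bnx2x interface):

# version and path of the module that modprobe would load
modinfo bnx2x | grep -E '^(filename|version)'
# version of the driver currently bound to the interface
ethtool -i eth0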
That might be worth a try, now that @dv-longinus knows how to reproduce the problem.
So, to switch to the default kernel driver:
mv /lib/modules/4.19.0+1/updates/bnx2x.ko{,.save}   # set the updated driver aside so the in-kernel module takes precedence
depmod -a                                           # rebuild the module dependency lists
dracut -f /boot/initrd-4.19.0+1.img 4.19.0+1        # regenerate the initrd so it picks up the in-kernel driver
reboot
To revert to previous state:
mv /lib/modules/4.19.0+1/updates/bnx2x.ko{.save,}   # put the updated driver back in place
depmod -a
dracut -f /boot/initrd-4.19.0+1.img 4.19.0+1
reboot
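After the reboot, a quick check that the in-kernel driver (and not the one from the updates directory) is now in use; eth0 is again a placeholder:

# should point to a path under .../kernel/drivers/... rather than .../updates/bnx2x.ko
modinfo -n bnx2x
# the reported driver version should now be the in-kernel 1.712.30-0
ethtool -i eth0 | grep -i version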
I experienced another crash today; I am currently testing your steps. Let's hope this fixes the issue :)
On XCP-ng 7.6 and 8.0 my setup loses connectivity (even AMT) after a random amount of time, anywhere from one day to one month.
I have an X11SSV-Q paired with a Core i7 7700T. The two Intel NICs (I210 and I219) are bonded together. I installed a PERC H200 in it, which is passed through to one of the VMs.
I looked around, but nothing seems to match my problem exactly, neither for XenServer nor for XCP-ng.