xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Losing connectivity and AMT #292

Closed etomm closed 4 years ago

etomm commented 5 years ago

On XCP-ng 7.6 and 8.0 my setup loses connectivity (even AMT) after a random amount of time, anywhere from one day to one month.

I have an X11SSV-Q paired with a Core i7-7700T. The two NICs, an Intel I210 and an I219, are bonded together. I installed a PERC H200 that is passed through to one of the VMs.

I looked around but nothing seems to match my problem exactly, neither for XenServer nor for XCP-ng.

olivierlambert commented 5 years ago

I would recommend starting here: https://github.com/xcp-ng/xcp/wiki/Troubleshooting. dmesg can be helpful to see if it's kernel/driver related (or something else).
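
For example, a quick first pass over the dom0 logs could look like this (a sketch only, assuming the standard XCP-ng log locations; adjust the grep pattern to your NIC driver):

dmesg -T | grep -iE 'error|fail|warn' | tail -n 50
tail -n 200 /var/log/kern.log /var/log/daemon.log
grep -i error /var/log/xensource.log | tail -n 50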

When the issue is as vague as this, the forum might be a better place to start the investigation :+1: (more people there; this place is more for clearly identified issues).

ghost commented 5 years ago

I am happy to see that I am not the only one; I also had this issue with versions 7.4 and 7.6, and now 8.0. On my side the issue started when a user generated 10 VMs from a Windows 10 template I created (using self-service). Then the first node became unavailable: I can't ping it, I can't SSH in, and in XOA I get this error: connect EHOSTUNREACH 10.16.2.100:443

The two nodes are two PowerEdge R630s with 40 CPUs and 192 GB of RAM each, and two 1.6 TB SSDs in RAID 0 (this is a lab, it is OK if things break); all the firmware and BIOS are up to date. When I go to the server console it is lagging, and when I reboot it, it stays stuck at the splash screen.

I don't understand what is going on...

olivierlambert commented 5 years ago

Only dmesg or other logs will be useful, otherwise I don't know how we could help…

ghost commented 5 years ago

Hi Olivier,

Please find attached the logs folder (with all the logs): log.zip

I am really scratching my head to find out why this is happening.

This issue was also happening on an Alienware X51 R2 (i7-4790K + 16 GB of RAM + 512 GB Samsung 850 Pro SSD).

The only thing the lab servers and the home lab have in common is that I changed the local SR file system to ext4 using your drivers.

Could this cause that kind of issue?

ghost commented 5 years ago

I was able to replicate it: when I start 10 VMs together, the connection is lost. I have no idea why...

etomm commented 5 years ago

My local storage SR is LVM type; I did not change it to ext4. But I too have 7 VMs. I switched from Open vSwitch to Linux Bridge.
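
For reference, a hedged way to check (and switch back) the network backend on an XCP-ng host, assuming the stock tooling is present:

cat /etc/xensource/network.conf          # prints "bridge" or "openvswitch"
xe-switch-network-backend openvswitch    # switch back to Open vSwitch, then reboot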

This is the log folder; it is quite big, so I shared it from Drive: https://drive.google.com/file/d/1VT716EIXPT0FXWIfnBOXE8AS4cfXPgFv/view

ghost commented 5 years ago

So each time I have more than 8 to 10 VMs running, I lose the connection to the host and have to hard-reboot it. Any idea what it could be? Regarding networking, I am giving IP addresses to the VMs through a Linux router/firewall appliance (IPFire).

olivierlambert commented 5 years ago

Can you:

ghost commented 5 years ago

Disabling all power-saving features and the C-states solved the issue.

I can start all the VMs and the server works just fine...

Do you know why these features were causing issues with XCP-ng?

Is this a bug, or more a requirement to disable them?

olivierlambert commented 5 years ago

Can you tell me exactly which BIOS features you disabled, and whether booting 10 VMs at once now works?

ghost commented 5 years ago

So in the BIOS of the first node, under Processor Settings: X2APIC Mode set to Enabled.

In System Profile Settings, set the profile to Custom, then:

  - CPU Power Management: Performance
  - C1E: Disabled
  - C States: Disabled
  - Uncore Frequency: Maximum
  - Energy Efficient Policy: Performance

I can start all the VMs at once as a stress test and it works just fine; before, the host would just stop responding.

Funny enough, the second node was already configured like this and was not having any issue.
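
For reference, a hedged sketch of how C-states can be inspected and limited from dom0 on Xen, assuming the xenpm tool and the Xen max_cstate boot parameter are available on your build (the xen-cmdline helper path below is the usual XenServer/XCP-ng location; double-check it on your host):

xenpm get-cpuidle-states                                      # show which C-states Xen is currently using
/opt/xensource/libexec/xen-cmdline --set-xen max_cstate=1     # limit Xen to C1 (assumes the helper exists at this path)
reboot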

etomm commented 5 years ago

Tonight I will be able to try disabling power management. I should say that with 7.6 I was having the problem after about one month; with 8.0, updated with the latest patches, the problem happens within about one day.

It seems to be a different problem from @dv-longinus's.

ghost commented 5 years ago

I spoke too fast. It was fine for 2 days, then it started again. I had to restart the first node this morning. I really need your help to figure out what is going on; I can't find anything special in the logs...

olivierlambert commented 5 years ago

I know it sounds a bit weird but have you tried to run a memtest on this machine?

ghost commented 5 years ago

I was running Proxmox on this host with more than 20 running VMs per host; if it were a memory issue I would have seen it with Proxmox. Do you still want me to do a memtest?

olivierlambert commented 5 years ago

Maybe XCP-ng accesses memory a bit differently (i.e. in other "zones"/ranges than Proxmox). We don't have any other reports of crashes like that. Especially since you are using a very common server (the Dell R630 is widespread in the XCP-ng world), it seems strange that your problem isn't more common if it were 100% software related.

There's also the BIOS/firmware, but you said those are already up to date. I'm pinging @Fohdeesha for feedback on Dell issues.

Also, if you can configure a serial output it might be helpful to catch some messages (@rushikeshjadhav might be able to help on this)
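
A minimal sketch of enabling a serial console on the Xen side, assuming COM1 at 115200 baud and the usual xen-cmdline helper; the exact port and settings depend on your hardware (on a Dell, iDRAC serial redirection can also be used):

/opt/xensource/libexec/xen-cmdline --set-xen com1=115200,8n1   # assumes the helper is present at this path
/opt/xensource/libexec/xen-cmdline --set-xen console=com1,vga
reboot
# then capture the output with a null-modem cable or the iDRAC serial console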

ghost commented 5 years ago

Interesting, I will run a memtest on this node. Yes, all the firmware and BIOSes are up to date on both nodes. I will report back once the memtest is done.

Fohdeesha commented 5 years ago

The only Xen-related Dell issues I'm aware of were some IOMMU issues back on the 11th gen (e.g. R710). They were remedied in the 12th gen and above, so I know the R630 is good; I know one colo provider running XCP-ng on them without issue, so I'm not sure what could be going on. I would run the Lifecycle Controller to ensure all the firmware is actually up to date (people miss a lot of components doing it manually), and run memtest at least overnight.

ghost commented 5 years ago

I double-checked all the firmware and updated the NIC firmware; everything else was already up to date.

I am now running a memory test using the memory diagnostic in the Lifecycle Controller, then I will go for memtest.

ghost commented 5 years ago

After the Lifecycle Controller memory test and two passes of memtest, no memory errors were shown.

What could cause the server to be unreachable when I launch too many VMs, or when too many VMs are running?

I will try to see if I can reproduce the issue on the second node.

olivierlambert commented 5 years ago

Frankly, I don't know, because I have never encountered that issue before. If you can reproduce it, I'd like to know at what number of parallel VM boots you start to experience the issue.

ghost commented 5 years ago

So on the first node, when I launch around 10 VMs, the node becomes unreachable. The VMs are created from sysprepped images of different Windows templates and are fast clones of those templates.

ghost commented 5 years ago

I was able to reproduce the issue on the second node with around the same number of VMs running.

olivierlambert commented 5 years ago

Okay, in a way that's good news. If you manage to reproduce the issue, it means we can find the problem :)

  1. Does the problem start exactly when you boot the VMs, or during the VM OS load phase?
  2. Via XO Hub, you can try to install our Alpine Linux template, which is small. Make 10 fast clones and try to boot them. This way we can see whether the same problem arises.
  3. During the boot, if you tail -f your logs, is anything specific displayed before the host becomes unresponsive? (See the sketch below.)
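
A minimal way to follow the relevant dom0 logs live while the VMs boot (standard XCP-ng log paths assumed):

tail -f /var/log/kern.log /var/log/daemon.log /var/log/xensource.log /var/log/SMlog
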
ghost commented 5 years ago

Very interesting: I tried to create two VMs (an Ubuntu 18.04 template and a Windows 10 template) as full copies.

The host went down.

Possibly a storage issue: I am running a RAID 0 of two 1.6 TB SAS SSDs on the Dell PERC H730P Mini.

olivierlambert commented 5 years ago

Local storage, right? If the storage is blocking everything, it might indeed be the cause.

ghost commented 5 years ago

Yes, it is local storage. I remember having the same kind of issue on my home server, an Alienware X51 R2. I am trying to generate a lot of VMs to see if anything happens, using tail -f.

olivierlambert commented 5 years ago

iowait could also be interesting, to see whether your storage is blocking everything.
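
A few hedged ways to watch iowait from dom0 (top and vmstat should be present; iostat comes from the sysstat package, which may not be installed by default):

top             # the "wa" value on the CPU line is iowait
vmstat 2        # watch the "wa" column
iostat -x 2     # per-device stats, if sysstat is installed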

ghost commented 5 years ago

(Screenshot 2019-10-22_09-55-48: storage metrics from node 2)

olivierlambert commented 5 years ago

iowait is really high before 9:47; if that is when you had the issue, it can explain it.

ghost commented 5 years ago

Are there any drivers that I need to install for the RAID controller? I didn't have this issue with Proxmox 5 or 6; even the iowait was very low or non-existent.

olivierlambert commented 5 years ago

Anything in the logs? dmesg, daemon.log, kern.log etc.

ghost commented 5 years ago

In daemon.log:

Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /control/feature-balloon <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/uncooperative <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/memory-offset <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-min <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/target <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-max <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/initial-reservation <- None
Oct 22 10:24:16 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /data/updated <- None
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed[1308]: [1415.81] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:20 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.32] watch /memory/uncooperative <-
Oct 22 10:24:30 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/uncooperative <-
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.40] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.49] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.60] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed[1308]: [1426.69] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:30 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed[1308]: [1426.78] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed[1308]: [1426.89] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed[1308]: [1426.97] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:31 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.10] watch /control/feature-balloon <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /control/feature-balloon <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/uncooperative <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/uncooperative <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/memory-offset <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/memory-offset <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/dynamic-min <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-min <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/target <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/target <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/dynamic-max <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/dynamic-max <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /memory/initial-reservation <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /memory/initial-reservation <- None
Oct 22 10:24:33 Poweredge1 squeezed[1308]: [1429.11] watch /data/updated <- None
Oct 22 10:24:33 Poweredge1 squeezed: [debug|Poweredge1|3 ||xenops] watch /data/updated <- None
Oct 22 10:24:33 Poweredge1 tapback[9637]: backend.c:1114 domain removed, exit
Oct 22 10:24:41 Poweredge1 squeezed: [error|Poweredge1|2 ||xenops] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 3 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 13 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 12 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 5 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 4 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 9 /memory/uncooperative = failed: Xs_protocol.Enoent("read")
Oct 22 10:24:41 Poweredge1 squeezed[1308]: [1436.99] xenstore-write 10 /memory/uncooperative = failed: Xs_protocol.Enoent("read")

You can find a lot of errors like that.

At this point you can see that I am manually restarting the node.

ghost commented 5 years ago

Attachments: kern.log, daemon.log, dmesg.log

Something is happening, but I don't understand what is causing it. The issue is present on both hosts, and only with XenServer/XCP-ng. Is there any driver or configuration I need on the host side for my setup?

ghost commented 5 years ago

If you look in kern.log, it seems that the network interface goes completely crazy and throws a lot of errors...

ghost commented 5 years ago

I did a test after looking at the logs.

I removed the NICs from the VMs, then started and stopped them several times in a row.

It seems that the issue is related to the network.

The VMs connect to a VM router/firewall appliance running IPFire (similar to pfSense). I will try to dig into this.

Fohdeesha commented 5 years ago

Broadcom NIC by chance?

olivierlambert commented 5 years ago

bnx2x :cry:

It's indeed probably the issue here. @stormi do we have any more recent bnx2x driver?

olivierlambert commented 5 years ago

A bit of googling took me here: https://bugs.xenserver.org/secure/attachment/11507/dmesg.txt

So it has probably been reported to Citrix somehow. I bet on a Broadcom driver issue/bug for this module version.
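
To see which bnx2x driver and firmware the host is actually running (ethtool and modinfo should both be available in dom0):

ethtool -i eth0                      # driver, version and firmware-version of the physical NIC
modinfo bnx2x | grep -i '^version'   # version of the module that would be loaded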

ghost commented 5 years ago

After changing the VM NIC from the Realtek (RTL8139) to the Intel e1000, the server seems to have stabilized. I have changed all the VM and appliance NICs to e1000, and started and restarted all the VMs a few times. Everything is working fine for now. I will be doing a bit more testing to see if it is completely solved.

olivierlambert commented 5 years ago

So it means your VMs aren't using PV drivers but emulated hardware, right?

ghost commented 5 years ago

Yes, they are PVHVM; is it better to move to PV?

olivierlambert commented 5 years ago

I was talking about PV drivers. A Windows VM can't be fully PV by design (that would require a modified Windows kernel).

If you change the emulated NIC (RTL to e1000), this is not visible in a VM with PV drivers, because it doesn't use either of those but rather the Xen PV network driver directly. This is visible in your Device Manager. Can you double-check there whether you see a Xen device?

If you already have PV drivers installed, then it means the VM uses the emulated NIC during boot time (just before Windows loads the PV driver), and this causes an issue in the bnx module. A bit weird, but everything is possible with Broadcom :/
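
A hedged way to cross-check from dom0 whether the toolstack sees the PV drivers (parameter names vary a bit between releases; PV-drivers-version is long-standing, PV-drivers-detected or PV-drivers-up-to-date depend on the version):

xe vm-param-get uuid=<vm-uuid> param-name=PV-drivers-version
xe vm-param-get uuid=<vm-uuid> param-name=PV-drivers-detected   # or PV-drivers-up-to-date on older releases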

ghost commented 5 years ago

Yes, the PV drivers are installed on the Windows image. (Screenshot 2019-10-22_15-09-18)

It seems that a bug report was already filed for XenServer: https://bugs.xenserver.org/browse/XSO-929. It also seems to be going nowhere... If you have more recent drivers for my NICs, that would be great!

ghost commented 5 years ago

I have changed all the network adapters to e1000 and so far no issues, even after stopping and starting the VMs several times. I will be conducting more tests. Are there any newer drivers available for my NIC cards?

olivierlambert commented 5 years ago

So it might be a bug at the VM boot stage. During the initial boot, before the OS kernel loads the drivers, the VM is started with emulated hardware. So it might be the combination of the emulated RTL NIC plus the Broadcom chips during the boot phase that triggers your problem.

Note that once the kernel has loaded, the emulated hardware is no longer used.

stormi commented 5 years ago

Our bnx2x driver has version 1.714.24 according to modinfo, whereas the default driver that comes with the kernel is 1.712.30-0, so we could easily switch to the latter for testing. We don't have any more recent driver available for bnx2x.

olivierlambert commented 5 years ago

That might be worth a try, now that @dv-longinus knows how to reproduce the problem.

stormi commented 5 years ago

So, to switch to the default kernel driver:

mv /lib/modules/4.19.0+1/updates/bnx2x.ko{,.save}
depmod -a
dracut -f /boot/initrd-4.19.0+1.img 4.19.0+1
reboot

To revert to previous state:

mv /lib/modules/4.19.0+1/updates/bnx2x.ko{.save,}
depmod -a
dracut -f /boot/initrd-4.19.0+1.img 4.19.0+1
reboot
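
After the reboot, a quick hedged check that the expected bnx2x module is the one in use:

modinfo -n bnx2x                              # path of the module that will be loaded
modinfo bnx2x | grep -i '^version'            # its version string
ethtool -i eth0 | grep -iE 'driver|version'   # what the running NIC actually reports
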
ghost commented 5 years ago

I experienced another crash today; I am currently testing your steps. Let's hope this fixes the issue :)