Closed hkparker closed 4 years ago
Could that be connected to discard? There have been some problems where unmapping/discarding sectors on SSDs made them freeze. Did you try to look for a firmware update?
Interesting thought, I'll check the firmware version. I went to check it and noticed that the device has completely disappeared from the system. I'm waiting on a VM running zfs (I'm passing through an HBA) to finish a scrub before I reboot. After the reboot I should be able to see the device again and investigate. I would be somewhat surprised if it's the firmware, since I doubt Xen does anything specific in how it interacts with the disk (and I would expect these drives to be stable).
Anyway, I checked dmesg and noticed this from when the device disappeared:
[65056.815294] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[65060.797874] nvme 0000:04:00.0: Refused to change power state, currently in D3
[65060.815452] xen: registering gsi 32 triggering 0 polarity 1
[65060.815473] Already setup the GSI :32
[65060.937775] nvme nvme0: Removing after probe failure status: -19
[65060.950019] print_req_error: I/O error, dev nvme1n1, sector 895222784
[65060.950022] print_req_error: I/O error, dev nvme1n1, sector 438385288
[65060.950040] print_req_error: I/O error, dev nvme1n1, sector 223301496
[65060.950072] print_req_error: I/O error, dev nvme1n1, sector 256912800
[65060.950077] print_req_error: I/O error, dev nvme1n1, sector 189604552
[65060.950085] print_req_error: I/O error, dev nvme1n1, sector 390062504
[65060.950087] print_req_error: I/O error, dev nvme1n1, sector 453909496
[65060.950099] print_req_error: I/O error, dev nvme1n1, sector 453915072
[65060.950102] print_req_error: I/O error, dev nvme1n1, sector 246194176
[65060.950107] print_req_error: I/O error, dev nvme1n1, sector 246194288
[65061.030575] nvme nvme0: failed to set APST feature (-19)
Searching around these logs led me to this (which led me to this), so it looks like someone else has experienced this issue as well! Unfortunately the forum post was never resolved, but this increases my confidence that there is a software issue with 2TB NVMe drives.
Xen itself is probably doing nothing. Most likely it's a kernel or driver problem then. I see neither Xen nor tapdisk being responsible for a disappearing or stuck device.
Did you test kernel-alt? It's known to solve some bugs.
I have not, I'll try it next boot. Unfortunately I have no reliable way to cause this issue to test if any change fixes it. I'd have to boot the new kernel and leave it running for a couple weeks before deciding that was enough time that it must have fixed it.
Finding the causal difference between the stock and alt kernel would probably be a task for one of the developers anyway. ;-) But it's clear that it would take some time to observe, if the error doesn't come back quickly. It's always easier to say a change didn't fix a 'non-reproducible' error than to say it fixed it.
I have also posted in the XCP-ng forum, but I'm not sure this is related to XCP-ng.
We recently had a similar problem on a client server, a Dell R720 running XCP 8.0. Apart from the lockups in dmesg, "nvme list" does not show the "bad" drive. The NVMe drives are Samsung 970 Pro 512GB, and at the moment of failure iDRAC logged this:
Fri Aug 07 2020 08:21:19 A bus fatal error was detected on a component at bus 64 device 3 function 0.
Fri Aug 07 2020 08:21:17 A bus fatal error was detected on a component at slot 3.
This is the third time it has happened this year. We moved the drive to a different PCIe slot and, as you can see, it happened again. After a server reboot everything seems to work fine, but it's hard to pick a good time to reboot or do maintenance, as the server must be running 24/7.
Very likely a hardware problem. Try again to reseat the device and update its firmware.
I just confirmed the firmware version is the latest.
As much as I'd like this to be a hardware issue, I'm pretty sure it's not, considering that I've tried this with two separate drives (reseating them multiple times) and the odds of getting two in a row that fail the same way seem lower than the likelihood that there's a bug somewhere in the stack. Plus, everyone else is seeing this too.
I think your forum post, @olivierlambert, about it likely being a driver or kernel bug is most likely correct. What's interesting about that is that I don't see this issue with my older 512GB 950 Pro. I'd imagine the driver is the same for both devices (nvme), so I speculate it either has to do with the capacity, or the kernel is doing something different with the newer device that I don't understand. I'll try the alt kernel next, reach out to Samsung, and continue researching this elsewhere. If I find anything I'll post here.
Possibly related. The arch wiki alludes to this as well.
On my host:
[11:50 localhost ~]# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000
Perhaps I should try this first. If it is a power management thing it might be smart to automatically add this kernel parameter in future versions.
I added the kernel parameter and rebooted. Since I can't cause this issue repeatably it's going to be hard to say if this works as a fix, but I've never made it more than 5 days without the drive disappearing, so I'd say if I can make it two weeks I'd call that fixed. I'll report back here if I learn anything else as I continue to look into this and talk to vendors.
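For reference, here's a sketch of how that parameter can be made persistent on an XCP-ng 8.x dom0. The grub.cfg path and the module2 line layout are assumptions from a typical install, not something verified in this thread, so check your own config before editing:

```shell
# Back up the bootloader config first.
cp /boot/grub/grub.cfg /boot/grub/grub.cfg.bak

# On XCP-ng, Xen boots via the multiboot2 line and the dom0 kernel's
# command line is the "module2 ... vmlinuz ..." line; append the
# parameter there for every boot entry.
sed -i '/module2.*vmlinuz/ s/$/ nvme_core.default_ps_max_latency_us=0/' \
    /boot/grub/grub.cfg

# After a reboot, confirm the running value:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```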
We have reseated both drives the last two times the issue appeared (the last time I was remote and issued only a reboot). I've issued
echo 5500 > /sys/module/nvme_core/parameters/default_ps_max_latency_us
on the server and now we'll wait to see if the error will come up again.
PS: default_ps_max_latency_us was 100000 at boot.
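One caveat worth flagging (this is my understanding of the driver, not something confirmed in this thread): nvme_core reads default_ps_max_latency_us when it configures APST at controller probe/reset time, so writing the sysfs parameter on a live system may not affect a controller that is already bound. A sketch of re-applying it without a full reboot (the PCI address is the example one from the dmesg output in this thread; adjust for your system, and note the device briefly disappears, so don't do this with an active SR on it):

```shell
# Change the module parameter, then rebind the controller so the driver
# re-runs its APST setup with the new value. Paths are standard sysfs.
echo 5500 > /sys/module/nvme_core/parameters/default_ps_max_latency_us
echo 0000:04:00.0 > /sys/bus/pci/drivers/nvme/unbind
echo 0000:04:00.0 > /sys/bus/pci/drivers/nvme/bind

# Optionally confirm what got programmed (needs nvme-cli; feature 0x0c
# is Autonomous Power State Transition):
nvme get-feature /dev/nvme0 -f 0x0c -H
```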
We have another client with a Dell R720 and 2 x 1TB Samsung 970 Pro NVMe drives showing the same behavior, but without the iDRAC warning, only errors in dmesg, like the other one:
[5889534.306386] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[5889555.706907] nvme 0000:04:00.0: Refused to change power state, currently in D3
[5889557.107146] nvme nvme0: Removing after probe failure status: -19
[5889557.136469] print_req_error: I/O error, dev nvme1n1, sector 646145507
[5889557.136470] print_req_error: I/O error, dev nvme1n1, sector 652012968
[5889557.136483] zio pool=a-nvme vdev=/dev/disk/by-id/nvme-Samsung_SSD_970_PRO_1TB_S462NF0M503548M-part1 error=5 type=2 offset=333829591040 size=49152 flags=40080c80
One difference is the iDRAC firmware version: the first client's server is a little older, and the second server is busier than the first. On another Dell R720xd with the same firmware as the second server and 2 x Plextor 1TB PX-1TM8PeY NVMe drives, this has never occurred in the few years that server has been running.
As this seems to be a Samsung NVMe controller issue ("Refused to change power state, currently in D3"), we will update the iDRAC firmware on the first server, set default_ps_max_latency_us to 5500 on both Dell servers, and wait to see if the problem comes back.
Edit: I was editing the code section, but failed to insert a new line after every line.
You need to use the correct Markdown syntax with three ticks (I edited your post)
Thank you!
It works
nvme_core.default_ps_max_latency_us=5500
disables the lowest power state, according to the Arch wiki, while nvme_core.default_ps_max_latency_us=0
disables APST entirely. I'm curious if anyone here has a strong opinion on choosing one over the other? I disabled it entirely, which I would think is fine, but I'm wondering if there's some side effect of that I'm not aware of, perhaps a shorter lifespan? My drive is apparently rated for a mean time to failure of ~117 years, so... I probably don't have to worry.
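For anyone wondering why 5500 behaves that way: as I understand the kernel's heuristic (simplified here, and the latency numbers below are illustrative, not from a real id-ctrl dump), a non-operational power state is only auto-entered when its entry plus exit latency fits under default_ps_max_latency_us. A sketch:

```shell
#!/bin/sh
# Simplified model of nvme_core's APST eligibility check: a power state
# is allowed only if entering and leaving it fits within the latency cap.
max_latency_us=5500

apst_eligible() {
    # $1 = state number, $2 = entry latency (us), $3 = exit latency (us)
    total=$(( $2 + $3 ))
    if [ "$total" -le "$max_latency_us" ]; then
        echo "PS$1: eligible (enlat+exlat = ${total}us)"
    else
        echo "PS$1: skipped (enlat+exlat = ${total}us)"
    fi
}

# Illustrative values for a shallow and a deep sleep state:
apst_eligible 3 2000 2000   # fits under 5500us -> kept
apst_eligible 4 500 7000    # deepest state -> excluded at 5500us
```

With a cap of 0 nothing fits, which is why that value disables APST outright.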
This amusing commit message gives the impression that there are a lot of board/drive combos out there suffering from this. Is there somewhere we should be reporting these known incompatibilities? The Linux mailing list?
I've been installing some other stuff in my hypervisor so I haven't started my two-week stability test yet. That begins today, I've got a good feeling but we'll see how it goes.
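Since the failure window is days long, a cheap watchdog in dom0 could catch the signature early instead of waiting for VMs to lock up. This is just my own sketch (the function name and log pattern are mine, based on the dmesg lines in this thread):

```shell
#!/bin/sh
# Scan kernel log text for the controller-down signature seen in this issue.
check_nvme_down() {
    # $1: log text to scan (normally "$(dmesg)")
    if printf '%s\n' "$1" | grep -q 'nvme.*controller is down'; then
        echo "NVME-DOWN"
    else
        echo "OK"
    fi
}

# Demo against the exact line from this issue:
check_nvme_down 'nvme nvme0: controller is down; will reset: CSTS=0xffffffff'
```

Cron something like `check_nvme_down "$(dmesg)"` and alert on NVME-DOWN.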
Forgot to post, here's some good high-level reading on NVMe power states.
Disabling power states mostly increases energy usage for lightly used devices. That's it. There's (usually) no relevant lifespan reduction, etc. Power saving has a whole history of problems: hardware, firmware, driver bugs... the list is as big as the Bible, I guess.
Thanks for the feedback @hkparker, it seems there are indeed some funny things happening in this case. The interesting thing is that since it's due to the interaction between the dom0 kernel and this hardware, it's outside the virtualization scope… Not where I usually hear about problems :D
Hah, yeah. It appears that this issue had nothing to do with xcp-ng, just the wrong combination of NVMe drive and kernel. I just happened to experience it in my xcp-ng install. I'll close this out once I can confirm it was the power state issue with a couple weeks of stability, but I think we've made a really solid case for that being the root cause. Hopefully anyone else running xcp-ng who comes across this will be able to find this thread on google and get it resolved as well.
Thanks again :+1: You could also contribute directly and add a section in the https://xcp-ng.org/docs/troubleshooting.html page (at the bottom, there's a link to edit it directly).
This way, the official doc will be improved with this, and other people won't have to rely on a Google search to know what's going on :+1:
Yeah, I'd love to help with the docs. xcp-ng is such an important part of my project, any way I can give back is nice. I'll start with the USB controller issue we were discussing on the forum, since I'm still waiting on some uptime with this drive before I'm 100% confident we figured it out. Should have something up in a couple days.
Great, thanks!
Alright, opened two PRs for documentation, this NVMe issue and one for the USB PCI passthrough thing. We can discuss any changes you'd like to see on those PRs, so I think I can close this out. Thanks to everyone for helping with the debug process!
Hey everyone,
I'm seeing frequent lockups of an lvm NVMe SR in xcp-ng 8.1. I'm getting pretty confident this is a software issue. For a little context, here's the evolution of my setup...
I was running xcp-ng 7.6 for about a year with my VMs stored on an lvm SR created on a 512GB Samsung 950 PRO connected to my motherboard via an Addonics M2 adapter. This was rock solid.
A month or so ago I upgraded my installation to xcp-ng 8.1, which went flawlessly. Had a couple weeks of stability but then decided I wanted to expand my storage capabilities and migrate my VMs onto a new SSD. I purchased a StarTech PEX8M2E2 and a 2TB Samsung 970 EVO PLUS. I got both drives (the old 512GB and the new 2TB) installed on the new StarTech card, created a new local lvm SR on the 2TB drive, and migrated VMs over. At first everything seemed great.
A couple days into this setup, however, all my VMs locked up and I started seeing tapdisk errors. Here's a more recent sample:
A few minutes later you start to see stuff like this
and this continues to log until a reboot, at which point everything comes back up fine (for a couple days).
I suspected the StarTech card and began running some experiments, mixing and matching hardware to isolate where the issue was. I found that the new 2TB drive is unstable in the old Addonics card, it would lock up with the same errors after a couple days. When I had some VMs running on the old 512GB drive and some on the new 2TB drive, both connected via the StarTech card, only the VMs on the new 2TB drive would lock up after a couple days. So it's not the StarTech card, it's the 2TB NVMe drive.
I RMA'd it and had it replaced, thinking I was just unlucky enough to get a bad drive, but the behavior persists with the new drive. For some reason, xcp-ng 8.1 with an lvm 2TB NVMe SR appears to freeze the disk with tapdisk errors after a couple days. I'm at a loss here; I'm not sure what would be causing this or how to proceed. I really doubt I got a bad drive that fails in the same way twice in a row.
Anyone else seen something like this? Any recommendations for how to look into this further? If we can drive this to completion I'm happy to pay for a pro support plan for this issue.