Closed hkparker closed 4 years ago
Could that be connected to discard? There have been some problems where unmapping/discarding sectors on SSDs made them freeze. Did you try to look for a firmware update?
Interesting thought, I'll check the firmware version. I went to check it and noticed that the device has completely disappeared from the system. I'm waiting on a VM running zfs (I'm passing through an HBA) to finish a scrub before I reboot. After the reboot I should be able to see the device again and investigate. I would be somewhat surprised if it's the firmware, since I doubt Xen does anything specific in how it interacts with the disk (and I would expect these drives to be stable).
Anyway, I checked dmesg and noticed this from when the device disappeared:
[65056.815294] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[65060.797874] nvme 0000:04:00.0: Refused to change power state, currently in D3
[65060.815452] xen: registering gsi 32 triggering 0 polarity 1
[65060.815473] Already setup the GSI :32
[65060.937775] nvme nvme0: Removing after probe failure status: -19
[65060.950019] print_req_error: I/O error, dev nvme1n1, sector 895222784
[65060.950022] print_req_error: I/O error, dev nvme1n1, sector 438385288
[65060.950040] print_req_error: I/O error, dev nvme1n1, sector 223301496
[65060.950072] print_req_error: I/O error, dev nvme1n1, sector 256912800
[65060.950077] print_req_error: I/O error, dev nvme1n1, sector 189604552
[65060.950085] print_req_error: I/O error, dev nvme1n1, sector 390062504
[65060.950087] print_req_error: I/O error, dev nvme1n1, sector 453909496
[65060.950099] print_req_error: I/O error, dev nvme1n1, sector 453915072
[65060.950102] print_req_error: I/O error, dev nvme1n1, sector 246194176
[65060.950107] print_req_error: I/O error, dev nvme1n1, sector 246194288
[65061.030575] nvme nvme0: failed to set APST feature (-19)
Searching around these logs led me to this (which led me to this), so it looks like someone else has experienced this issue as well! Unfortunately the forum post was never resolved, but this increases my confidence that there is a software issue with 2TB NVMe drives.
Xen itself is probably doing nothing. Most likely it's a kernel or driver problem then. I see neither Xen nor tapdisk being responsible for a disappearing or stuck device.
Did you test kernel-alt? It's known to solve some bugs.
I have not, I'll try it next boot. Unfortunately I have no reliable way to cause this issue to test if any change fixes it. I'd have to boot the new kernel and leave it running for a couple weeks before deciding that was enough time that it must have fixed it.
Finding the causal difference between the stock and alt kernel would probably be a task for one of the developers anyway. ;-) But it's clear that it would take some time to observe, if the error doesn't come back quickly. It's always easier to say a change didn't fix a 'non-reproducible' error than to say it fixed it.
I have also posted in the XCP-ng forum, but I'm not sure this is related to XCP-ng.
We recently had a similar problem on a client server, a Dell R720 running XCP 8.0. Apart from the lockups in dmesg, "nvme list" does not show the "bad" drive. The NVMe drives are Samsung 970 Pro 512GB, and at the moment of failure iDRAC logged this:
Fri Aug 07 2020 08:21:19 A bus fatal error was detected on a component at bus 64 device 3 function 0.
Fri Aug 07 2020 08:21:17 A bus fatal error was detected on a component at slot 3.
This is the third time it has happened this year. We moved the drive to a different PCIe slot and, as you can see, it happened again. After a server reboot everything seems to work fine, but it's hard to pick a good time to reboot or do maintenance, as the server must be running 24/7.
Very likely a hardware problem. Try again to reseat the device and update its firmware.
I just confirmed the firmware version is the latest.
As much as I'd like this to be a hardware issue, I'm pretty sure it's not, considering that I've tried this with two separate drives (reseating them multiple times) and the odds of getting two in a row that fail the same way seem lower than the likelihood that there's a bug somewhere in the stack. Plus, everyone else is seeing this too.
I think your forum post, @olivierlambert, about it likely being a driver or kernel bug is most likely correct. What's interesting about that is that I don't see this issue with my older 512GB 950 Pro. I'd imagine the driver is the same for both devices (nvme), so I speculate it either has to do with the capacity, or the kernel is doing something different with the newer device that I don't understand. I'll try the alt kernel next, reach out to Samsung, and continue researching this elsewhere. If I find anything I'll post here.
Possibly related. The arch wiki alludes to this as well.
On my host:
[11:50 localhost ~]# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000
Perhaps I should try this first. If it is a power management thing it might be smart to automatically add this kernel parameter in future versions.
I added the kernel parameter and rebooted. Since I can't cause this issue repeatably it's going to be hard to say if this works as a fix, but I've never made it more than 5 days without the drive disappearing, so I'd say if I can make it two weeks I'd call that fixed. I'll report back here if I learn anything else as I continue to look into this and talk to vendors.
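For reference, here's a sketch of how that parameter can be made persistent on an XCP-ng 8.x dom0. The grub.cfg path and the module2 line layout are assumptions from a typical install, not something verified in this thread, so check your own config before editing:

```shell
# Back up the bootloader config first.
cp /boot/grub/grub.cfg /boot/grub/grub.cfg.bak

# On XCP-ng, Xen boots via the multiboot2 line and the dom0 kernel's
# command line is the "module2 ... vmlinuz ..." line; append the
# parameter there for every boot entry.
sed -i '/module2.*vmlinuz/ s/$/ nvme_core.default_ps_max_latency_us=0/' \
    /boot/grub/grub.cfg

# After a reboot, confirm the running value:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```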
We have reseated both drives the last two times the issue appeared (the last time I was remote and issued only a reboot). I've issued
echo 5500 > /sys/module/nvme_core/parameters/default_ps_max_latency_us
on the server and now we'll wait to see if the error will come up again.
PS: default_ps_max_latency_us was 100000 at boot.
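One caveat worth flagging (this is my understanding of the driver, not something confirmed in this thread): nvme_core reads default_ps_max_latency_us when it configures APST at controller probe/reset time, so writing the sysfs parameter on a live system may not affect a controller that is already bound. A sketch of re-applying it without a full reboot (the PCI address is the example one from the dmesg output in this thread; adjust for your system, and note the device briefly disappears, so don't do this with an active SR on it):

```shell
# Change the module parameter, then rebind the controller so the driver
# re-runs its APST setup with the new value. Paths are standard sysfs.
echo 5500 > /sys/module/nvme_core/parameters/default_ps_max_latency_us
echo 0000:04:00.0 > /sys/bus/pci/drivers/nvme/unbind
echo 0000:04:00.0 > /sys/bus/pci/drivers/nvme/bind

# Optionally confirm what got programmed (needs nvme-cli; feature 0x0c
# is Autonomous Power State Transition):
nvme get-feature /dev/nvme0 -f 0x0c -H
```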
We have another client with a Dell R720 and 2 x 1TB Samsung 970 Pro NVMe drives showing the same behavior, but without the iDRAC warning, only errors in dmesg, like the other one:
[5889534.306386] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[5889555.706907] nvme 0000:04:00.0: Refused to change power state, currently in D3
[5889557.107146] nvme nvme0: Removing after probe failure status: -19
[5889557.136469] print_req_error: I/O error, dev nvme1n1, sector 646145507
[5889557.136470] print_req_error: I/O error, dev nvme1n1, sector 652012968
[5889557.136483] zio pool=a-nvme vdev=/dev/disk/by-id/nvme-Samsung_SSD_970_PRO_1TB_S462NF0M503548M-part1 error=5 type=2 offset=333829591040 size=49152 flags=40080c80
One difference is the iDRAC firmware version: the first client's server is a little older, and the second server is busier than the first. On another Dell R720xd with the same firmware as the second server and 2 x Plextor 1TB PX-1TM8PeY NVMe drives, this has never occurred in the few years that server has been running.
As this seems to be a Samsung NVMe controller issue ("Refused to change power state, currently in D3"), we will update the iDRAC firmware on the first server, set default_ps_max_latency_us to 5500 on both Dell servers, and wait to see if the problem comes back.
Edit: I was editing the code section, but failed to insert a new line after every line.
You need to use the correct Markdown syntax with three ticks (I edited your post)
Thank you!
It works
nvme_core.default_ps_max_latency_us=5500
disables the lowest power state, according to the Arch wiki, while nvme_core.default_ps_max_latency_us=0
disables APST entirely. I'm curious if anyone here has a strong opinion on choosing one over the other? I disabled it entirely, which I would think is fine, but I'm wondering if there's some side effect of that I'm not aware of, perhaps a shorter lifespan? My drive is apparently rated for a mean time to failure of ~117 years, so... I probably don't have to worry.
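For anyone wondering why 5500 behaves that way: as I understand the kernel's heuristic (simplified here, and the latency numbers below are illustrative, not from a real id-ctrl dump), a non-operational power state is only auto-entered when its entry plus exit latency fits under default_ps_max_latency_us. A sketch:

```shell
#!/bin/sh
# Simplified model of nvme_core's APST eligibility check: a power state
# is allowed only if entering and leaving it fits within the latency cap.
max_latency_us=5500

apst_eligible() {
    # $1 = state number, $2 = entry latency (us), $3 = exit latency (us)
    total=$(( $2 + $3 ))
    if [ "$total" -le "$max_latency_us" ]; then
        echo "PS$1: eligible (enlat+exlat = ${total}us)"
    else
        echo "PS$1: skipped (enlat+exlat = ${total}us)"
    fi
}

# Illustrative values for a shallow and a deep sleep state:
apst_eligible 3 2000 2000   # fits under 5500us -> kept
apst_eligible 4 500 7000    # deepest state -> excluded at 5500us
```

With a cap of 0 nothing fits, which is why that value disables APST outright.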
This amusing commit message gives the impression that there are a lot of board/drive combos out there suffering from this. Is there somewhere we should be reporting these known incompatibilities? The Linux mailing list?
I've been installing some other stuff in my hypervisor so I haven't started my two-week stability test yet. That begins today, I've got a good feeling but we'll see how it goes.
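Since the failure window is days long, a cheap watchdog in dom0 could catch the signature early instead of waiting for VMs to lock up. This is just my own sketch (the function name and log pattern are mine, based on the dmesg lines in this thread):

```shell
#!/bin/sh
# Scan kernel log text for the controller-down signature seen in this issue.
check_nvme_down() {
    # $1: log text to scan (normally "$(dmesg)")
    if printf '%s\n' "$1" | grep -q 'nvme.*controller is down'; then
        echo "NVME-DOWN"
    else
        echo "OK"
    fi
}

# Demo against the exact line from this issue:
check_nvme_down 'nvme nvme0: controller is down; will reset: CSTS=0xffffffff'
```

Cron something like `check_nvme_down "$(dmesg)"` and alert on NVME-DOWN.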
Forgot to post, here's some good high-level reading on NVMe power states.
Disabling power states mostly increases energy usage for lightly used devices. That's it. There's (usually) no relevant lifespan reduction, etc. Power saving has a whole history of problems: hardware, firmware, driver bugs... the list is as big as the Bible, I guess.
Thanks for the feedback @hkparker, it seems there are indeed some funny things happening in this case. The interesting thing is that since it's due to the interaction between the dom0 kernel and this hardware, it's outside the virtualization scope… Not where I usually hear about problems :D
Hah, yeah. It appears that this issue had nothing to do with xcp-ng, just the wrong combination of NVMe drive and kernel. I just happened to experience it in my xcp-ng install. I'll close this out once I can confirm it was the power state issue with a couple weeks of stability, but I think we've made a really solid case for that being the root cause. Hopefully anyone else running xcp-ng who comes across this will be able to find this thread on google and get it resolved as well.
Thanks again :+1: You could also contribute directly and add a section in the https://xcp-ng.org/docs/troubleshooting.html page (at the bottom, there's a link to edit it directly).
This way, the official doc will be improved with this, and other people won't have to rely on a Google search to know what's going on :+1:
Yeah, I'd love to help with the docs. xcp-ng is such an important part of my project, any way I can give back is nice. I'll start with the USB controller issue we were discussing on the forum, since I'm still waiting on some uptime with this drive before I'm 100% confident we figured it out. Should have something up in a couple days.
Great, thanks!
Alright, opened two PRs for documentation, this NVMe issue and one for the USB PCI passthrough thing. We can discuss any changes you'd like to see on those PRs, so I think I can close this out. Thanks to everyone for helping with the debug process!
Hey everyone,
I'm seeing frequent lockups of an lvm NVMe SR in xcp-ng 8.1. I'm getting pretty confident this is a software issue. For a little context, here's the evolution of my setup...
I was running xcp-ng 7.6 for about a year with my VMs stored on an lvm SR created on a 512GB Samsung 950 PRO connected to my motherboard via an Addonics M2 adapter. This was rock solid.
A month or so ago I upgraded my installation to xcp-ng 8.1, which went flawlessly. Had a couple weeks of stability but then decided I wanted to expand my storage capabilities and migrate my VMs onto a new SSD. I purchased a StarTech PEX8M2E2 and a 2TB Samsung 970 EVO PLUS. I got both drives (the old 512GB and the new 2TB) installed on the new StarTech card, created a new local lvm SR on the 2TB drive, and migrated VMs over. At first everything seemed great.
A couple days into this setup, however, all my VMs locked up and I started seeing tapdisk errors. Here's a more recent sample:
A few minutes later you start to see stuff like this
and this continues to log until a reboot, at which point everything comes back up fine (for a couple days).
I suspected the StarTech card and began running some experiments, mixing and matching hardware to isolate where the issue was. I found that the new 2TB drive is unstable in the old Addonics card, it would lock up with the same errors after a couple days. When I had some VMs running on the old 512GB drive and some on the new 2TB drive, both connected via the StarTech card, only the VMs on the new 2TB drive would lock up after a couple days. So it's not the StarTech card, it's the 2TB NVMe drive.
I RMA'd it and had it replaced, thinking I was just unlucky enough to get a bad drive, but the behavior persists with the new drive. For some reason, xcp-ng 8.1 with an lvm 2TB NVMe SR appears to freeze the disk with tapdisk errors after a couple days. I'm at a loss here; I'm not sure what would be causing this or how to proceed. I really doubt I got a bad drive that fails in the same way twice in a row.
Anyone else seen something like this? Any recommendations for how to look into this further? If we can drive this to completion I'm happy to pay for a pro support plan for this issue.