xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

xCAT genesis freezes when compute nodes have certain NVMe SSDs installed #6997

Open ruixingw opened 3 years ago

ruixingw commented 3 years ago

I am using xCAT Version 2.16.2 (git commit a16ec7a7b05dfc37d6727121b1cbaed094a3ff04, built Thu May 20 17:08:12 EDT 2021) on CentOS 8.3.2011.

I usually provision compute nodes with NVMe SSDs. I found that when compute nodes have certain SSDs installed, xCAT genesis freezes almost immediately, shortly after the PXE stage when genesis has just booted. The screen shows no useful information: sometimes a few lines of normal messages are printed, and sometimes it freezes with a black screen.

The SSDs affected in my tests so far are the WD SN550 (250GB/500GB) and the Samsung 980 (500GB). We have many of these SSDs and have installed them in many nodes, so this is not just one broken SSD or node. On the other hand, the Samsung PM981 (500GB) does not have this problem.
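For anyone comparing drive models, here is a minimal sketch for collecting the exact model and firmware strings from a node that is already running a provisioned OS (not from genesis, which freezes). It reads the standard Linux sysfs attributes `model` and `firmware_rev` under `/sys/class/nvme`. The function name `list_nvme_models` is hypothetical and not part of xCAT.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): print model and firmware revision
# for every NVMe controller found under a sysfs root.
# Usage: list_nvme_models [sysfs_root]   (default: /sys/class/nvme)
list_nvme_models() {
    root=${1:-/sys/class/nvme}
    for ctrl in "$root"/nvme*; do
        # Skip entries without a readable model attribute (also covers the
        # case where the glob matched nothing and stayed literal).
        [ -r "$ctrl/model" ] || continue
        # The kernel may pad these strings with trailing spaces; trim them.
        model=$(sed 's/[[:space:]]*$//' "$ctrl/model")
        fw=$(sed 's/[[:space:]]*$//' "$ctrl/firmware_rev" 2>/dev/null)
        echo "$(basename "$ctrl"): $model (firmware ${fw:-unknown})"
    done
}
```

Running `list_nvme_models` on an affected node and on an unaffected one gives strings that can be pasted into the issue for comparison.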

As a workaround (which is also how I located the problem), I run the discovery process without the SSD. After the node is successfully discovered, I reinstall the SSD and then provision the system.

I am not sure whether you are aware of this problem or whether there is a fix. Let me know if you want me to run more tests.

gurevichmark commented 3 years ago

So this problem is only seen during discovery? Once the node definition is created, is provisioning that node, diskless or diskful, OK with those disks installed?

ruixingw commented 3 years ago

Exactly. I only do diskful installations, though.

gurevichmark commented 3 years ago

Is the behavior the same if you boot the node into genesis with rinstall <node> shell?

ruixingw commented 3 years ago

I don't use IPMI/BMC, so I tried nodeset <node> shell instead, which should be equivalent. Yes, the behavior is the same. I tried both Legacy PXE and UEFI PXE, and neither worked: the node still freezes just after PXE.

Run discovery without SSD installed --> Install SSD and provision OS --> OS is successfully provisioned and booted --> nodeset <node> shell --> reboot the compute node --> boot from PXE and soon freezes with a black screen

During this I noticed something else, and I am not sure whether it is an issue. Both nodeset <node> shell and nodeset <node> osimage=<osimage> modify /tftpboot/xcat/xnba/nodes/<node> and /tftpboot/xcat/xnba/nodes/<node>.uefi, which is fine, so I can boot from both legacy PXE and UEFI PXE. However, nodeset <node> boot only changes the legacy file (to exit) and leaves the UEFI one unchanged. I'm not sure whether this is intended, because although the UEFI PXE file is left in place, it is not loaded at all: the client waits for a server response until it times out and then boots from the HDD, so it still boots up. If you think this is a bug, I'll submit a separate issue to track it (or you can do so).
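To make the nodeset observation easier to reproduce, here is a minimal sketch that reports which of the two xNBA files for a node changed after a nodeset call. The helper name `xnba_changed` and the marker-file approach are assumptions for illustration and not part of xCAT. It only compares file modification times against a marker file created beforehand.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): report whether <node> and
# <node>.uefi in the given directory are newer than a marker file.
# Usage: xnba_changed <nodes_dir> <node> <marker_file>
xnba_changed() {
    nodes_dir=$1; node=$2; marker=$3
    for f in "$nodes_dir/$node" "$nodes_dir/$node.uefi"; do
        if [ ! -e "$f" ]; then
            echo "$(basename "$f"): missing"
        elif [ "$f" -nt "$marker" ]; then
            # Modified after the marker was created
            echo "$(basename "$f"): updated"
        else
            echo "$(basename "$f"): unchanged"
        fi
    done
}

# Example use on the management node (replace <node> with a real node name):
#   touch /tmp/nodeset.marker
#   nodeset <node> boot
#   xnba_changed /tftpboot/xcat/xnba/nodes <node> /tmp/nodeset.marker
```

If the behavior described above holds, the legacy file would be reported as updated and the .uefi file as unchanged after nodeset <node> boot. File systems with coarse timestamps may need a short pause between creating the marker and running nodeset.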

ruixingw commented 2 years ago

We just found that when a dedicated GPU is installed and used for video output, we can see a kernel panic related to nvme.

[Screenshot: nvme-kernelpanic]