Open ruixingw opened 3 years ago
So this problem is only seen during discovery? Once the node definition is created, is provisioning that node diskless or diskful OK with those disks installed?
Exactly. I only do diskful installations, though.
Is the behavior the same if you boot the node into genesis with rinstall <node> shell?
I don't use IPMI/BMC, so I tried nodeset <node> shell instead, which should be the same thing. And yes, it's the same behavior. I tried both legacy PXE and UEFI PXE, but neither worked: the node still freezes just after PXE.
Run discovery without SSD installed --> install SSD and provision OS --> OS is successfully provisioned and booted --> nodeset <node> shell --> reboot the compute node --> boot from PXE and soon freezes with a black screen
During this I found another thing; I'm not sure whether it is an issue or not:
nodeset <node> shell or nodeset <node> osimage=<osimage> modifies both /tftpboot/xcat/xnba/nodes/<node> and /tftpboot/xcat/xnba/nodes/<node>.uefi, which is okay, so I can boot from both legacy PXE and UEFI PXE; but nodeset <node> boot only modifies the legacy one to exit, leaving the UEFI one unchanged.
I'm not sure if this is intended, because even though the UEFI PXE file is left there, it was not loaded at all. The client just waits for server response until timeout, then boots from HDD, so it still boots up. If you think this is a bug, I'll submit a separate issue to track this (or you can do it too).
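In case it helps anyone reproduce the nodeset observation, here is a rough sketch of how the two files could be compared after each nodeset call. check_xnba and the XNBA_DIR override are hypothetical names of mine, not part of xCAT, and the check for a line starting with "exit" is only my assumption about what the legacy file looks like after nodeset <node> boot.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): report whether the legacy and
# UEFI xNBA files for a node exist and whether each has a line starting
# with "exit". The "exit" check is an assumption based on what I saw in
# the legacy file after "nodeset <node> boot".
check_xnba() {
    node="$1"
    # Default to the directory from this issue; XNBA_DIR is an override
    # I added so the function can be tried against a scratch directory.
    dir="${XNBA_DIR:-/tftpboot/xcat/xnba/nodes}"
    for f in "$dir/$node" "$dir/$node.uefi"; do
        if [ ! -f "$f" ]; then
            echo "$f: missing"
        elif grep -q '^exit' "$f"; then
            echo "$f: contains an exit line"
        else
            echo "$f: no exit line"
        fi
    done
}

# Example (cn01 is a placeholder node name):
#   nodeset cn01 boot
#   check_xnba cn01
```

If the two lines of output differ after nodeset <node> boot but agree after nodeset <node> shell, that would match what I described above.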
We just found that when a dedicated GPU is installed and used for video output, we can see a kernel panic related to nvme.
I am using xCAT Version 2.16.2 (git commit a16ec7a7b05dfc37d6727121b1cbaed094a3ff04, built Thu May 20 17:08:12 EDT 2021) on CentOS 8.3.2011.
I usually provision compute nodes with NVMe SSDs. I found that if a compute node has certain SSDs installed, xCAT genesis freezes right at the start (soon after the PXE process, just as genesis boots). The screen shows no useful information: sometimes a few lines of normal messages are printed, and sometimes it just freezes with a black screen.
The SSDs affected so far in my tests are the WD SN550 (250GB/500GB) and the Samsung 980 (500GB). We have a lot of these SSDs and have installed many nodes, so it's not just a broken SSD or node. On the other hand, the Samsung PM981 (500GB) does not have this problem.
As a workaround (and that's how I located the problem), I run the discovery process without the SSD. After the node is successfully discovered, I install the SSD again and then provision the system.
Are you aware of this problem, and is there a fix? Let me know if you want me to do more tests.