xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Help Request: PXE freezes during node deployment #7461

Open speedymiata opened 1 month ago

speedymiata commented 1 month ago

I'm trying to use xcat to deploy rhel 8.9 onto a compute node, but the compute node fails to finish booting at this point:

Configuring (net0 ac:1f:6b:bc:db:ec)...... ok
net0: 192.168.32.12/255.255.240.0 gw 192.168.47.245
net0: fe80::ae1f:6bff:febc:dbec/64
Next server: 192.168.47.245
Filename: http://192.168.47.245:80/tftpboot/xcat/xnba/nets/192.168.32.0_20.uefi
http://192.168.47.245:80/tftpboot/xcat/xnba/nets/192.168.32.0_20.uefi... ok
192.168.32.0_20.uefi : 304 bytes [script]
http://192.168.47.245:80/tftpboot/xcat/genesis.kernel.x86_64... ok
http://192.168.47.245:80/tftpboot/xcat/genesis.fs.x86_64.gz... ok

During this process, I see this on the xcat head node:

[root@xcat_adm ~]# xcatprobe osdeploy -n cn01
The install NIC in current server is ib0                                                                       [INFO]
All nodes to be deployed are valid                                                                             [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[cn01] 13:56:23 Receive DHCPDISCOVER via ens2f0
[cn01] 13:56:24 Send DHCPOFFER on 192.168.32.72 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:26 DHCPREQUEST for 192.168.32.72 (192.168.47.245) from ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:26 Send DHCPACK on 192.168.32.72 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:26 Via TFTP download xcat/xnba.efi
[cn01] 13:56:27 Via TFTP download xcat/xnba.efi
[cn01] 13:56:30 Receive DHCPDISCOVER via ens2f0
[cn01] 13:56:31 Send DHCPOFFER on 192.168.32.12 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:31 DHCPREQUEST for 192.168.32.12 (192.168.47.245) from ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:31 Send DHCPACK on 192.168.32.12 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:39 Via HTTP get /tftpboot/xcat/xnba/nets/192.168.32.0_20.uefi
[cn01] 13:56:39 Via HTTP get /tftpboot/xcat/genesis.kernel.x86_64
[cn01] 13:56:39 Via HTTP get /tftpboot/xcat/genesis.fs.x86_64.gz
[cn01] 13:57:23 Receive DHCPDISCOVER via ens2f0
[cn01] 13:57:24 Send DHCPOFFER on 192.168.32.28 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:57:24 DHCPREQUEST for 192.168.32.28 (192.168.47.245) from ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:57:24 Send DHCPACK on 192.168.32.28 back to ac:1f:6b:bc:db:ec via ens2f0
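One detail in the capture above is that each DHCP round hands out a different address. Here is a minimal sketch in plain Python (not part of xCAT; the helper name `offered_addresses` is hypothetical) that pulls the offered addresses out of log lines copied from the output above:

```python
import re

# Sample lines copied from the xcatprobe capture above.
LOG = """\
[cn01] 13:56:24 Send DHCPOFFER on 192.168.32.72 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:56:31 Send DHCPOFFER on 192.168.32.12 back to ac:1f:6b:bc:db:ec via ens2f0
[cn01] 13:57:24 Send DHCPOFFER on 192.168.32.28 back to ac:1f:6b:bc:db:ec via ens2f0
"""

# Matches "[node] HH:MM:SS Send DHCPOFFER on <ip> back to <mac> ..."
OFFER_RE = re.compile(
    r"^\[(?P<node>[^\]]+)\]\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"Send DHCPOFFER on (?P<ip>\d+(?:\.\d+){3}) back to (?P<mac>[0-9a-f:]{17})",
    re.IGNORECASE,
)


def offered_addresses(log_text):
    """Return a (time, ip) tuple for every DHCPOFFER line in the log text."""
    offers = []
    for line in log_text.splitlines():
        match = OFFER_RE.match(line.strip())
        if match:
            offers.append((match.group("time"), match.group("ip")))
    return offers


offers = offered_addresses(LOG)
distinct = sorted({ip for _, ip in offers})
print(f"{len(offers)} offers, {len(distinct)} distinct addresses: {distinct}")
```

Three different addresses for the same MAC may suggest that the node is being served from a dynamic pool rather than from a fixed per-host entry, which would normally return the same address every time. That is an interpretation of the log, not something xcatprobe reports.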

I still have a lot to learn about xcat, so I'll be extremely grateful for any and all help that's offered.

Additional information:

[root@xcat_adm ~]# lsdef -t node cn01
Object name: cn01
    arch=x86_64
    bmc=192.168.36.48
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=install rhels8.9.0-x86_64-compute
    getmac=ipmi
    hostnames=cn01
    ip=192.168.84.248
    mac=ac:1f:6b:bc:db:ec
    mgt=ipmi
    netboot=xnba
    nicips.ib0=192.168.84.248
    nicips.ipmi=192.168.36.48
    nicips.eno1=192.168.36.248
    nicnetworks.eno1=ipmi-net
    nicnetworks.ib0=ib-net
    nictypes.eno1=Ethernet
    nictypes.ib0=InfiniBand
    os=rhels8.9.0
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    profile=compute
    provmethod=rhels8.9.0-x86_64-install-compute
    serialport=1
    serialspeed=115200
    status=powering-on
    statustime=08-01-2024 13:54:21
[root@xcat_adm ~]# lsdef -t osimage rhels8.9.0-x86_64-install-compute
Object name: rhels8.9.0-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osdistroname=rhels8.9.0-x86_64
    osname=Linux
    osvers=rhels896.0
    partitionfile=s:/install/custom/partitionfile/rhels8.9.0-x86_64-install-compute_partitions.sh
    pkgdir=/install/rhels8.9.0/x86_64
    pkglist=/install/custom/pkglist/rhel8-pkglist-compute.pkglist
    postscripts=custom/rhel-8.9-postscript-compute.sh
    profile=compute
    provmethod=install
    template=/install/custom/template/rhels8.9.0-x86_64-install-compute.tmpl

speedymiata commented 1 month ago

It's been a few days since I posted this help request. Is there another forum I should repost the request to?

Obihoernchen commented 1 month ago

Your config seems to be a little bit weird.

Do you really want to deploy the image via IPoIB?

    ip=192.168.84.248
    mac=ac:1f:6b:bc:db:ec
    nicips.ib0=192.168.84.248

The MAC address seems to belong to the Ethernet device, but you specify an IPoIB IP for the node. ip needs to be your eno1 IP and match the MAC address you specify, not the ib0 IP.

What does makedhcp -q cn01 show?
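The mismatch can be checked against the network the node actually PXE-boots on. The sketch below uses Python's standard `ipaddress` module with values copied from the output earlier in this thread. The helper name `on_network` is hypothetical, and the provisioning network is inferred from the address and netmask shown on the node's console.

```python
import ipaddress

# Provisioning network as reported on the node's console:
# "net0: 192.168.32.12/255.255.240.0". strict=False lets us pass a host
# address and get the enclosing network back.
prov_net = ipaddress.ip_network("192.168.32.12/255.255.240.0", strict=False)

# Addresses taken from the lsdef and PXE output in this thread.
candidates = {
    "ip / nicips.ib0": "192.168.84.248",
    "nicips.eno1": "192.168.36.248",
    "next server": "192.168.47.245",
}


def on_network(address, network):
    """Return True if the address falls inside the given network."""
    return ipaddress.ip_address(address) in network


print(f"provisioning network: {prov_net}")
for label, addr in candidates.items():
    verdict = "inside" if on_network(addr, prov_net) else "outside"
    print(f"{label:16} {addr:16} {verdict}")
```

The eno1 address and the next-server address fall inside 192.168.32.0/20, the same network named in the 192.168.32.0_20.uefi script, while the ib0 address does not.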

speedymiata commented 1 month ago

I do not want to deploy the image via IPoIB. That's the joy of an inherited system, right there - I want to use the Ethernet network for image deployment.

makedhcp -q cn01 shows:

[root@xcat_adm ~]# makedhcp -q  cn01
cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec

I also went ahead and executed chdef cn01 ip=192.168.36.248 to try to get the system to deploy the image over the Ethernet network, but this didn't have the desired effect. The node still hangs at the same point in the boot process. What else do I need to do, to switch to Ethernet?

Obihoernchen commented 1 month ago

Did you run nodeset/rinstall cn01 osimage=rhels8.9.0-x86_64-install-compute afterwards? Also, make sure the node boots via Ethernet first, or disable the IB PXE ROM. You may also want to disable DHCP on your IPoIB network by setting site.dhcpinterfaces to your management node's Ethernet interface.

speedymiata commented 1 month ago

I used rinstall, yes, and I followed it up with xcatprobe osdeploy -n cn01. I'm also using rcons to manually select the Eth interface as the boot device - it is most certainly starting with it first.

After running rinstall, the makedhcp command's output still hasn't changed. It still shows the IB interface's IP of .84.248.

Obihoernchen commented 1 month ago

Oh sorry, yes, you need to run makedhcp cn01 first. Then makedhcp -q cn01 should show the correct IP.

speedymiata commented 1 month ago

This seems odd. After running makedhcp cn01, re-running makedhcp -q cn01 does not indicate that a change was made. The IB address is still present.

But according to my lsdef for this node, I've set ip to the Ethernet interface's address. Are there any other items I should check?

[root@xcat_adm ~]# lsdef  cn01
Object name: cn01
    arch=x86_64
    bmc=192.168.36.48
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=install rhels8.6.0-x86_64-compute
    getmac=ipmi
    hostnames=cn01
    installnic=ac:1f:6b:bc:db:ec
    ip=192.168.36.248
    mac=ac:1f:6b:bc:db:ec
    mgt=ipmi
    netboot=xnba
    nicips.ib0=192.168.84.248
    nicips.ipmi=192.168.36.48
    nicips.eno1=192.168.36.248
    nicnetworks.eno1=ipmi-net
    nicnetworks.ib0=ib-net
    nictypes.eno1=Ethernet
    nictypes.ib0=InfiniBand
    os=rhels8.6.0
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    profile=compute
    provmethod=rhels8.6.0-x86_64-install-compute-
    serialport=1
    serialspeed=115200
    status=powering-on

speedymiata commented 1 month ago

Here's what happened while "playing" with the makedhcp command after referencing the man page:

[root@wxcat_adm ~]# makedhcp -q  cn01
cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec
[root@wxcat_adm ~]# makedhcp -d  cn01
[root@wxcat_adm ~]# makedhcp -q  cn01
[root@wxcat_adm ~]# makedhcp -n  cn01
Renamed existing dhcp configuration file to  /etc/dhcp/dhcpd.conf.xcatbak

Warning: [wxcat_adm]: No dynamic range specified for 192.168.80.0. If hardware discovery is being used, a dynamic range is required.
[root@wxcat_adm ~]# makedhcp -q  cn01
[root@wxcat_adm ~]# makedhcp   cn01
[root@wxcat_adm ~]# makedhcp -q  cn01
cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec

I'll admit that I still have a lot to learn about xcat, but it still seems quite strange that it's not "picking up" the IP address I've specified in the node definition. Is there something I have to refresh? Apply?
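The by-eye comparison between the makedhcp -q output and the node definition can also be scripted. This is a minimal sketch in plain Python, assuming the single-line output format shown above; `parse_makedhcp_query` is a hypothetical helper, not an xCAT API, and the expected address is the ip value from the lsdef output.

```python
import re

# Matches "<node>: ip-address = <ip>, hardware-address = <mac>"
ENTRY_RE = re.compile(
    r"^(?P<node>\S+):\s*ip-address\s*=\s*(?P<ip>[0-9.]+),\s*"
    r"hardware-address\s*=\s*(?P<mac>[0-9A-Fa-f:]+)\s*$"
)


def parse_makedhcp_query(line):
    """Split one line of 'makedhcp -q' output into node, ip and mac."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized makedhcp -q line: {line!r}")
    return match.groupdict()


# Line copied from the makedhcp -q output above.
entry = parse_makedhcp_query(
    "cn01: ip-address = 192.168.84.248, hardware-address = ac:1f:6b:bc:db:ec"
)

# ip attribute from the lsdef output after the chdef.
expected_ip = "192.168.36.248"

if entry["ip"] != expected_ip:
    print(f"{entry['node']}: DHCP entry has {entry['ip']}, "
          f"node definition has {expected_ip}")
else:
    print(f"{entry['node']}: DHCP entry matches the node definition")
```

With the values from this thread, the check reports the mismatch: the DHCP entry still carries the ib0 address even though the node definition now lists the eno1 address.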