xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Nodestat reporting an incorrect osimage when net booting #7223

Open Emohseni opened 2 years ago

Emohseni commented 2 years ago

The system uses stateless compute nodes. The nodes boot to the correct osimage, but the xCAT nodestat command reports the wrong osimage. This happens on multiple nodes. Prior to this run, the osimage ending in _t0 was removed from the xCAT tables, but nodestat continues to report the wrong netboot image, as shown below.

nodeset xyz6cn-024 osimage=abc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_prod

After a cold boot of the node, nodestat reports netboot on an image that is no longer defined:

[root@xyz6mgt-001 2022/07/27 19:30:47]/install/abc/backup-2022-07-19_1906# nodestat xyz6cn-024
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_t0

[root@xyz6mgt-001 2022/07/27 19:28:13]/install/abc/backup-2022-07-19_1906# nodestat xyz6cn-024 -m
xyz6cn-024: nrpe,pbs,ssh

Upon validation, the node is booted to the correct osimage and not the _t0 image listed above.

gurevichmark commented 2 years ago

@Emohseni

Emohseni commented 2 years ago

Version 2.16.1.

[root@xyz6sn-001 ~]# lsdef xyz6cn-024
Object name: xyz6cn-024
    addkcmdline=nomodeset intel_pstate=passive clocksource=tsc tsc=reliable
    arch=x86_64
    bmc=xyz6cn-024-rm
    chain=runcmd=bmcsetup,shell
    chassis=xyz6smm-002
    cons=ipmi
    currstate=netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
    groups=r1,r2cn,cn,all,compute,sd650v2,ipmi
    ip=
    mac=
    mgt=ipmi
    netboot=xnba
    nicextraparams.ib0=GATEWAY=
    nichostnamesuffixes.ib0=-ib
    nicips.ib0=
    nicnetworks.ib0=ib
    nictypes.ib0=Infiniband
    nodetype=osi
    ondiscover=nodediscover
    os=rhels8.3.0
    otherinterfaces=-rm:
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,setupntp,syncfiles,confignetwork -s,abc/postinst.sh
    profile=abc_xyz_compute_prod
    provmethod=abc_xyz_compute_prod
    serialport=0
    serialspeed=115200
    servicenode=xyz6sn-001,xyz6sn-002
    slot=12
    status=failed
    statustime=07-27-2022 19:33:47
    updatestatus=syncing
    updatestatustime=07-07-2022 23:32:33
    xcatmaster=xyz6sn-001-cn

The output above was taken after the node had finished the boot process.

gurevichmark commented 2 years ago

I think netboot <osimage name> should only be reported while the node is booting. Once the boot process is finished, nodestat should display sshd. Perhaps your node did not cleanly finish the boot process? I noticed status=failed. Check /var/log/xcat/xcat.log on the compute node to see if any errors were logged.

Emohseni commented 2 years ago

The reported error is that nodestat shows the wrong osimage while the node is netbooting. When the node boots up successfully, it does report sshd and the other services. Rerunning nodeset does not change the output of nodestat, even though the _t0 osimage definition was removed from the cluster and is no longer shown by lsdef -t osimage.

Emohseni commented 2 years ago

The error is that nodestat does not report the node's current netboot osimage.

gurevichmark commented 2 years ago

@Emohseni

It looks like, while a diskless node is booting, nodestat gets the node status by calling nodeset <node> stat. Can you run nodeset xyz6cn-024 stat while the node xyz6cn-024 is booting and nodestat xyz6cn-024 reports the incorrect osimage name?

Emohseni commented 2 years ago

[root@xyz6sn-001 ~]# nodeset xyz6cn-024 stat
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
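
The three lines above do not agree on the osimage. A minimal sketch for flagging that kind of disagreement automatically, assuming the status lines arrive on standard input. `distinct_states` is a hypothetical helper name, not part of xCAT, and the sample text is copied from the output above.

```shell
# Hypothetical helper (not an xCAT command): count the distinct status lines
# read from stdin, e.g. lines like "xyz6cn-024: netboot <osimage>".
distinct_states() {
    sort -u | wc -l | tr -d ' '
}

# Sample input copied from the output above. On a live cluster one would pipe
# the output of `nodeset xyz6cn-024 stat` into the function instead.
sample='xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0'

count=$(printf '%s\n' "$sample" | distinct_states)
if [ "$count" -gt 1 ]; then
    echo "inconsistent: $count distinct states reported"
fi
```

A count above 1 only shows that the reports differ. It does not say which server produced the stale line.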

gurevichmark commented 2 years ago

How about ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* and grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*

Emohseni commented 2 years ago

Service node:

[root@xyz6sn-001 ~]# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024*
-rw-r--r-- 1 root root 584 Jul 27 19:28 /tftpboot/xcat/xnba/nodes/xyz6cn-024
-rw-r--r-- 1 root root 550 Jul 27 19:28 /tftpboot/xcat/xnba/nodes/xyz6cn-024.uefi

[root@xyz6sn-001 ~]# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* |grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
/tftpboot/xcat/xnba/nodes/xyz6cn-024:#netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod

Management node:

[root@xyz6mgt-001 2022/07/28 18:53:08]/var/log/xcat# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* |grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
/tftpboot/xcat/xnba/nodes/xyz6cn-024:#netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
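
The two grep results above differ only in the `#netboot` comment line. A rough sketch of how the comparison could be scripted, assuming the node file carries a single comment of the form `#netboot <osimage>` as seen in the grep output. `recorded_image` and `compare_node_files` are hypothetical helper names, and the temporary files merely stand in for the service node and management node copies.

```shell
# Hypothetical helper: print the osimage named in the "#netboot ..." comment
# line of an xNBA node file (line format assumed from the grep output above).
recorded_image() {
    sed -n 's/^#netboot[[:space:]]\{1,\}//p' "$1" | head -n 1
}

# Hypothetical helper: compare the recorded osimage in two copies of a node file.
# Returns 0 when they match and 1 when they differ.
compare_node_files() {
    a=$(recorded_image "$1")
    b=$(recorded_image "$2")
    if [ "$a" = "$b" ]; then
        echo "match: $a"
    else
        echo "mismatch: $a vs $b"
        return 1
    fi
}

# Demo with temporary files standing in for the two /tftpboot copies.
sn_copy=$(mktemp)
mn_copy=$(mktemp)
printf '%s\n' '#netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod' > "$sn_copy"
printf '%s\n' '#netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0' > "$mn_copy"
compare_node_files "$sn_copy" "$mn_copy" || true
rm -f "$sn_copy" "$mn_copy"
```

On a real cluster the second file would have to be fetched from the other server first, for example with scp. That step is left out here.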

Emohseni commented 2 years ago

It seems /tftpboot on the main management node is not being updated or cleaned up. What is the action plan? Why is the management node not being updated by nodeset osimage?

gurevichmark commented 2 years ago

What are your sharedtftp and tftpdir settings in the site table on the management node?

Emohseni commented 2 years ago

Management node

[root@xyz6mgt-001 ]~#  lsdef -t site clustersite | grep shared
    sharedtftp=0
[root@xyz6mgt-001 ]~#  lsdef -t site clustersite | grep tft
    sharedtftp=0
    tftpdir=/tftpboot

Servicenode

[root@xyz6sn-001 ~]# lsdef -t site clustersite | grep tft
    sharedtftp=0
    tftpdir=/tftpboot
[root@xyz6sn-001 ~]# lsdef -t site clustersite | grep shared
    sharedtftp=0

gurevichmark commented 2 years ago

By default sharedtftp=1. With that setting, /tftpboot is mounted on the service node from the management node. If sharedtftp=0, you need to manually update /tftpboot on the service node.
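
As a small illustration of the check described above, the sketch below reads `lsdef -t site clustersite` style output and reports whether /tftpboot is treated as shared. `sharedtftp_value` is a hypothetical helper name, the fallback to 1 simply mirrors the default stated in the comment above, and the sample text is taken from the site table output earlier in the thread.

```shell
# Hypothetical helper: read site-table output on stdin and print the
# sharedtftp value, falling back to 1 when the attribute is not listed.
sharedtftp_value() {
    value=$(sed -n 's/^[[:space:]]*sharedtftp=//p' | head -n 1)
    echo "${value:-1}"
}

# Sample input copied from the thread. On a live cluster one would pipe
# `lsdef -t site clustersite` into the function instead.
sample='    sharedtftp=0
    tftpdir=/tftpboot'

setting=$(printf '%s\n' "$sample" | sharedtftp_value)
if [ "$setting" = "0" ]; then
    echo "sharedtftp=0: /tftpboot is not shared and must be kept in sync manually"
else
    echo "sharedtftp=$setting: /tftpboot is expected to be mounted from the management node"
fi
```

The sketch compares against the literal value 0 only. Other spellings of the setting are not handled.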