xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

xcatpostinit not responding / crashing? #7142

Open mrpg99 opened 2 years ago

mrpg99 commented 2 years ago

Dear Sir/Madam, we are trying to set up a SLES 15 SP2 cluster using xCAT:

```
lsxcatd -a
Version 2.16.3 (git commit d6c76ae5f66566409c3416c0836660e655632194, built Wed Nov 10 09:58:20 EST 2021)
This is a Management Node
```

The nodes are stuck in a PXE boot loop and do not even seem to start the post-install script. We found this in /var/log/xcat/computes.log:

```
Apr 11 22:29:59 cs3-0868 systemd[12350]: INFO Startup finished in 36ms.
Apr 11 22:29:59 cs3-0868 systemd[1]: INFO Started User Manager for UID 0.
Apr 11 22:29:59 cs3-0868 sshd[12348]: INFO pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 11 22:30:05 cs3-0867 xcat: INFO message repeated 7 times: [ Retrying flag update]
Apr 11 22:30:05 cs3-0867 systemd[1]: WARNING xcatpostinit1.service: Stopping timed out. Terminating.
Apr 11 22:30:05 cs3-0867 systemd[1]: NOTICE xcatpostinit1.service: Control process exited, code=killed status=15
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopped xcat service on compute node, the framework to run postbootscript and update node status.
Apr 11 22:30:05 cs3-0867 systemd[1]: NOTICE xcatpostinit1.service: Unit entered failed state.
Apr 11 22:30:05 cs3-0867 systemd[1]: WARNING xcatpostinit1.service: Failed with result 'timeout'.
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopped target Network.
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopping wicked managed network interfaces...
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopped target RDMA Hardware.
Apr 11 22:30:38 cs3-0868 sshd[12348]: INFO Received disconnect from 172.16.136.11 port 46864:11: disconnected by user
Apr 11 22:30:38 cs3-0868 sshd[12348]: INFO Disconnected from user root 172.16.136.11 port 46864
Apr 11 22:30:38 cs3-0868 sshd[12348]: INFO pam_unix(sshd:session): session closed for user root
Apr 11 22:30:38 cs3-0868 systemd-logind[9044]: INFO Session 1 logged out. Waiting for processes to exit.
Apr 11 22:30:38 cs3-0868 systemd-logind[9044]: INFO Removed session 1.
Apr 11 22:30:38 cs3-0868 systemd[1]: INFO Stopping User Manager for UID 0...
Apr 11 22:30:38 cs3-0868 systemd[12350]: INFO Stopped target Default.
Apr 11 22:30:38 cs3-0868 systemd[12350]: INFO Stopped target Basic System.
Apr 11 22:30:38 cs3-0868 systemd[12350]: INFO Stopped target Sockets.
```

Do you need further info?

Any idea how we can fix this?

Many thanks in advance for any suggestions!

Best Regards Patric

gurevichmark commented 2 years ago

@mrpg99 Can you show your node and osimage definitions?

mrpg99 commented 2 years ago

Hi,

```
lsdef -t osimage sle15.2-x86_64-install-compute
Object name: sle15.2-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osdistroname=sle15.2-x86_64
    osname=Linux
    osvers=sle15.2
    otherpkgdir=/install/post/otherpkgs/sle15.2/x86_64
    partitionfile=/install/partition/nodes-partition
    pkgdir=/install/sle15.2/x86_64
    pkglist=/opt/xcat/share/xcat/install/sle/compute.sle15.pkglist
    profile=compute
    provmethod=install
    template=/opt/xcat/share/xcat/install/sle/compute.sle15.tmpl
```

```
lsdef -t node node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.109
    bmcpassword=***
    bmcusername=myusername
    chassis=chassis1
    currchain=boot
    currstate=install sle15.2-x86_64-compute
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.109
    mac=d8:5e:d3:62:19:a8
    mgt=ipmi
    netboot=pxe
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.109
    nicnetworks.bond0=mynet
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack1
    room=dataroom1
    routenames=default_route
    slot=3
    status=powering-off
    statustime=04-12-2022 18:08:16
    updatestatus=synced
    updatestatustime=04-12-2022 15:48:11
    xcatmaster=172.16.136.11
```

gurevichmark commented 2 years ago

@mrpg99

What OS are you running on the management cluster?

Are you able to see the console as the node is booting?

Is there anything in /var/log/consoles/node1.log ?

Have you tried booting without the additional postscripts, like setroute, configbond and base-post?

You can also try running `xcatprobe osdeploy -n node1 -V` right after you start the installation with `rinstall`.
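As a side note, the log in the first post shows `xcatpostinit1.service` being killed because its stop timed out while the flag update was still retrying. One possible mitigation (a sketch only, not an official xCAT fix; the unit name is taken from the log and the 300-second value is an arbitrary assumption) is a systemd drop-in that lengthens the stop timeout:

```shell
# Sketch: install a systemd drop-in that raises the stop timeout for
# xcatpostinit1.service. The target directory is a parameter so the function
# can be exercised outside /etc; pass no argument on a real compute node.
install_dropin() {
    dir="${1:-/etc/systemd/system/xcatpostinit1.service.d}"
    mkdir -p "$dir"
    cat > "$dir/stop-timeout.conf" <<'EOF'
[Service]
TimeoutStopSec=300
EOF
}

# On a real node, follow up with:
#   systemctl daemon-reload
```

Whether a longer stop timeout actually helps depends on why the flag update fails in the first place; it only buys the retries more time.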

mrpg99 commented 2 years ago

> What OS are you running on the management cluster?

SLES 15 SP2 on everything, including the management node.

> Are you able to see the console as the node is booting?

Yes, I have a BMC connection to each node, so I can watch as it installs. It is not hitting the postscript file at all; it just constantly reboots and reinstalls the node from scratch.

> Is there anything in /var/log/consoles/node1.log ?

No files in there.

> Have you tried booting without the additional postscripts, like setroute, configbond and base-post?

Yes, no difference, as the xcatpostinit service dies before it gets to the post script.

> You can also try running xcatprobe osdeploy -n node1 -V right after you start the installation with rinstall

I use it all the time; unfortunately it gives me no useful info.

Br Patric

mrpg99 commented 2 years ago

And we have now discovered that this is only a problem when you re-install a node. It works fine installing a node the first time, but if you want to re-provision the node, that is where we run into the never-ending boot loop.

gurevichmark commented 2 years ago

@mrpg99 This is a good clue. It might be related to https://github.com/xcat2/xcat-core/pull/7135. However, we have only noticed this problem on Power, not on x86.

After the node is installed for the first time, check if /tftpboot/xcat/xnba/nodes/<node>.pxelinux is still around. If it is, there are a few things you can try:
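A quick way to check for leftover per-node boot files is a small sketch like this (the directory and file names are taken from this thread; the directory is a parameter so the function can be tried against any path):

```shell
# stale_boot_files NODE [DIR]: print any leftover boot files for NODE under
# DIR (default is the path discussed in this thread); returns 0 if at least
# one such file exists, 1 otherwise.
stale_boot_files() {
    node="$1"
    dir="${2:-/tftpboot/xcat/xnba/nodes}"
    hits=0
    for f in "$dir/$node" "$dir/$node.pxelinux" "$dir/$node.uefi"; do
        if [ -e "$f" ]; then
            echo "$f"
            hits=$((hits + 1))
        fi
    done
    [ "$hits" -gt 0 ]
}
```

For example, `stale_boot_files node1` on the management node would list whatever nodeset left behind for node1.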

lud17 commented 2 years ago

Update on this issue, because we have now found another thing. If we run `systemctl stop xcat.service` and `systemctl start xcat.service`, then we can run `nodeset node1 osimage=sle15.2-x86_64-install-compute` and the node will re-install without issues; the status will change from install to boot in the chain table, and the post scripts will run.

We checked, and there is no /tftpboot/xcat/xnba/nodes/<node>.pxelinux around.

Best Regards, Lud

lud17 commented 2 years ago

Forgot to add to the comment above: only the first installation works without issues after stopping/starting the xcat service. If we set the node to re-install again, we get the PXE boot loop. /Lud

gurevichmark commented 2 years ago

@lud17 After the successful first installation (after starting/stopping xcat service):

lud17 commented 2 years ago

> What are the contents of tftpboot/xcat/xnba/nodes/ ?

There is no content. We do not have xnba/nodes/<node> at all. Here is what we have:

xcatmn:~# ls /tftpboot/xcat/
elilo-x64.efi  osimage  xnba.efi  xnba.kpxe

> If you are able to login to the successfully installed compute node, any errors reported in /var/log/xcat/xcat.log on that node?

Yes, I am able to log in to the node. This is the only error we get:

```
Thu Apr 14 21:33:45 CEST 2022 [info]: xcat.deployment: finished firstboot preparation, sending request to 172.16.136.11:3002 for changing status...
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: flag update failed
```
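For context, the retry-then-give-up pattern updateflag.awk prints here can be sketched as a shell loop. The real script talks to xcatd on port 3002; in this sketch the update action is passed in as a command (`"$@"`) so the pattern itself can be shown in isolation:

```shell
# update_flag TRIES CMD...: run CMD until it succeeds, at most TRIES times,
# printing messages in the same shape as the updateflag.awk log lines above.
# Returns 0 on success, 1 after the final attempt fails.
update_flag() {
    tries="$1"; shift
    n=1
    while [ "$n" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        echo "updateflag.awk: Retrying flag update" >&2
        n=$((n + 1))
    done
    echo "updateflag.awk: flag update failed" >&2
    return 1
}
```

The log above is the failure branch of exactly this shape: every attempt to reach 172.16.136.11:3002 fails, so after the retry budget is exhausted the node never gets its status flag updated.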

> Are you able to just reboot the successfully installed compute node with rpower boot ? Or does it try to enter the same pxe boot loop ?

Yes, I can run rpower <node> boot after successful installation.

gurevichmark commented 2 years ago

Try changing the node definition attribute from netboot=pxe to netboot=xnba

lud17 commented 2 years ago

I've tried changing to netboot=xnba; same as before, the first installation is not a problem. Re-installation of the node resulted in a loop.

I did every single step below, and all gave the same result: an install loop. For some reason the only thing that works is to stop/start the xcat service.

  • nodeset <node> offline, then try to re-provision the node.

  • Remove /tftpboot/xcat/xnba/nodes/<node>.pxelinux, then try to re-provision the node.

  • rsetboot <node> hd, then reboot it with rpower <node> boot. Wait for the node to boot from HD, then try to re-provision the node.

gurevichmark commented 2 years ago

@lud17

Can you try the following sequence of steps (with netboot=xnba) and post the output:

mrpg99 commented 2 years ago

> Can you try the following sequence of steps (with netboot=xnba) and post the output:

  • On management node: restartxcatd

```
restartxcatd
restartxcatd invoked by root.
Restarting xCATd   [ OK ]
```

  • On management node: lsdef node1

```
lsdef node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.107
    bmcpassword=ourpassword
    bmcusername=admin
    chassis=chassie01
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=boot
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.107
    mac=D8:5E:D3:62:19:C4
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.107
    nicnetworks.bond0=ournetwork
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack01
    room=PVSV01:20
    routenames=default_route
    slot=1
    status=booting
    statustime=04-21-2022 05:23:32
    xcatmaster=172.16.136.11
```

  • On management node: ls -l /tftpboot/xcat/xnba/nodes/ — this is a cluster of over 800 nodes, so I won't paste the full file list here, but there are two files for each node: node1 and node1.uefi

  • On management node: xcatprobe xcatmn -i <interface facing the node1>

```
xcatprobe xcatmn -i bond1
[mn]: Checking all xCAT daemons are running...                             [ OK ]
[mn]: Checking xcatd can receive command request...                        [ OK ]
[mn]: Checking 'site' table is configured...                               [ OK ]
[mn]: Checking provision network is configured...                          [ OK ]
[mn]: Checking 'passwd' table is configured...                             [ OK ]
[mn]: Checking important directories(installdir,tftpdir) are configured... [ OK ]
[mn]: Checking SELinux is disabled...                                      [ OK ]
[mn]: Checking HTTP service is configured...                               [ OK ]
[mn]: Checking TFTP service is configured...                               [ OK ]
[mn]: Checking DNS service is configured...                                [WARN]
[mn]: DNS nameserver 127.0.0.1 can not resolve 172.16.136.11
[mn]: Checking DHCP service is configured...                               [ OK ]
[mn]: Checking NTP service is configured...                                [FAIL]
[mn]: chronyd did not synchronize.
[mn]: Checking rsyslog service is configured...                            [ OK ]
[mn]: Checking firewall is disabled...                                     [ OK ]
[mn]: Checking minimum disk space for xCAT ['/install' needs 10GB;'/tmp' needs 1GB;'/var' needs 1GB]... [ OK ]
[mn]: Checking Linux ulimits configuration...                              [ OK ]
[mn]: Checking network kernel parameter configuration...                   [ OK ]
[mn]: Checking xCAT daemon attributes configuration...                     [ OK ]
[mn]: Checking xCAT log is stored in /var/log/xcat/cluster.log...          [ OK ]
[mn]: Checking xCAT management node IP: <172.16.136.11> is configured to static... [ OK ]
[mn]: Checking dhcpd.leases file is less than 100M...                      [ OK ]
[mn]: Checking DB packages installation...                                 [ OK ]
=================================== SUMMARY ====================================
[MN]: Checking on MN...                                                    [FAIL]
    Checking DNS service is configured...                                  [WARN]
        DNS nameserver 127.0.0.1 can not resolve 172.16.136.11
    Checking NTP service is configured...                                  [FAIL]
        chronyd did not synchronize.
```

  • On management node: chdef -t site clustersite xcatdebugmode=1

```
chdef -t site clustersite xcatdebugmode=1
1 object definitions have been created or modified
```

  • On management node: rinstall node1 osimage=sle15.2-x86_64-install-compute

```
rinstall node1 osimage=sle15.2-x86_64-install-compute
Provision node(s): node1
```

  • Wait for node to finish installation. Since this is first time after restart, I assume the node will install successfully.

  • On management node: ssh node1

  • On node1: systemctl status firewalld

```
systemctl status firewalld
Unit firewalld.service could not be found.
```

  • On node1: cat /var/log/xcat/xcat.log — the log is here: https://pastebin.com/99qpTt80

  • On management node: lsdef node1

```
lsdef node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.107
    bmcpassword=ourpassword
    bmcusername=admin
    chassis=chassie01
    cons=ipmi
    consoleenabled=1
    currchain=standby
    currstate=standby
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.107
    mac=D8:5E:D3:62:19:C4
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.107
    nicnetworks.bond0=ournetwork
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack01
    room=PVSV01:20
    routenames=default_route
    slot=1
    status=booting
    statustime=04-28-2022 04:19:20
    xcatmaster=172.16.136.11
```

  • On management node: ls -l /tftpboot/xcat/xnba/nodes/ — the files for node1 are still there: node1 and node1.uefi

  • On management node: rinstall node1 osimage=sle15.2-x86_64-install-compute

```
rinstall node1 osimage=sle15.2-x86_64-install-compute
Provision node(s): node1
```

  • Wait a few minutes; I assume the node will enter the never-ending boot loop you describe.

  • On management node: lsdef node1

```
lsdef node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.107
    bmcpassword=ourpassword
    bmcusername=admin
    chassis=chassie01
    cons=ipmi
    consoleenabled=1
    currchain=standby
    currstate=standby
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.107
    mac=D8:5E:D3:62:19:C4
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.107
    nicnetworks.bond0=ournetwork
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack01
    room=PVSV01:20
    routenames=default_route
    slot=1
    status=powering-on
    statustime=04-28-2022 04:19:20
    xcatmaster=172.16.136.11
```

  • On management node: ls -l /tftpboot/xcat/xnba/nodes/ — the files for node1 are still there.

  • On management node: tail -100 /var/log/xcat/computes.log

The log is here: https://pastebin.com/xWN84h0Q

  • On management node: chdef -t site clustersite xcatdebugmode=0

```
chdef -t site clustersite xcatdebugmode=0
1 object definitions have been created or modified.
```

Best Regards Patric

mrpg99 commented 2 years ago

OK, more info; things are getting more confusing.

We were looking at different logs and saw that AppArmor was running on the head node. YaST says AppArmor is disabled (the tickbox is not ticked), but the systemd service is alive.

So we stopped and masked the apparmor systemd service and rebooted.

Now we can install the same node twice! But the chain for the node after the first install is:

"node1","standby","standby",,,,

After the second install the chain is set to : "node1","install sle15.2-x86_64-compute","boot",,,,

But here is another twist: xCAT has now changed the boot order of the node, and it boots straight from the HDD without trying to PXE boot.

Can anyone help us?

Br Patric

gurevichmark commented 2 years ago

Do you mean that after the second install is finished, if you run rinstall node1 osimage=<osimage>, the node will just boot from hdd instead of reinstalling the specified <osimage> ?

mrpg99 commented 2 years ago

> Do you mean that after the second install is finished, if you run rinstall node1 osimage=<osimage>, the node will just boot from hdd instead of reinstalling the specified <osimage> ?

Yes, correctly understood.

chain is still "node1","install sle15.2-x86_64-compute","boot",,,,

But the node bypasses the PXE boot order we have set.

Br Patric

gurevichmark commented 2 years ago

Check contents of /tftpboot/xcat/xnba/nodes/node1 before and after the second boot.

lud17 commented 2 years ago

I went back to the sequence of steps, and apparently we do have an error:

xcatmgn:~ # xcatprobe xcatmn -i bond1
[mn]: Checking all xCAT daemons are running...                            [FAIL]
[mn]: Daemon 'install monitor' isn't running
=================================== SUMMARY ===========================...
[MN]: Checking on MN...                                                   [FAIL]
    Checking all xCAT daemons are running...                              [FAIL]
        Daemon 'install monitor' isn't running

Br Lud

gurevichmark commented 2 years ago

When you run this ps command, do you see something similar ?

# ps aux 2>&1|grep -v grep|grep xcatd
root      4561  0.2  1.3 136004 53784 ?        S    11:23   0:00 xcatd SSL: genimage for root@localhost
root      4562  0.0  1.2 135784 48972 ?        S    11:23   0:00 xcatd SSL: genimage for root@localhost: genimag
root     31530  0.0  1.1 128940 45236 ?        Ss   07:51   0:00 xcatd: SSL listener
root     31531  0.0  1.1 129436 46296 ?        S    07:51   0:11 xcatd: DB Access
root     31532  0.0  1.1 128940 44780 ?        S    07:51   0:02 xcatd: UDP listener
root     31533  0.0  1.4 148216 57852 ?        S    07:51   0:00 xcatd: install monitor
root     31534  0.0  1.1 129080 44832 ?        S    07:51   0:00 xcatd: Command log writer
root     31535  0.0  1.1 128940 44872 ?        S    07:51   0:00 xcatd: Discovery worker
#
lud17 commented 2 years ago

Somewhat similar, I would say. When I did get the Daemon 'install monitor' isn't running error, install monitor was of course not present in the list below. I had to restart xCAT.

# ps aux 2>&1|grep -v grep|grep xcatd
root     27526  0.0  0.0 165624 84172 ?        Ss   14:16   0:00 xcatd: SSL listener                            
root     27527  0.0  0.0 190064 108904 ?       S    14:16   0:06 xcatd: DB Access                               
root     27528  0.0  0.0 165176 83148 ?        S    14:16   0:02 xcatd: UDP listener                            
root     27529  0.0  0.0 165176 82664 ?        S    14:16   0:00 xcatd: install monitor                         
root     27530  0.0  0.0 165176 82304 ?        S    14:16   0:00 xcatd: Discovery worker                        
root     27531  0.0  0.0 165328 83108 ?        S    14:16   0:00 xcatd: Command log writer   

/Lud

gurevichmark commented 2 years ago

Can you check if the install monitor is still running after the first successful install, and if it disappears after the second successful install?
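A tiny check for this (a sketch; it just scans ps-style text from stdin for the worker name shown in the listings above) could be:

```shell
# has_install_monitor: read ps-style output on stdin and succeed only if the
# 'xcatd: install monitor' worker appears in it (ignoring any grep processes).
has_install_monitor() {
    grep -v grep | grep -q "xcatd: install monitor"
}

# Live usage on the management node (assumption: the xcatd workers show up in
# ps with the names listed earlier in this thread):
#   ps aux | has_install_monitor && echo "install monitor running"
```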

lud17 commented 2 years ago

install monitor is not running after the first successful install; it has already disappeared.