Open mrpg99 opened 2 years ago

Dear Sir/Madam,
We are trying to set up a SLES 15 SP2 cluster, using xCAT:

lsxcatd -a
Version 2.16.3 (git commit d6c76ae5f66566409c3416c0836660e655632194, built Wed Nov 10 09:58:20 EST 2021)
This is a Management Node

The nodes are stuck in a PXE boot loop and do not even seem to be able to start the post-install script. We found this in /var/log/xcat/computes.log:

Apr 11 22:29:59 cs3-0868 systemd[12350]: INFO Startup finished in 36ms.
Apr 11 22:29:59 cs3-0868 systemd[1]: INFO Started User Manager for UID 0.
Apr 11 22:29:59 cs3-0868 sshd[12348]: INFO pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 11 22:30:05 cs3-0867 xcat: INFO message repeated 7 times: [ Retrying flag update]
Apr 11 22:30:05 cs3-0867 systemd[1]: WARNING xcatpostinit1.service: Stopping timed out. Terminating.
Apr 11 22:30:05 cs3-0867 systemd[1]: NOTICE xcatpostinit1.service: Control process exited, code=killed status=15
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopped xcat service on compute node, the framework to run postbootscript and update node status.
Apr 11 22:30:05 cs3-0867 systemd[1]: NOTICE xcatpostinit1.service: Unit entered failed state.
Apr 11 22:30:05 cs3-0867 systemd[1]: WARNING xcatpostinit1.service: Failed with result 'timeout'.
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopped target Network.
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopping wicked managed network interfaces...
Apr 11 22:30:05 cs3-0867 systemd[1]: INFO Stopped target RDMA Hardware.
Apr 11 22:30:38 cs3-0868 sshd[12348]: INFO Received disconnect from 172.16.136.11 port 46864:11: disconnected by user
Apr 11 22:30:38 cs3-0868 sshd[12348]: INFO Disconnected from user root 172.16.136.11 port 46864
Apr 11 22:30:38 cs3-0868 sshd[12348]: INFO pam_unix(sshd:session): session closed for user root
Apr 11 22:30:38 cs3-0868 systemd-logind[9044]: INFO Session 1 logged out. Waiting for processes to exit.
Apr 11 22:30:38 cs3-0868 systemd-logind[9044]: INFO Removed session 1.
Apr 11 22:30:38 cs3-0868 systemd[1]: INFO Stopping User Manager for UID 0...
Apr 11 22:30:38 cs3-0868 systemd[12350]: INFO Stopped target Default.
Apr 11 22:30:38 cs3-0868 systemd[12350]: INFO Stopped target Basic System.
Apr 11 22:30:38 cs3-0868 systemd[12350]: INFO Stopped target Sockets.

Do you need further info?
Any idea how we can fix this?
Many thanks in advance for any suggestions!
Best Regards Patric
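The timeout in the log above comes from the node-side xcatpostinit1.service unit. As a rough sketch (assuming an affected node stays up long enough to log in), its state can be inspected directly on the node:

# On the looping compute node: inspect the xCAT firstboot unit named in the log
systemctl status xcatpostinit1.service
# Full log for that unit from the current boot
journalctl -u xcatpostinit1.service -b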
@mrpg99 Can you show your node and osimage definitions?
Hi,
lsdef -t osimage sle15.2-x86_64-install-compute
Object name: sle15.2-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osdistroname=sle15.2-x86_64
    osname=Linux
    osvers=sle15.2
    otherpkgdir=/install/post/otherpkgs/sle15.2/x86_64
    partitionfile=/install/partition/nodes-partition
    pkgdir=/install/sle15.2/x86_64
    pkglist=/opt/xcat/share/xcat/install/sle/compute.sle15.pkglist
    profile=compute
    provmethod=install
    template=/opt/xcat/share/xcat/install/sle/compute.sle15.tmpl
lsdef -t node node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.109
    bmcpassword=***
    bmcusername=myusername
    chassis=chassis1
    currchain=boot
    currstate=install sle15.2-x86_64-compute
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.109
    mac=d8:5e:d3:62:19:a8
    mgt=ipmi
    netboot=pxe
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.109
    nicnetworks.bond0=mynet
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack1
    room=dataroom1
    routenames=default_route
    slot=3
    status=powering-off
    statustime=04-12-2022 18:08:16
    updatestatus=synced
    updatestatustime=04-12-2022 15:48:11
    xcatmaster=172.16.136.11
@mrpg99
What OS are you running on the management cluster?
Are you able to see the console as the node is booting? Is there anything in /var/log/consoles/node1.log?
Have you tried booting without the additional postscripts, like setroute, configbond, and base-post?
You can also try running xcatprobe osdeploy -n node1 -V right after you start the installation with rinstall.
What OS are you running on the management cluster? SLES 15 SP2 on everything, including the management node.
Are you able to see the console as the node is booting? Yes, I have a BMC connection to each node, so I can watch as it installs. It is not hitting the postscripts at all; it just keeps rebooting and installing the node from scratch.
Is there anything in /var/log/consoles/node1.log? No, there are no files in there.
Have you tried booting without the additional postscripts, like setroute, configbond, and base-post? Yes, no difference, as the xcatpostinit service dies before it gets to the postscripts.
You can also try running xcatprobe osdeploy -n node1 -V right after you start the installation with rinstall. I use it all the time; unfortunately it gives me no useful info.
Br Patric
We have now discovered that this is only a problem when you re-install a node. It works fine installing a node the first time, but if you want to re-provision the node, that is where we run into the never-ending boot loop.
@mrpg99 This is a good clue. It might be related to https://github.com/xcat2/xcat-core/pull/7135. However, we have only noticed this problem on Power, not on x86.
After the node is installed for the first time, check if there is still a /tftpboot/xcat/xnba/nodes/<node>.pxelinux file around.
If there is, there are a few things you can try:

- nodeset <node> offline, then try to re-provision the node.
- Remove the file /tftpboot/xcat/xnba/nodes/<node>.pxelinux, then try to re-provision the node.
- rsetboot <node> hd, then reboot it with rpower <node> boot. Wait for the node to boot from HD, then try to re-provision the node.

Update on this issue, because we have now found another thing. If we systemctl stop xcat.service and systemctl start xcat.service, then we can run nodeset node1 osimage=sle15.2-x86_64-install-compute and the node will re-install without issues; the status changes from install to boot in the chain table and the postscripts run.
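Putting that together, the workaround sequence that currently works looks roughly like this (the final rpower step is an assumption about how the reboot is triggered):

# On the management node: restart xcatd, regenerate boot config, re-provision
systemctl stop xcat.service
systemctl start xcat.service
nodeset node1 osimage=sle15.2-x86_64-install-compute
rpower node1 boot   # assumed; use whatever you normally do to power-cycle the node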
We checked, and there is no /tftpboot/xcat/xnba/nodes/<node>.pxelinux around.
Best Regards, Lud
Forgot to add to the comment above: only the first installation works without issues after stopping/starting the xcat service. If we set the node to re-install again, we get the PXE boot loop. /Lud
@lud17 After the successful first installation (after starting/stopping the xcat service):

- What are the contents of /tftpboot/xcat/xnba/nodes/<node>?
- If you are able to log in to the successfully installed compute node, are any errors reported in /var/log/xcat/xcat.log on that node?
- Are you able to just reboot the successfully installed compute node with rpower <node> boot? Or does it try to enter the same PXE boot loop?
There is no content. We do not have xnba/nodes/<node> at all. Here is what we have:
xcatmn:~# ls /tftpboot/xcat/
elilo-x64.efi osimage xnba.efi xnba.kpxe
If you are able to log in to the successfully installed compute node, are any errors reported in /var/log/xcat/xcat.log on that node?
Yes, I am able to log in to the node. This is the only error we get:
Thu Apr 14 21:33:45 CEST 2022 [info]: xcat.deployment: finished firstboot preparation, sending request to 172.16.136.11:3002 for changing status...
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: Retrying flag update
updateflag.awk: flag update failed
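Those retries mean the node is not getting an answer from xcatd on 172.16.136.11:3002 when it tries to update its status. A minimal sketch for testing that path from a node, assuming bash with /dev/tcp support:

# MASTER is the xcatmaster from the node definition; 3002 is the port in the log above
MASTER=172.16.136.11
if timeout 5 bash -c "exec 3<>/dev/tcp/$MASTER/3002" 2>/dev/null; then
    echo "port 3002 on $MASTER answers"
else
    echo "cannot reach $MASTER:3002 - updateflag will keep retrying and then fail"
fi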
Are you able to just reboot the successfully installed compute node with rpower <node> boot? Or does it try to enter the same PXE boot loop?
Yes, I can run rpower <node> boot after a successful installation.
Try changing the node definition attribute from netboot=pxe to netboot=xnba.
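For reference, a sketch of that change plus regenerating the boot configuration afterwards (osimage name taken from earlier in this thread):

# On the management node
chdef node1 netboot=xnba
nodeset node1 osimage=sle15.2-x86_64-install-compute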
I've tried changing to netboot=xnba; same as before, the first installation is not a problem, but re-installation of the node results in a loop.
Did every single step below, and all gave the same result: install loop. For some reason the only thing that works is to stop/start the xcat service.

- nodeset <node> offline, then try to re-provision the node.
- Remove the file /tftpboot/xcat/xnba/nodes/<node>.pxelinux, then try to re-provision the node.
- rsetboot <node> hd, then reboot it with rpower <node> boot. Wait for the node to boot from HD, then try to re-provision the node.
@lud17
Can you try the following sequence of steps (with netboot=xnba) and post the output:

On management node:
restartxcatd
lsdef node1
ls -l /tftpboot/xcat/xnba/nodes/
xcatprobe xcatmn -i <interface facing the node1>
xcatprobe detect_dhcpd -i <interface facing the node1> -m d8:5e:d3:62:19:a8
chdef -t site clustersite xcatdebugmode=1
rinstall node1 osimage=sle15.2-x86_64-install-compute
ssh node1

On node1:
systemctl status firewalld
cat /var/log/xcat/xcat.log

On management node:
lsdef node1
ls -l /tftpboot/xcat/xnba/nodes/
rinstall node1 osimage=sle15.2-x86_64-install-compute
lsdef node1
ls -l /tftpboot/xcat/xnba/nodes/
tail -100 /var/log/xcat/computes.log
chdef -t site clustersite xcatdebugmode=0
Can you try the following sequence of steps (with netboot=xnba) and post the output:

- On management node: restartxcatd

restartxcatd
restartxcatd invoked by root.
Restarting xCATd [ OK ]

- On management node: lsdef node1

lsdef node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.107
    bmcpassword=ourpassword
    bmcusername=admin
    chassis=chassie01
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=boot
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.107
    mac=D8:5E:D3:62:19:C4
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.107
    nicnetworks.bond0=ournetwork
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack01
    room=PVSV01:20
    routenames=default_route
    slot=1
    status=booting
    statustime=04-21-2022 05:23:32
    xcatmaster=172.16.136.11

- On management node: ls -l /tftpboot/xcat/xnba/nodes/

This is a cluster of over 800 nodes, so I won't type the list of files in here, but there are two files for each node: node1 and node1.uefi.

- On management node: xcatprobe xcatmn -i <interface facing the node1>

xcatprobe xcatmn -i bond1
[mn]: Checking all xCAT daemons are running... [ OK ]
[mn]: Checking xcatd can receive command request... [ OK ]
[mn]: Checking 'site' table is configured... [ OK ]
[mn]: Checking provision network is configured... [ OK ]
[mn]: Checking 'passwd' table is configured... [ OK ]
[mn]: Checking important directories(installdir,tftpdir) are configured... [ OK ]
[mn]: Checking SELinux is disabled... [ OK ]
[mn]: Checking HTTP service is configured... [ OK ]
[mn]: Checking TFTP service is configured... [ OK ]
[mn]: Checking DNS service is configured... [WARN]
[mn]: DNS nameserver 127.0.0.1 can not resolve 172.16.136.11
[mn]: Checking DHCP service is configured... [ OK ]
[mn]: Checking NTP service is configured... [FAIL]
[mn]: chronyd did not synchronize.
[mn]: Checking rsyslog service is configured... [ OK ]
[mn]: Checking firewall is disabled... [ OK ]
[mn]: Checking minimum disk space for xCAT ['/install' needs 10GB;'/tmp' needs 1GB;'/var' needs 1GB]... [ OK ]
[mn]: Checking Linux ulimits configuration... [ OK ]
[mn]: Checking network kernel parameter configuration... [ OK ]
[mn]: Checking xCAT daemon attributes configuration... [ OK ]
[mn]: Checking xCAT log is stored in /var/log/xcat/cluster.log... [ OK ]
[mn]: Checking xCAT management node IP: <172.16.136.11> is configured to static... [ OK ]
[mn]: Checking dhcpd.leases file is less than 100M... [ OK ]
[mn]: Checking DB packages installation... [ OK ]
=================================== SUMMARY ====================================
[MN]: Checking on MN... [FAIL]
    Checking DNS service is configured... [WARN]
        DNS nameserver 127.0.0.1 can not resolve 172.16.136.11
    Checking NTP service is configured... [FAIL]
        chronyd did not synchronize.
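As an aside, the two flagged checks can be verified directly on the management node; a quick sketch, assuming a local nameserver on 127.0.0.1 and chrony as the NTP client (as the probe output suggests):

# Reproduce the DNS warning: can the local nameserver resolve the MN address?
nslookup 172.16.136.11 127.0.0.1
# See why chronyd reports unsynchronized (check "Leap status" and the selected source)
chronyc tracking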
- On management node: chdef -t site clustersite xcatdebugmode=1

chdef -t site clustersite xcatdebugmode=1
1 object definitions have been created or modified

- On management node: rinstall node1 osimage=sle15.2-x86_64-install-compute (wait for the node to finish installation; since this is the first time after restart, I assume the node will install successfully)

rinstall node1 osimage=sle15.2-x86_64-install-compute
Provision node(s): node1

- On management node: ssh node1
- On node1: systemctl status firewalld

systemctl status firewalld
Unit firewalld.service could not be found.

- On node1: cat /var/log/xcat/xcat.log

The log is here: https://pastebin.com/99qpTt80

- On management node: lsdef node1

lsdef node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.107
    bmcpassword=ourpassword
    bmcusername=admin
    chassis=chassie01
    cons=ipmi
    consoleenabled=1
    currchain=standby
    currstate=standby
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.107
    mac=D8:5E:D3:62:19:C4
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.107
    nicnetworks.bond0=ournetwork
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack01
    room=PVSV01:20
    routenames=default_route
    slot=1
    status=booting
    statustime=04-28-2022 04:19:20
    xcatmaster=172.16.136.11

- On management node: ls -l /tftpboot/xcat/xnba/nodes/

The files for node1 are still there: node1 and node1.uefi.

- On management node: rinstall node1 osimage=sle15.2-x86_64-install-compute (wait a few minutes; I assume the node will enter the never-ending boot loop you describe)

rinstall node1 osimage=sle15.2-x86_64-install-compute
Provision node(s): node1

- On management node: lsdef node1

lsdef node1
Object name: node1
    arch=x86_64
    bmc=172.16.147.107
    bmcpassword=ourpassword
    bmcusername=admin
    chassis=chassie01
    cons=ipmi
    consoleenabled=1
    currchain=standby
    currstate=standby
    groups=all,bmc,sles15.2,base
    installnic=mac
    ip=172.16.139.107
    mac=D8:5E:D3:62:19:C4
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=eth2|eth3
    nicextraparams.bond0=BONDING_MODULE_OPTS=mode=1;miimon=100
    nicips.bond0=10.122.19.107
    nicnetworks.bond0=ournetwork
    nictypes.eth2=Ethernet
    nictypes.bond0=Bond
    nictypes.eth3=Ethernet
    os=sle15.2
    postbootscripts=otherpkgs,setroute
    postscripts=syslog,remoteshell,syncfiles,configbond bond0 eth2@eth3,base-post,setupntp
    profile=compute
    provmethod=sle15.2-x86_64-install-compute
    rack=rack01
    room=PVSV01:20
    routenames=default_route
    slot=1
    status=powering-on
    statustime=04-28-2022 04:19:20
    xcatmaster=172.16.136.11

- On management node: ls -l /tftpboot/xcat/xnba/nodes/

The files for node1 are still there.

- On management node: tail -100 /var/log/xcat/computes.log

The log is here: https://pastebin.com/xWN84h0Q

- On management node: chdef -t site clustersite xcatdebugmode=0

chdef -t site clustersite xcatdebugmode=0
1 object definitions have been created or modified.
Best Regards Patric
OK, more info; things are getting more confusing.
We were looking at different logs and saw that AppArmor was running on the headnode. YaST says AppArmor is disabled (the tickbox is not ticked), but the systemd service was alive.
So we stopped and masked the AppArmor systemd service and rebooted.
Now we can install the same node two times! But the chain for the node after the first install is:
"node1","standby","standby",,,,
After the second install the chain is set to:
"node1","install sle15.2-x86_64-compute","boot",,,,
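Those rows come from the xCAT chain table; for anyone following along, they can be dumped on the management node with something like:

# Show the chain table entry for node1 (currstate/currchain drive the next PXE action)
tabdump chain | grep '^"node1"'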
But here is another twist: xCAT has now changed the boot order of the node; it boots straight from HDD without trying to PXE boot.
Can anyone help us?
Br Patric
Do you mean that after the second install is finished, if you run rinstall node1 osimage=<osimage>, the node will just boot from HDD instead of reinstalling the specified <osimage>?

Yes, correctly understood.
The chain is still "node1","install sle15.2-x86_64-compute","boot",,,,
But the node bypasses the PXE boot order we have set.
Br Patric
Check the contents of /tftpboot/xcat/xnba/nodes/node1 before and after the second boot.
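A sketch of one way to capture that comparison (file path from the suggestion above; the snapshot location is arbitrary):

# Snapshot the generated boot config, re-provision, then compare
cp /tftpboot/xcat/xnba/nodes/node1 /tmp/node1.xnba.before
rinstall node1 osimage=sle15.2-x86_64-install-compute
# ...after the re-install attempt...
diff /tmp/node1.xnba.before /tftpboot/xcat/xnba/nodes/node1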
We went back to the sequence of steps, and apparently we do have an error:
xcatmgn:~ # xcatprobe xcatmn -i bond1
[mn]: Checking all xCAT daemons are running... [FAIL]
[mn]: Daemon 'install monitor' isn't running
=================================== SUMMARY ===========================...
[MN]: Checking on MN... [FAIL]
Checking all xCAT daemons are running... [FAIL]
Daemon 'install monitor' isn't running
Br Lud
When you run this ps command, do you see something similar?
# ps aux 2>&1|grep -v grep|grep xcatd
root 4561 0.2 1.3 136004 53784 ? S 11:23 0:00 xcatd SSL: genimage for root@localhost
root 4562 0.0 1.2 135784 48972 ? S 11:23 0:00 xcatd SSL: genimage for root@localhost: genimag
root 31530 0.0 1.1 128940 45236 ? Ss 07:51 0:00 xcatd: SSL listener
root 31531 0.0 1.1 129436 46296 ? S 07:51 0:11 xcatd: DB Access
root 31532 0.0 1.1 128940 44780 ? S 07:51 0:02 xcatd: UDP listener
root 31533 0.0 1.4 148216 57852 ? S 07:51 0:00 xcatd: install monitor
root 31534 0.0 1.1 129080 44832 ? S 07:51 0:00 xcatd: Command log writer
root 31535 0.0 1.1 128940 44872 ? S 07:51 0:00 xcatd: Discovery worker
#
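Equivalently, if pgrep is available, the same listing without the grep chain:

# List all xcatd workers with their full command lines
pgrep -af xcatd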
Somewhat similar, I would say. When I did get the Daemon 'install monitor' isn't running error, install monitor was of course not present in the list below. I had to restart xCAT.
# ps aux 2>&1|grep -v grep|grep xcatd
root 27526 0.0 0.0 165624 84172 ? Ss 14:16 0:00 xcatd: SSL listener
root 27527 0.0 0.0 190064 108904 ? S 14:16 0:06 xcatd: DB Access
root 27528 0.0 0.0 165176 83148 ? S 14:16 0:02 xcatd: UDP listener
root 27529 0.0 0.0 165176 82664 ? S 14:16 0:00 xcatd: install monitor
root 27530 0.0 0.0 165176 82304 ? S 14:16 0:00 xcatd: Discovery worker
root 27531 0.0 0.0 165328 83108 ? S 14:16 0:00 xcatd: Command log writer
/Lud
Can you check if the install monitor is still running after the first successful install, and if it disappears after the second successful install?
install monitor is not running after the first successful install; it has already disappeared.
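One way to catch the moment it dies is to leave a poll running on the management node during an install (a sketch; the interval is arbitrary):

# Refresh every 10 seconds; the line disappears when the install monitor worker dies
watch -n 10 "ps aux | grep 'xcatd: install monitor' | grep -v grep"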