xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Postinstall and xcatdebugmode #4604

Closed yardan11 closed 6 years ago

yardan11 commented 6 years ago

We have encountered a strange issue: running rinstall on any node "hangs" after the postscripts finish. The node never reboots, and therefore never continues to the postbootscripts. Diagnosing it led to the strange part: setting xcatdebugmode=1 in the site table SOLVED the issue, still without any error appearing in any log.

We have verified this is indeed the case: setting it to 0 reverts to a non-functioning rinstall; setting it back to 1 makes rinstall work without an issue.

We would like to work without debugmode - what might be the issue and how can we solve this?

xCAT version - 2.13.8
xCAT node OS - RHEL 7.4
Nodes OS - RHEL 6.5

zet809 commented 6 years ago

Hi, @immarvin , will you pls help to take a look at this issue? Thx!

immarvin commented 6 years ago

hi @yardan11 , what is the output of lsdef <node> -i status when rinstall hangs?

yardan11 commented 6 years ago

Hi

Will send data asap.


yardan11 commented 6 years ago

Hi

Here is the output:


zet809 commented 6 years ago

hi, @yardan11 , no data came through in your response.

yardan11 commented 6 years ago

Hi

I sent 2 files as attachments.


immarvin commented 6 years ago

hi @yardan11 , I cannot find the attachments in the GitHub comments. You might need to attach the files to the GitHub thread https://github.com/xcat2/xcat-core/issues/4604 instead of replying to the notification email.

yardan11 commented 6 years ago

whatsapp image 2018-01-04 at 15 52 37 whatsapp image 2018-01-04 at 15 38 31

yardan11 commented 6 years ago

Hi

I attached the files to the thread.


immarvin commented 6 years ago

strange that the status is 'ping'. What is the output of lsdef -t site -o clustersite -i nodestatus?

immarvin commented 6 years ago

when the node hangs, please run nodeset <node> boot and rpower <node> boot. You can then access the installed OS via ssh or rcons. Please provide the content of the file /var/log/xcat/xcat.log on the provisioned node.

immarvin commented 6 years ago

the /var/log/xcat/xcat.log file records the execution of the postscripts and postbootscripts (if run), so we may be able to find the last step before the hang

zet809 commented 6 years ago

hi, @yardan11 , any updates for this issue?

yardan11 commented 6 years ago

Hi Yes - there are updates:

  1. When running "lsdef -t site -o clustersite -i nodestatus" I get nothing - and when running the same command without "-i nodestatus" I see that there is no "nodestatus" entry in the output.

  2. One more thing that we want to note: we added debug prints to all of our postscripts and postbootscripts. We see that all postscripts finish running, but the node does not reboot. When we reboot it manually, without any changes to the xCAT tables (no nodeset is required, and ssh to the node is also possible while it is stuck), it just works: it boots up, runs the postbootscripts and is done. So apparently the node does succeed in telling xCAT that it is done running the scripts, but it just doesn't reboot on its own. What do you think? Please tell us if you still need the logs.

They have the reboot line, just as you demonstrated. By comparing the template to the old xCAT's template we see that they have the same structure more or less, with a "reboot" command on exactly the same place. Needless to say that the current template does work with xcatdebugmode=1, so I find it a little bit hard to believe that the setup hangs because of a missing reboot directive.

yardan11 commented 6 years ago

Tue Jan 9 10:16:01 IST 2018 Running postscript: remoteshell
Tue Jan 9 10:21:00 IST 2018 postscript remoteshell return with 0
Tue Jan 9 10:21:00 IST 2018 Running postscript: syslog
Tue Jan 9 10:21:00 IST 2018 postscript syslog return with 0
Tue Jan 9 10:21:00 IST 2018 Running postscript: syncfiles
Tue Jan 9 10:21:00 IST 2018 postscript syncfiles return with 0
Tue Jan 9 10:21:00 IST 2018 Running postscript: ospkgs
Tue Jan 9 10:21:05 IST 2018 postscript ospkgs return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_install_date
Tue Jan 9 10:21:05 IST 2018 postscript tv_install_date return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_hardents
Tue Jan 9 10:21:05 IST 2018 postscript tv_hardents return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_click_hpc_nodes
Tue Jan 9 10:21:17 IST 2018 postscript tv_click_hpc_nodes return with 0
Tue Jan 9 10:21:17 IST 2018 Running postscript: tv_mlnx_ofed
Tue Jan 9 10:25:23 IST 2018 postscript tv_mlnx_ofed return with 0
Tue Jan 9 10:25:23 IST 2018 Running postscript: tv_setupservices
Tue Jan 9 10:25:23 IST 2018 postscript tv_setupservices return with 0

Osimage:
Object name: rhels6.5-x86_64-install-compute
imagetype=linux
osarch=x86_64
osdistroname=rhels6.5-x86_64
osname=Linux
osvers=rhels6.5
otherpkgdir=/install/post/otherpkgs/rhels6.5/x86-64
otherpkglist=/install/custom/install/rh/compute.rhels6.5.otherpkgs.pkglist
pkgdir=/install/rhels6.5/x86_64
pkglist=/install/custom/install/rh/compute.rhels6.5.x86_64.pkglist
profile=compute
provmethod=install
template=/install/custom/install/rh/compute.rhels6.tmpl

immarvin commented 6 years ago

hi @yardan11 , does the content of /var/log/xcat/xcat.log you provided include the postbootscripts run after the manual reboot? I noticed a 4-minute gap at the position marked HERE in the log entries below. Is that when the node hung and you rebooted it manually?

Tue Jan 9 10:16:01 IST 2018 Running postscript: remoteshell
Tue Jan 9 10:21:00 IST 2018 postscript remoteshell return with 0
Tue Jan 9 10:21:00 IST 2018 Running postscript: syslog
Tue Jan 9 10:21:00 IST 2018 postscript syslog return with 0
Tue Jan 9 10:21:00 IST 2018 Running postscript: syncfiles
Tue Jan 9 10:21:00 IST 2018 postscript syncfiles return with 0
Tue Jan 9 10:21:00 IST 2018 Running postscript: ospkgs
Tue Jan 9 10:21:05 IST 2018 postscript ospkgs return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_install_date
Tue Jan 9 10:21:05 IST 2018 postscript tv_install_date return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_hardents
Tue Jan 9 10:21:05 IST 2018 postscript tv_hardents return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_click_hpc_nodes
Tue Jan 9 10:21:17 IST 2018 postscript tv_click_hpc_nodes return with 0
Tue Jan 9 10:21:17 IST 2018 Running postscript: tv_mlnx_ofed
>>>>>>>>>>>>>>>>>>>>>
HERE
<<<<<<<<<<<<<<<<<<<<<
Tue Jan 9 10:25:23 IST 2018 postscript tv_mlnx_ofed return with 0
Tue Jan 9 10:25:23 IST 2018 Running postscript: tv_setupservices
Tue Jan 9 10:25:23 IST 2018 postscript tv_setupservices return with 0
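
The gap check described above can also be done mechanically. The following is a minimal sketch, not part of xCAT: it assumes every entry in /var/log/xcat/xcat.log starts with a timestamp in the form quoted in this thread (weekday, month, day, time, timezone, year), and the function names `parse_line` and `find_gaps` are hypothetical. The timezone token is dropped rather than interpreted, which is fine as long as all entries share one timezone.

```python
from datetime import datetime

def parse_line(line):
    """Return (timestamp, message) for a log entry, or None for other lines."""
    parts = line.split()
    # Expected tokens: weekday, month, day, HH:MM:SS, timezone, year, message...
    if len(parts) < 7:
        return None
    stamp = " ".join([parts[1], parts[2], parts[3], parts[5]])
    try:
        when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
    except ValueError:
        return None  # not a timestamped entry (e.g. a marker line)
    return when, " ".join(parts[6:])

def find_gaps(lines, threshold_seconds=60):
    """List (seconds, previous message, next message) for every long pause."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    gaps = []
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        delta = (t1 - t0).total_seconds()
        if delta >= threshold_seconds:
            gaps.append((delta, m0, m1))
    return gaps

# A few entries copied from the log in this thread, plus one marker line.
SAMPLE = [
    "Tue Jan 9 10:16:01 IST 2018 Running postscript: remoteshell",
    "Tue Jan 9 10:21:00 IST 2018 postscript remoteshell return with 0",
    "Tue Jan 9 10:21:00 IST 2018 Running postscript: syslog",
    "Tue Jan 9 10:21:17 IST 2018 Running postscript: tv_mlnx_ofed",
    "HERE",
    "Tue Jan 9 10:25:23 IST 2018 postscript tv_mlnx_ofed return with 0",
]

if __name__ == "__main__":
    for seconds, before, after in find_gaps(SAMPLE):
        print(f"{seconds:.0f}s between '{before}' and '{after}'")
```

Run against the entries quoted here, this kind of check flags the tv_mlnx_ofed step and also the remoteshell step, which took almost as long. A long step is not necessarily a hang; it only shows where the time went.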
immarvin commented 6 years ago

btw, I tried to recreate the issue by provisioning rhels6.5-x86_64 on an iDataPlex dx360 M4, and the provisioning works fine.

[root@c910f04x30v02 rpms]# pasu c910f04x29 show
c910f04x29: IMM.Cert_CSR_Export_Format=DER
c910f04x29: IMM.SSH_SERVER_KEY=Installed
c910f04x29: IMM.SSL_HTTPS_SERVER_CERT=Private Key and CA-signed cert installed.
c910f04x29: IMM.SSL_HTTPS_SERVER_CSR=Private Key and CA-signed cert installed.
c910f04x29: IMM.SSL_LDAP_CLIENT_CERT=Private Key and Cert/CSR not available.
c910f04x29: IMM.SSL_LDAP_CLIENT_CSR=Private Key and Cert/CSR not available.
c910f04x29: IMM.SSL_SERVER_DIRECTOR_CERT=Private Key and CA-signed cert installed.
c910f04x29: IMM.SSL_SERVER_DIRECTOR_CSR=Private Key and CA-signed cert installed.
c910f04x29: IMM.SSL_SKR_CLIENT_CERT=Cannot get Status. The FOD activation key is not installed!
c910f04x29: IMM.SSL_SKR_CLIENT_CSR=Cannot get Status. The FOD activation key is not installed!
c910f04x29: IMM.SSL_CLIENT_TRUSTED_CERT_SKR=Cannot get Status. The FOD activation key is not installed!
c910f04x29: IMM.SSL_CLIENT_TRUSTED_CERT1=Not-Installed
c910f04x29: IMM.SSL_CLIENT_TRUSTED_CERT2=Not-Installed
c910f04x29: IMM.SSL_CLIENT_TRUSTED_CERT3=Not-Installed
c910f04x29: IMM.PowerRestorePolicy=Restore
c910f04x29: IMM.ThermalModePolicy=Normal
c910f04x29: IMM.FrontButton_PWR_Perm=Enabled
c910f04x29: IMM.ForceBootToUefi=Disabled
c910f04x29: IMM.AutoROMPromotion=Enabled
c910f04x29: IMM.PowerOnAtSpecifiedTime=0:0:0:0:0
c910f04x29: IMM.ShutdownAndPowerOff=WD:HH:MM
c910f04x29: IMM.PowerOnServer=WD:HH:MM
c910f04x29: IMM.ShutdownAndRestart=WD:HH:MM
c910f04x29: IMM.PXE_NextBootEnabled=Disabled
c910f04x29: IMM.TimeZone=GMT+0:00
c910f04x29: IMM.DST=Off
c910f04x29: IMM.IMMInfo_Name=
c910f04x29: IMM.IMMInfo_Contact=
c910f04x29: IMM.IMMInfo_Location=
c910f04x29: IMM.IMMInfo_RoomId=
c910f04x29: IMM.IMMInfo_RackId=
c910f04x29: IMM.IMMInfo_Lowest_U=0
c910f04x29: IMM.IMMInfo_Height=2
c910f04x29: IMM.IMMInfo_BladeBay=0
c910f04x29: IMM.OSWatchdog=Disabled
c910f04x29: IMM.LoaderWatchdog=Disabled
c910f04x29: IMM.PowerOffDelay=0.5
c910f04x29: IMM.NTPAutoSynchronization=Disabled
c910f04x29: IMM.NTPHost1=
c910f04x29: IMM.NTPHost2=
c910f04x29: IMM.NTPHost3=
c910f04x29: IMM.NTPHost4=
c910f04x29: IMM.NTPFrequency=3
c910f04x29: IMM.LoginId.1=USERID
c910f04x29: IMM.LoginId.10=
c910f04x29: IMM.LoginId.11=
c910f04x29: IMM.LoginId.12=
c910f04x29: IMM.LoginId.2=
c910f04x29: IMM.LoginId.3=
c910f04x29: IMM.LoginId.4=
c910f04x29: IMM.LoginId.5=
c910f04x29: IMM.LoginId.6=
c910f04x29: IMM.LoginId.7=
c910f04x29: IMM.LoginId.8=
c910f04x29: IMM.LoginId.9=
c910f04x29: IMM.AuthorityLevel.1=Supervisor
c910f04x29: IMM.UserAccountManagementPriv.1=No
c910f04x29: IMM.RemoteConsolePriv.1=No
c910f04x29: IMM.RemoteConsoleDiskPriv.1=No
c910f04x29: IMM.RemotePowerPriv.1=No
c910f04x29: IMM.ClearEventLogPriv.1=No
c910f04x29: IMM.BasicAdapterConfigPriv.1=No
c910f04x29: IMM.AdapterConfigNetworkSecurityPriv.1=No
c910f04x29: IMM.AdvancedAdapterConfigPriv.1=No
c910f04x29: IMM.SNMPv3_AuthenticationProtocol.1=NONE
c910f04x29: IMM.SNMPv3_PrivacyProtocol.1=NONE
c910f04x29: IMM.SNMPv3_AccessType.1=Get
c910f04x29: IMM.SNMPv3_TrapHostname.1=
c910f04x29: IMM.User_Authentication_Method=Local only
c910f04x29: IMM.WebTimeout=20 minutes
c910f04x29: IMM.AccountSecurity=Legacy security settings
c910f04x29: IMM.LockoutPeriod=2
c910f04x29: IMM.LoginPassword=Disabled
c910f04x29: IMM.PasswordReuse=Disabled
c910f04x29: IMM.ComplexPassword=Disabled
c910f04x29: IMM.PasswordAge=0
c910f04x29: IMM.MinPasswordLen=0
c910f04x29: IMM.PwChangeInterval=0
c910f04x29: IMM.PwMaxFailure=5
c910f04x29: IMM.PwDiffChar=0
c910f04x29: IMM.DefPasswordExp=Disabled
c910f04x29: IMM.FirstAccessPwChange=Disabled
c910f04x29: IMM.RetryLimit=5 times
c910f04x29: IMM.EntriesDelay=0.5 minutes
c910f04x29: IMM.RetryDelay=0.5 minutes
c910f04x29: IMM.SNMPAlerts_CriticalAlert=Disabled
c910f04x29: IMM.SNMPAlerts_WarningAlert=Disabled
c910f04x29: IMM.SNMPAlerts_SystemAlert=Disabled
c910f04x29: IMM.SNMPAlerts_CriticalAlertCategory=none
c910f04x29: IMM.SNMPAlerts_WarningAlertCategory=none
c910f04x29: IMM.SNMPAlerts_SystemAlertCategory=none
c910f04x29: IMM.SerialRedirectionCLIMode1=CLI active / user defined keystroke sequences
c910f04x29: IMM.SerialExitCLIKeySequence=^[(
c910f04x29: IMM.SerialBaudRate=115200
c910f04x29: IMM.CIMOverHTTPPort=5988
c910f04x29: IMM.CIMOverHTTPSPort=5989
c910f04x29: IMM.HTTPPort=80
c910f04x29: IMM.SSLPort=443
c910f04x29: IMM.TelnetPort=23
c910f04x29: IMM.SSHPort=22
c910f04x29: IMM.SNMP_AgentPort=161
c910f04x29: IMM.SNMP_TrapPort=162
c910f04x29: IMM.RemoteConsolePort=3900
c910f04x29: IMM.HttpPortControl=Open
c910f04x29: IMM.HttpsPortControl=Open
c910f04x29: IMM.CIMOverHttpPortControl=Open
c910f04x29: IMM.CIMOverHttpsPortControl=Open
c910f04x29: IMM.TelnetLegacyPortControl=Open
c910f04x29: IMM.SSHLegacyPortControl=Open
c910f04x29: IMM.RemotePresencePortControl=Open
c910f04x29: IMM.SNMPAgentPortControl=Open
c910f04x29: IMM.SLPPortControl=Open
c910f04x29: IMM.FTPDataPortControl=Open
c910f04x29: IMM.FTPServerPortControl=Open
c910f04x29: IMM.SFTPPortControl=Open
c910f04x29: IMM.IMMFTPServerPortControl=Open
c910f04x29: IMM.IMMDebugPortControl=Open
c910f04x29: IMM.DHCPClientPortControl=Open
c910f04x29: IMM.DHCPBootPCClientPortControl=Open
c910f04x29: IMM.CMMIPMIPortControl=Open
c910f04x29: IMM.LanOverUsb=Enabled
c910f04x29: IMM.LanOverUsbIMMIP=169.254.95.118
c910f04x29: IMM.LanOverUsbIMMNetmask=255.255.0.0
c910f04x29: IMM.LanOverUsbHostIP=169.254.95.120
c910f04x29: IMM.PortForwarding=Enabled
c910f04x29: IMM.Network1=Enabled
c910f04x29: IMM.DHCP1=Disabled
c910f04x29: IMM.HostName1=IMM2-3440b5b9b94a
c910f04x29: IMM.HostIPAddress1=10.4.29.2
c910f04x29: IMM.HostIPSubnet1=255.0.0.0
c910f04x29: IMM.GatewayIPAddress1=10.0.0.101
c910f04x29: IMM.SharedNicMode=Dedicated
c910f04x29: IMM.FailoverMode=SharedOption_1
c910f04x29: IMM.NetworkSettingSync=Disabled
c910f04x29: IMM.DHCPAssignedHostname=IMM2-3440b5b9b94a
c910f04x29: IMM.DHCPAssignedHostIP1=10.4.29.2
c910f04x29: IMM.DHCPAssignedGateway1=10.0.0.101
c910f04x29: IMM.DHCPAssignedNetMask1=255.0.0.0
c910f04x29: IMM.DHCPAssignedDomainName=
c910f04x29: IMM.DHCPAssignedPrimaryDNS1=0.0.0.0
c910f04x29: IMM.DHCPAssignedSecondaryDNS1=0.0.0.0
c910f04x29: IMM.DHCPAssignedTertiaryDNS1=0.0.0.0
c910f04x29: IMM.IPv6Network1=Enabled
c910f04x29: IMM.IPv6Static1=Disabled
c910f04x29: IMM.IPv6DHCP1=Enabled
c910f04x29: IMM.IPv6Stateless1=Enabled
c910f04x29: IMM.IPv6HostIPAddressWithPrefix1=::/64
c910f04x29: IMM.IPv6GatewayIPAddress1=::
c910f04x29: IMM.IPv6LinkLocalIPAddress1=fe80::3640:b5ff:feb9:b94a/64
c910f04x29: IMM.IPv6StatelessIPAddress1=::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0 ::/0
c910f04x29: IMM.IPv6StatelessGateway1=::
c910f04x29: IMM.IPv6DHCPAssignedHostIP1=::
c910f04x29: IMM.IPv6DHCPAssignedDomainName=
c910f04x29: IMM.IPv6DHCPAssignedPrimaryDNS1=::
c910f04x29: IMM.IPv6DHCPAssignedSecondaryDNS1=::
c910f04x29: IMM.IPv6DHCPAssignedTertiaryDNS1=::
c910f04x29: IMM.AutoNegotiate1=Yes
c910f04x29: IMM.LANDataRate1=Auto
c910f04x29: IMM.Duplex1=Auto
c910f04x29: IMM.MTU1=1500
c910f04x29: IMM.MACAddress1=34:40:b5:b9:b9:4a
c910f04x29: IMM.BurnedInMacAddress=34:40:b5:b9:b9:4a
c910f04x29: IMM.SNMPv1Agent=Disabled
c910f04x29: IMM.SNMPv3Agent=Disabled
c910f04x29: IMM.SNMPTraps=Disabled
c910f04x29: IMM.Community_Name.1=
c910f04x29: IMM.Community_Name.2=
c910f04x29: IMM.Community_Name.3=
c910f04x29: IMM.Community_AccessType.1=Get
c910f04x29: IMM.Community_AccessType.2=Get
c910f04x29: IMM.Community_AccessType.3=Get
c910f04x29: IMM.Community_HostIPAddress1.1=
c910f04x29: IMM.Community_HostIPAddress1.2=
c910f04x29: IMM.Community_HostIPAddress1.3=
c910f04x29: IMM.Community_HostIPAddress2.1=
c910f04x29: IMM.Community_HostIPAddress2.2=
c910f04x29: IMM.Community_HostIPAddress2.3=
c910f04x29: IMM.Community_HostIPAddress3.1=
c910f04x29: IMM.Community_HostIPAddress3.2=
c910f04x29: IMM.Community_HostIPAddress3.3=
c910f04x29: IMM.DNS_Enable=Disabled
c910f04x29: IMM.DNSPreference=IPv6
c910f04x29: IMM.DNS_IP_Address1=0.0.0.0
c910f04x29: IMM.DNS_IP_Address2=0.0.0.0
c910f04x29: IMM.DNS_IP_Address3=0.0.0.0
c910f04x29: IMM.IPv6DNS_IP_Address1=::
c910f04x29: IMM.IPv6DNS_IP_Address2=::
c910f04x29: IMM.IPv6DNS_IP_Address3=::
c910f04x29: IMM.DDNS_Enable=Enabled
c910f04x29: IMM.DDNSPreference=DHCP
c910f04x29: IMM.Custom_Domain=
c910f04x29: IMM.TelnetSessions=2
c910f04x29: IMM.SMTP_ServerName=0.0.0.0
c910f04x29: IMM.SMTP_Port=25
c910f04x29: IMM.SMTP_Authentication=Disabled
c910f04x29: IMM.SMTP_UserName=
c910f04x29: IMM.SMTP_AuthMethod=CRAM-MD5
c910f04x29: IMM.Select_LDAP_Servers=Use Pre-Configured LDAP Servers
c910f04x29: IMM.Search_Domain=
c910f04x29: IMM.LDAP_Server1_HostName_IPAddress=0.0.0.0
c910f04x29: IMM.LDAP_Server1_Port=389
c910f04x29: IMM.LDAP_Server2_HostName_IPAddress=
c910f04x29: IMM.LDAP_Server2_Port=389
c910f04x29: IMM.LDAP_Server3_HostName_IPAddress=
c910f04x29: IMM.LDAP_Server3_Port=389
c910f04x29: IMM.LDAP_Server4_HostName_IPAddress=
c910f04x29: IMM.LDAP_Server4_Port=389
c910f04x29: IMM.SKR_Server1_HostName_IPAddress=
c910f04x29: IMM.SKR_Server1_Port=
c910f04x29: IMM.SKR_Server2_HostName_IPAddress=
c910f04x29: IMM.SKR_Server2_Port=
c910f04x29: IMM.SKR_Server3_HostName_IPAddress=
c910f04x29: IMM.SKR_Server3_Port=
c910f04x29: IMM.SKR_Server4_HostName_IPAddress=
c910f04x29: IMM.SKR_Server4_Port=
c910f04x29: IMM.SKR_DEVICE_GROUP=
c910f04x29: IMM.Root_DN=
c910f04x29: IMM.UID_Search=sAMAccountName
c910f04x29: IMM.BindingMethod=Anonymous Bind
c910f04x29: IMM.ClientDN=
c910f04x29: IMM.RoleBasedSecurity=Disabled
c910f04x29: IMM.ServerTargetName=
c910f04x29: IMM.GroupFilter=
c910f04x29: IMM.Group_Search_Attribute=memberOf
c910f04x29: IMM.AuthorizationMethod=authorization will be done in LDAP Server.
c910f04x29: IMM.Forest_Name=
c910f04x29: IMM.Login_Permission_Attribute=
c910f04x29: IMM.SSL_Server_Enable=Disabled
c910f04x29: IMM.CIMXMLOverHTTPS_Enable=Disabled
c910f04x29: IMM.SSL_Client_Enable=Disabled
c910f04x29: IMM.SSH_Enable=Enabled
c910f04x29: IMM.VMMigration_EventCategoryType=none
c910f04x29: IMM.VMMigration_EventCategory=custom:memo|cool
c910f04x29: SYSTEM_PROD_DATA.SysInfoProdName=xxxx
c910f04x29: SYSTEM_PROD_DATA.SysInfoProdIdentifier=IBM System X iDataPlex dx360 M4 Server
c910f04x29: SYSTEM_PROD_DATA.SysInfoSerialNum=zzzzz
c910f04x29: SYSTEM_PROD_DATA.SysInfoUUID=zzzz
c910f04x29: SYSTEM_PROD_DATA.SysInfoUDI=
c910f04x29: SYSTEM_PROD_DATA.SysEncloseAssetTag=
c910f04x29: BootOrder.BootOrder=Red Hat Enterprise Linux=sles-secureboot=CentOS=Red Hat Enterprise Linux 6=SUSE Linux Enterprise Server 11 SP2 - B - Boot0014=SUSE Linux Enterprise Server 11 SP2 - B - Boot0013=SUSE Linux Enterprise Server 11 SP2 - B - Boot0012=SUSE Linux Enterprise Server 11 SP2 - B - Boot0010=SUSE Linux Enterprise Server 11 SP2 - B - Boot000F=SUSE Linux Enterprise Server 11 SP2 - B - Boot000E=SUSE Linux Enterprise Server 11 SP2 - B - Boot000D=SUSE Linux Enterprise Server 11 SP2 - B - Boot000C=SUSE Linux Enterprise Server 11 SP2 - B - Boot000B=SUSE Linux Enterprise Server 11 SP2 - B - Boot000A=SUSE Linux Enterprise Server 11 SP2 - B - Boot0009=SUSE Linux Enterprise Server 11 SP2 - B - Boot0008=SUSE Linux Enterprise Server 11 SP2 - B - Boot0007=SUSE Linux Enterprise Server 11 SP2 - B - Boot0006=Legacy Only=CD/DVD Rom=Floppy Disk=Hard Disk 0=PXE Network
c910f04x29: BootOrder.WolBootOrder=PXE Network=CD/DVD Rom=Hard Disk 0
c910f04x29: PXE.NicPortMacAddress.1=xxxxx
c910f04x29: PXE.NicPortMacAddress.2=xxxxx
c910f04x29: PXE.NicPortPxeMode.1=UEFI and Legacy Support
c910f04x29: PXE.NicPortPxeMode.2=UEFI and Legacy Support
c910f04x29: PXE.NicPortPxeProtocol.1=IPV4
c910f04x29: PXE.NicPortPxeProtocol.2=IPV4
c910f04x29: POSTAttempts.POSTAttemptsLimit=3
c910f04x29: Processors.TurboMode=Disable
c910f04x29: Processors.ProcessorPerformanceStates=Enable
c910f04x29: Processors.C-States=Enable
c910f04x29: Processors.PackageACPIC-StateLimit=ACPI C3
c910f04x29: Processors.C1EnhancedMode=Enable
c910f04x29: Processors.Hyper-Threading=Enable
c910f04x29: Processors.ExecuteDisableBit=Enable
c910f04x29: Processors.IntelVirtualizationTechnology=Enable
c910f04x29: Processors.HardwarePrefetcher=Enable
c910f04x29: Processors.AdjacentCachePrefetch=Enable
c910f04x29: Processors.DCUStreamerPrefetcher=Enable
c910f04x29: Processors.DCUIPPrefetcher=Enable
c910f04x29: Processors.DirectCacheAccessDCA=Disable
c910f04x29: Processors.CoresinCPUPackage=All
c910f04x29: Processors.QPILinkFrequency=Max Performance
c910f04x29: Processors.PA7Interleave=Auto
c910f04x29: Memory.MemoryMode=Independent
c910f04x29: Memory.MemorySpeed=Max Performance
c910f04x29: Memory.MemoryPowerManagement=Disable
c910f04x29: Memory.SocketInterleave=NUMA
c910f04x29: Memory.PatrolScrub=Disable
c910f04x29: Memory.PagePolicy=Adaptive
c910f04x29: Memory.MemoryRefresh=1x
c910f04x29: Memory.DIMM1onProcessor1=Enable
c910f04x29: Memory.DIMM2onProcessor1=Enable
c910f04x29: Memory.DIMM3onProcessor1=Enable
c910f04x29: Memory.DIMM4onProcessor1=Enable
c910f04x29: Memory.DIMM5onProcessor1=Enable
c910f04x29: Memory.DIMM6onProcessor1=Enable
c910f04x29: Memory.DIMM7onProcessor1=Enable
c910f04x29: Memory.DIMM8onProcessor1=Enable
c910f04x29: Memory.DIMM9onProcessor2=Enable
c910f04x29: Memory.DIMM10onProcessor2=Enable
c910f04x29: Memory.DIMM11onProcessor2=Enable
c910f04x29: Memory.DIMM12onProcessor2=Enable
c910f04x29: Memory.DIMM13onProcessor2=Enable
c910f04x29: Memory.DIMM14onProcessor2=Enable
c910f04x29: Memory.DIMM15onProcessor2=Enable
c910f04x29: Memory.DIMM16onProcessor2=Enable
c910f04x29: DevicesandIOPorts.ConfiguretheonboardSATAportsas=AHCI
c910f04x29: DevicesandIOPorts.ActiveVideo=Onboard Device
c910f04x29: DevicesandIOPorts.PCIExpressNativeControl=Enable
c910f04x29: DevicesandIOPorts.PCI64-BitResourceAllocation=Disable
c910f04x29: DevicesandIOPorts.MMConfigBase=2GB
c910f04x29: DevicesandIOPorts.COMPort1=Enable
c910f04x29: DevicesandIOPorts.COMPort2=Enable
c910f04x29: DevicesandIOPorts.RemoteConsole=Enable
c910f04x29: DevicesandIOPorts.SerialPortSharing=Enable
c910f04x29: DevicesandIOPorts.SerialPortAccessMode=Dedicated
c910f04x29: DevicesandIOPorts.SPRedirection=Disable
c910f04x29: DevicesandIOPorts.LegacyOptionROMDisplay=COM Port 1
c910f04x29: DevicesandIOPorts.Com1BaudRate=115200
c910f04x29: DevicesandIOPorts.Com1DataBits=8
c910f04x29: DevicesandIOPorts.Com1Parity=None
c910f04x29: DevicesandIOPorts.Com1StopBits=1
c910f04x29: DevicesandIOPorts.Com1TerminalEmulation=ANSI
c910f04x29: DevicesandIOPorts.Com1ActiveAfterBoot=Enable
c910f04x29: DevicesandIOPorts.Com1FlowControl=Hardware
c910f04x29: DevicesandIOPorts.Com2BaudRate=115200
c910f04x29: DevicesandIOPorts.Com2DataBits=8
c910f04x29: DevicesandIOPorts.Com2Parity=None
c910f04x29: DevicesandIOPorts.Com2StopBits=1
c910f04x29: DevicesandIOPorts.Com2TerminalEmulation=ANSI
c910f04x29: DevicesandIOPorts.Com2ActiveAfterBoot=Disable
c910f04x29: DevicesandIOPorts.Com2FlowControl=Disable
c910f04x29: DevicesandIOPorts.Mezz_CardPCIeSpeed=Gen3
c910f04x29: DevicesandIOPorts.SLOT1PCIeSpeed=Gen3
c910f04x29: DevicesandIOPorts.SLOT2PCIeSpeed=Gen3
c910f04x29: DevicesandIOPorts.Ethernet1LEGACYOPROM=Enable
c910f04x29: DevicesandIOPorts.Ethernet2LEGACYOPROM=Enable
c910f04x29: DevicesandIOPorts.VideoLEGACYOPROM=Enable
c910f04x29: DevicesandIOPorts.Mezz_CardLEGACYOPROM=Enable
c910f04x29: DevicesandIOPorts.SLOT1LEGACYOPROM=Enable
c910f04x29: DevicesandIOPorts.SLOT2LEGACYOPROM=Enable
c910f04x29: DevicesandIOPorts.Ethernet1UEFIOPROM=Enable
c910f04x29: DevicesandIOPorts.Ethernet2UEFIOPROM=Enable
c910f04x29: DevicesandIOPorts.VideoUEFIOPROM=Enable
c910f04x29: DevicesandIOPorts.Mezz_CardUEFIOPROM=Enable
c910f04x29: DevicesandIOPorts.SLOT1UEFIOPROM=Enable
c910f04x29: DevicesandIOPorts.SLOT2UEFIOPROM=Enable
c910f04x29: DevicesandIOPorts.Ethernet=Enable
c910f04x29: DevicesandIOPorts.Video=Enable
c910f04x29: DevicesandIOPorts.Mezz_Card=Enable
c910f04x29: DevicesandIOPorts.SLOT1=Enable
c910f04x29: DevicesandIOPorts.SLOT2=Enable
c910f04x29: DevicesandIOPorts.SetOptionROMExecutionOrder=Ethernet 2=Video=Mezz_Card=SLOT2=Ethernet 1=SLOT1
c910f04x29: DevicesandIOPorts.GpuAutoEnabling64BitResource=Enable
c910f04x29: Power.ActiveEnergyManager=Capping Enabled
c910f04x29: Power.S3Enable=Enable
c910f04x29: Power.PowerPerformanceBias=Platform Controlled
c910f04x29: Power.PlatformControlledType=Maximum Performance
c910f04x29: Power.WorkloadConfiguration=I/O sensitive
c910f04x29: Power.CPUCoolingBoost=0
c910f04x29: Power.SystemCoolingBoost=Normal
c910f04x29: BootModes.SystemBootMode=UEFI and Legacy
c910f04x29: BootModes.OptimizedBoot=Disable
c910f04x29: BootModes.QuietBoot=Enable
c910f04x29: OperatingModes.ChooseOperatingMode=Custom Mode
c910f04x29: LegacySupport.ForceLegacyVideoonBoot=Enable
c910f04x29: LegacySupport.RehookINT19h=Disable
c910f04x29: LegacySupport.LegacyThunkSupport=Enable
c910f04x29: LegacySupport.InfiniteBootRetry=Disable
c910f04x29: LegacySupport.BBSBoot=Enable
c910f04x29: SystemSecurity.TPMDevice=Enable
c910f04x29: SystemSecurity.TPMState=Activate
c910f04x29: SystemSecurity.TXTState=Disable
c910f04x29: SystemSecurity.MORState=Disable
c910f04x29: SystemRecovery.POSTWatchdogTimer=Disable
c910f04x29: SystemRecovery.POSTWatchdogTimerValue=5
c910f04x29: SystemRecovery.RebootSystemonNMI=Enable
c910f04x29: SystemRecovery.HaltOnSevereError=Disable
c910f04x29: BackupBankManagement.BackupBankManagementMethod=User Managed
c910f04x29: DiskGPTRecovery.DiskGPTRecovery=Automatic
c910f04x29: iSCSI.InitiatorName=iqn.1986-03.com.example
yardan11 commented 6 years ago

Hi

It is normal for the OFED step to take a few minutes; that is simply its install time. The log I supplied was taken in the state where the node had finished the install but had not rebooted to continue with the postbootscripts.

immarvin commented 6 years ago

hi @yardan11 , judging from the xcat.log you provided, the postscripts do not hang.

the osimage used to provision the node involves some customization. Is it possible to provide the customized /install/custom/install/rh/compute.rhels6.tmpl? thanks

Object name: rhels6.5-x86_64-install-compute
imagetype=linux
osarch=x86_64
osdistroname=rhels6.5-x86_64
osname=Linux
osvers=rhels6.5
otherpkgdir=/install/post/otherpkgs/rhels6.5/x86-64
otherpkglist=/install/custom/install/rh/compute.rhels6.5.otherpkgs.pkglist
pkgdir=/install/rhels6.5/x86_64
pkglist=/install/custom/install/rh/compute.rhels6.5.x86_64.pkglist
profile=compute
provmethod=install
template=/install/custom/install/rh/compute.rhels6.tmpl
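
When several definitions need to be compared (for example, to confirm which template an osimage points at, or whether an attribute such as nodestatus is present at all), the stanza above can be read into a dictionary. This is a rough sketch that assumes the output has exactly the shape quoted in this thread: an "Object name:" header followed by key=value lines. The function name `parse_lsdef` is hypothetical, and other lsdef output modes are not handled.

```python
def parse_lsdef(text):
    """Parse 'Object name:' stanzas into {object_name: {attribute: value}}."""
    objects = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Object name:"):
            current = line.split(":", 1)[1].strip()
            objects[current] = {}
        elif "=" in line and current is not None:
            # Split on the first '=' only, so values may contain '='.
            key, value = line.split("=", 1)
            objects[current][key.strip()] = value.strip()
    return objects

# Part of the osimage definition quoted in this thread.
SAMPLE = """\
Object name: rhels6.5-x86_64-install-compute
imagetype=linux
osvers=rhels6.5
provmethod=install
template=/install/custom/install/rh/compute.rhels6.tmpl
"""

if __name__ == "__main__":
    attrs = parse_lsdef(SAMPLE)["rhels6.5-x86_64-install-compute"]
    print(attrs["template"])
    print("nodestatus" in attrs)
```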
robin2008 commented 6 years ago

@yardan11 Do you have any remote mount (NFS or GPFS) in your postscripts?

Also, can anything unusual be seen on the console when the reboot is triggered?

yardan11 commented 6 years ago

Hi, I found the github thread regarding our issue and noticed a few questions there that you didn't send to me yet, so I will just send you the answers to save time.

  1. compute.rhels6.tmpl contents:
    
    lang en_US
    #KICKSTARTNET#
    # xCAT url web repository
    #url --url http://#TABLE:noderes:$NODE:nfsserver#/install/#TABLE:nodetype:$NODE:os#/#TABLE:nodetype:$NODE:arch#
    %include /tmp/repos
    keyboard "us"
    zerombr
    #clearpart --all --initlabel
    clearpart --drives=sda,sdb
    key --skip
    #text
    cmdline
    part / --fstype=ext4 --size=167936 --ondisk=sda --label=ROOT
    part /boot/efi --fstype=efi --size=537 --ondisk=sda
    part /dump --fstype=ext4 --size=20480 --ondisk=sda --label=DUMP
    #part swap --size=4096 --ondisk=sda
    part /local --fstype=ext4 --size=102400 --grow --ondisk=sdb --label=LOCAL
    #bootloader 
    #KICKSTARTBOOTLOADER#
    install
    firewall --disabled
    timezone --utc "#TABLE:site:key=timezone:value#"
    skipx
    rootpw --iscrypted #CRYPT:passwd:key=system,username=root:password#
    #auth --useshadow --enablemd5
    auth --useshadow --passalgo=sha512
    selinux --disabled
    reboot
    %packages
    #INCLUDE_DEFAULT_PKGLIST#
    %end
    %pre
    {
    echo "Running Kickstart Pre-Installation script..."
    #INCLUDE:#ENV:XCATROOT#/share/xcat/install/scripts/pre.rh#
    dd bs=512 count=10 if=/dev/zero of=/dev/sda
    dd bs=512 count=10 if=/dev/zero of=/dev/sdb
    parted --script /dev/sda mklabel gpt
    parted --script /dev/sdb mklabel gpt
    parted --script /dev/sda print
    parted --script /dev/sdb print
    sleep 30
    } >>/tmp/pre-install.log 2>&1
    %end
    %post 
    #exec < /dev/tty6 > /dev/tty6 2>/dev/tty6
    #chvt 6
    ## ##INCLUDE:#ENV:XCATROOT#/share/xcat/install/scripts/post.rh##
    #chvt 1
    mkdir -p /var/log/xcat/
    {
    cat >> /var/log/xcat/xcat.log << "EOF"
    %include /tmp/pre-install.log
    EOF
    echo "Running Kickstart Post-Installation script..."
    #INCLUDE:#ENV:XCATROOT#/share/xcat/install/scripts/post.rh#
    } >>/var/log/xcat/xcat.log 2>&1
    %end

  2. We checked the console during the hang and there is nothing interesting. We also had a look on the node's screen during the hang (checked several ttys) but didn't find anything interesting.

immarvin commented 6 years ago

hi @yardan11 , I cannot find any clue in the customized template.

would you please provide the kickstart file /install/autoinst/<node> for the node with site.xcatdebugmode both disabled and enabled, using these steps:

  1. disable xcatdebugmode, then run nodeset <node> osimage and save /install/autoinst/<node>
  2. enable xcatdebugmode, then run nodeset <node> osimage and save /install/autoinst/<node>

you can send the 2 files to my mail (we talked on ST before, so you can find my mail address)

thx
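
Once both copies of the generated kickstart file have been saved, the comparison itself can be done with Python's standard difflib module. This is only a sketch: the function name `diff_autoinst` and the labels "no_debug" and "debug" are placeholders, and the two-line inputs stand in for the real file contents.

```python
import difflib

def diff_autoinst(no_debug_text, debug_text):
    """Return unified-diff lines between the two generated kickstart files."""
    return list(difflib.unified_diff(
        no_debug_text.splitlines(),
        debug_text.splitlines(),
        fromfile="no_debug",
        tofile="debug",
        lineterm="",
    ))

# Tiny stand-ins for the two saved files; real input would be read from disk.
NO_DEBUG = "line one\n#DISABLED# line two\n"
DEBUG = "line one\nline two\n"

if __name__ == "__main__":
    for line in diff_autoinst(NO_DEBUG, DEBUG):
        print(line)
```

Lines that start with "-" exist only in the copy generated with debug mode off, and lines that start with "+" only in the copy generated with debug mode on.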

yardan11 commented 6 years ago

Hi, sorry about the delay. Both files are attached - "debug" is the autoinst of the node with xcatdebugmode=1, and "no_debug" is the autoinst of the node with xcatdebugmode=0. debug.txt no_debug.txt

We have inspected both files thoroughly and we have a breakthrough. In both files there is a section that handles messages received from a socket called newSocket, if they start with "sh". See line 231 in both files.

This code is commented out when debug mode is off, but not when debug mode is on. That was very suspicious to us, although we don't really know what those lines do. So we tried the following: we made sure that debug mode is OFF (and as we mentioned in previous messages, the node does not reboot when debug mode is off). Then we ran "nodeset", edited the autoinst file and removed the comment from these lines, so that they run even though debug mode is off. We did rpower boot on the node and waited. And it worked: the node did reboot and the installation did complete!

yardan11 commented 6 years ago

Also, during the install with debug off, the node finishes all postscripts. After the last postscript has run, ps -ef | grep -i xcat returns:

/usr/bin/python /usr/bin/anaconda --stage2 http://xcat:80/install/rhels6.5/x86_64/images/install.img -dlabel --kickstart /tmp/ks.cfg --serial -T -C --selinux --lang en_US --keymap us --repo http://xcat:80/install/rhel6.5/x86_64

[xcatflowrequest]

immarvin commented 6 years ago

hi @yardan11 , interesting find. The new socket is for debugging purposes: it accepts commands from the MN, runs them inside the anaconda installer, and sends back the command output. But I cannot see why the following commented-out lines would prevent the installation from rebooting, since the socket is open whether or not the lines are commented out and, in our experience, the background process "foo.py" is killed on reboot.

#UNCOMMENTOENABLEDEBUGPORT#         if(command[0] == "sh"): #DEBUG purposes only, wide open root priv command here.
#UNCOMMENTOENABLEDEBUGPORT#             newcommand = ""
#UNCOMMENTOENABLEDEBUGPORT#             for i in command[1:]:
#UNCOMMENTOENABLEDEBUGPORT#                 newcommand = newcommand + i + " "
#UNCOMMENTOENABLEDEBUGPORT#             output = os.popen(newcommand).read()
#UNCOMMENTOENABLEDEBUGPORT#             newSocket.send(output)
#UNCOMMENTOENABLEDEBUGPORT#             break
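
The manual edit reported earlier in the thread (removing the comment from these lines in the generated autoinst file while debug mode is off) can be reproduced with a few lines of Python. This is a sketch under one assumption: that stripping the marker prefix shown above is equivalent to what was done by hand. The function name `enable_debug_port` is hypothetical. As the comment in the quoted code says, the enabled handler runs arbitrary commands with root privileges, so this is suitable for diagnosis only.

```python
MARKER = "#UNCOMMENTOENABLEDEBUGPORT#"

def enable_debug_port(kickstart_text):
    """Strip the marker prefix so the guarded lines become live code."""
    out = []
    for line in kickstart_text.splitlines():
        if line.startswith(MARKER):
            # Keep the indentation that follows the marker unchanged.
            line = line[len(MARKER):]
        out.append(line)
    return "\n".join(out)

# Two guarded lines copied from the thread, after one unguarded stand-in line.
SAMPLE = (
    "unguarded line\n"
    '#UNCOMMENTOENABLEDEBUGPORT#         if(command[0] == "sh"):\n'
    '#UNCOMMENTOENABLEDEBUGPORT#             newcommand = ""\n'
)

if __name__ == "__main__":
    print(enable_debug_port(SAMPLE))
```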
itziksc commented 6 years ago

Hi,

I will be replacing @yardan11 on this issue and will keep it updated.

Here is the latest information, attached:

Thanks, Itzik netstatnlp.txt psaux.txt

immarvin commented 6 years ago

hi @itziksc , thanks for the info. ps, you can contact me thru ST (Song BJ Yang)

immarvin commented 6 years ago

hi @itziksc , I discussed this with the team, and we cannot find any hint of the real cause. Would you please turn off xcatdebugmode and skip the following customized postscripts during provisioning (remove them from osimage.postscripts, or from the postscripts of the group), then check whether the issue can be recreated? If not, the issue might be in these customized user scripts.


Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_install_date
Tue Jan 9 10:21:05 IST 2018 postscript tv_install_date return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_hardents
Tue Jan 9 10:21:05 IST 2018 postscript tv_hardents return with 0
Tue Jan 9 10:21:05 IST 2018 Running postscript: tv_click_hpc_nodes
Tue Jan 9 10:21:17 IST 2018 postscript tv_click_hpc_nodes return with 0
Tue Jan 9 10:21:17 IST 2018 Running postscript: tv_mlnx_ofed
Tue Jan 9 10:25:23 IST 2018 postscript tv_mlnx_ofed return with 0
Tue Jan 9 10:25:23 IST 2018 Running postscript: tv_setupservices
Tue Jan 9 10:25:23 IST 2018 postscript tv_setupservices return with 0
itziksc commented 6 years ago

Hi @immarvin , here is the latest update from the customer:

  1. Yes, it is from inside the installer. I searched all the postscripts and postbootscripts, and the word "xcatd" is NOT mentioned in any of them.

  2. I have removed the custom postscripts and left the defaults. The installation still hangs.

    The issue also exists with no postscripts at all.

What do you think we should check next? thanks,

itziksc commented 6 years ago

hi @immarvin ,

what else should we check?

thanks, Itzik

immarvin commented 6 years ago

hi @yardan11 , is the issue still there?