xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

install of RHEL8.2.0 fails due to missing environment variables MASTER_IP, MASTER in xcatinstall #6837

Open dombrowa opened 3 years ago

dombrowa commented 3 years ago

The xcatinstall postscript does not fill in the variables MASTER_IP and MASTER when provisioning RHEL8, which starts with post.xcat.ng instead of post.xcat (as RHEL7 does). MASTER_IP is required, for example, by the logger inside msgutil_r; MASTER is required by the updateflag.awk script.

  1. Error noticed in xcat.log: "the network between the node and $MASTER_IP is not ready, please check", where the message contains no value for MASTER_IP.

I added the following code to find out whether the value is available and can be set (it is):

if [ "$MASTER_IP" = "" ];
then
    buf="#ENV:MASTER_IP#"
    if [ "$buf" != "" ]; then
         msg="Setting MASTER_IP from xCAT env ENV:MASTER_IP"
    else
         buf="172.16.16.1"
         msg="Setting MASTER_IP to $buf"
    fi
    export MASTER_IP=$buf
    echo "xcatinstallpost $msg" >> /root/post.xcat.log
fi
RETRY=0
while true; do
    #check whether the network access between MN/CN and the node is ready                                                                                                         
    if ping $MASTER_IP -c 1 >/dev/null ; then
        echo "xcatinstallpost [$RETRY/90] Ping response from $MASTER_IP" >> /root/post.xcat.log
        break
    else
        echo "xcatinstallpost [$RETRY/90] No ping response from $MASTER_IP yet" >> /root/post.xcat.log
    fi

    RETRY=$((RETRY + 1))

    if [ $RETRY -eq 90 ];then
       #timeout, complain and exit                                                                                                                                                
       msgutil_r "$MASTER_IP" "error" "the network between the node and $MASTER_IP is not ready, please check[retry=$RETRY]..." "/var/log/xcat/xcat.log" "$log_label"
       exit 1
    fi

    #sleep sometime before the next scan                                                                                                                                          
    sleep 2
done

Running with the above code (and more) added produces this output:

[root@netsres46 ~]# cat /root/post.xcat.log
...
xcatinstallpost Setting NODE to netsres46 from xCAT env TABLE:nodelist:THISNODE:node
  2. Error: the final installstatus update fails because MASTER is empty. The log shows
    flag update failed
    or
    updateflag.awk: Retrying flag update
    updateflag.awk: Retrying flag update
    updateflag.awk: Retrying flag update

    This code added to xcatinstallpost shows that the value for MASTER is missing as well:

    if [ "$MASTER" = "" ];
    then
    buf="#XCATVAR:XCATMASTER#"
    if [ "$buf" != "" ]; then
         msg="Setting MASTER from xCAT env XCATVAR:XCATMASTER"
    else
         buf="172.16.16.1"
         msg="Setting MASTER to $buf"
    fi
    export MASTER=$buf
    echo "xcatinstallpost $msg" >> /root/post.xcat.log
    fi

    Running with the above code (and more) added will trigger this output

    ...
    xcatinstallpost Setting MASTER from xCAT env XCATVAR:XCATMASTER <<< Error 2 prevention
    ...
cxhong commented 3 years ago

Did you define master in the site table? Can you run xcatprobe xcatmn -i <provision network interface>?

dombrowa commented 3 years ago

Yes, master is defined:

[root@netsres-xcat ~]# tabdump site | grep -i master
"master","172.16.16.1",,

Output from xcatprobe:

[root@netsres-xcat ~]# xcatprobe xcatmn -i eth1
[mn]: Checking all xCAT daemons are running...                                                                    [ OK ]
[mn]: Checking xcatd can receive command request...                                                               [ OK ]
[mn]: Checking 'site' table is configured...                                                                      [ OK ]
[mn]: Checking provision network is configured...                                                                 [ OK ]
[mn]: Checking 'passwd' table is configured...                                                                    [ OK ]
[mn]: Checking important directories(installdir,tftpdir) are configured...                                        [ OK ]
[mn]: Checking SELinux is disabled...                                                                             [ OK ]
[mn]: Checking HTTP service is configured...                                                                      [ OK ]
[mn]: Checking TFTP service is configured...                                                                      [ OK ]
[mn]: Checking DNS service is configured...                                                                       [ OK ]
[mn]: Checking DHCP service is configured...                                                                      [ OK ]
[mn]: Checking NTP service is configured...                                                                       [ OK ]
[mn]: Checking rsyslog service is configured...                                                                   [ OK ]
[mn]: Checking firewall is disabled...                                                                            [ OK ]
[mn]: Checking minimum disk space for xCAT ['/var' needs 1GB;'/install' needs 10GB;'/tmp' needs 1GB]...           [ OK ]
[mn]: Checking Linux ulimits configuration...                                                                     [ OK ]
[mn]: Checking network kernel parameter configuration...                                                          [ OK ]
[mn]: Checking xCAT daemon attributes configuration...                                                            [ OK ]
[mn]: Checking xCAT log is stored in /var/log/xcat/cluster.log...                                                 [WARN]
[mn]: Failed to store MN logs to /var/log/xcat/cluster.log
[mn]: Checking xCAT management node IP: <172.16.16.1> is configured to static...                                  [ OK ]
[mn]: Checking dhcpd.leases file is less than 100M...                                                             [ OK ]
=================================== SUMMARY ====================================
[MN]: Checking on MN...                                                                                           [ OK ]
    Checking xCAT log is stored in /var/log/xcat/cluster.log...                                                   [WARN]
        Failed to store MN logs to /var/log/xcat/cluster.log

I would like to add that every earlier RHEL release still installs fine; only RHEL8.2.0 fails. I also noticed that mypostscript.tmpl does not appear to be used when I run

 rinstall <node>  osimage=rhels8.2.0-x86_64-install-netsres

(precreatemypostscripts is not enabled). During the post tasks of a RHEL7.7 install, I see that mypostscript contains a section starting with

AUDITNOSYSLOG='0'
export AUDITNOSYSLOG
XCATCONFDIR='/etc/xcat'
export XCATCONFDIR
TFTPDIR='/tftpboot'
export TFTPDIR
PPCMAXP='64'
export PPCMAXP
...

ending with

export SNMPPRIV
SNMPAUTH=''
export SNMPAUTH
# postscripts-start-here
# defaults-postscripts-start-here
syslog
remoteshell
syncfiles
# defaults-postscripts-end-here
# osimage-postscripts-start-here
custom/rhels7.7-x86_64-install-netsres/compute.postinstall
# osimage-postscripts-end-here
# node-postscripts-start-here
confignetwork
setroute
# node-postscripts-end-here
# postscripts-end-here
# postbootscripts-start-here
# osimage-postbootscripts-start-here
custom/rhels7.7-x86_64-install-netsres/compute.postboot
# osimage-postbootscripts-end-here
# node-postbootscripts-start-here
syncfiles
console-rev.sh
net-peer-disable.sh
# node-postbootscripts-end-here
# postbootscripts-end-here

This section is not included when installing RHEL8.2, which explains why no postscripts are run, ssh is not configured, and no variables are known.

cxhong commented 3 years ago

Here is the logic that determines MASTER_IP:

 #the logic to determine the $ENV{XCATMASTER} confirm to the following priority(from high to low):
    ## 1, the "xcatmaster" attribute of the node
    ## 2, the ip address of the mn/sn facing the compute node
    ## 3, the site.master
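
The priority order quoted above can be sketched as a "first non-empty value wins" lookup. This is only an illustration, not xCAT's actual implementation: `pick_master` and its three arguments are hypothetical names, and the example call assumes that the address facing the compute node is the same as site.master.

```shell
# Hypothetical helper: print the first non-empty candidate, in the
# priority order quoted above (highest priority first).
pick_master() {
    local node_xcatmaster=$1   # 1. "xcatmaster" attribute of the node
    local facing_ip=$2         # 2. IP of the MN/SN facing the compute node
    local site_master=$3       # 3. site.master
    local candidate
    for candidate in "$node_xcatmaster" "$facing_ip" "$site_master"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # nothing usable was supplied
}

# Example: xcatmaster is empty on the node, so the second candidate is used.
pick_master "" "172.16.16.1" "172.16.16.1"
```

With an empty xcatmaster attribute, the result falls through to the next candidate, so leaving the attribute undefined is not by itself an error under this reading.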

Check the node definition: is xcatmaster defined? Alternatively, run nslookup <nodename> to make sure the IP address can be resolved.

Maybe you can show me the output of lsdef <nodename> and tabdump networks.

dombrowa commented 3 years ago

None of my nodes has xcatmaster defined:

netsres01: xcatmaster=
netsres02: xcatmaster=
netsres03: xcatmaster=
netsres04: xcatmaster=
netsres05: xcatmaster=
netsres06: xcatmaster=
netsres07: xcatmaster=
netsres08: xcatmaster=
netsres09: xcatmaster=
netsres10: xcatmaster=
netsres11: xcatmaster=
netsres12: xcatmaster=
netsres13: xcatmaster=
netsres14: xcatmaster=
netsres15: xcatmaster=
netsres16: xcatmaster=
netsres42: xcatmaster=
netsres42-vm1: xcatmaster=
netsres43: xcatmaster=
netsres44: xcatmaster=
netsres48: xcatmaster=
netsres49: xcatmaster=
netsres50: xcatmaster=
netsres51: xcatmaster=
netsres52: xcatmaster=
netsres54: xcatmaster=
netsres55: xcatmaster=
netsres56: xcatmaster=
netsres57: xcatmaster=
netsres58: xcatmaster=
netsres59: xcatmaster=
netsres60: xcatmaster=
netsres61: xcatmaster=
netsres62: xcatmaster=
netsres63: xcatmaster=
netsres74: xcatmaster=
netsres75: xcatmaster=
netsres76: xcatmaster=
netsres77: xcatmaster=
netsres78: xcatmaster=
netsres79: xcatmaster=
netsres80: xcatmaster=
netsres81: xcatmaster=
netsres82: xcatmaster=
netsres83: xcatmaster=
netsres84: xcatmaster=
netsres85: xcatmaster=
netsres86: xcatmaster=

Must I define this attribute?

  1. What do you mean by "the ip address of the mn/sn facing the compute node"? I have one interface on 172.16.16.0/20 for all nodes. The xCAT master is 172.16.16.1 and each node has a route to it.

  2. [root@netsres-xcat ~]# tabdump site | grep master
     "master","172.16.16.1",,

nodedef:

[root@netsres-xcat ~]# lsdef netsres46
Object name: netsres46
    addkcmdline=inst.sshd kernel.watchdog_thresh=30
    arch=x86_64
    cons=ipmi
    currchain=boot
    currstate=install rhels8.2.0-x86_64-netsres
    groups=all,vm
    ip=172.16.17.46
    mac=52:54:00:4b:2e:38
    mgt=kvm
    netboot=xnba
    nicdevices.br_blue=ens4
    nicdevices.br_green=ens3
    nichostnamesuffixes.br_blue=-blu
    nichostnamesuffixes.br_green=-gre
    nicips.ens3=172.16.17.46
    nicips.br_blue=9.2.156.70
    nicips.br_green=172.16.17.46
    nicnetworks.br_blue=blue
    nicnetworks.br_green=green
    nicnetworks.enp1s0f0=green
    nictypes.br_blue=bridge
    nictypes.ens3=ethernet
    nictypes.ens4=ethernet
    nictypes.enp1s0f0=ethernet
    nictypes.br_green=bridge
    os=rhels8.2.0
    postbootscripts=syncfiles,console-rev.sh,net-peer-disable.sh
    postscripts=syslog,remoteshell,syncfiles,confignetwork,setroute
    power=ipmi
    profile=netsres
    provmethod=rhels8.2.0-x86_64-install-netsres
    routenames=pubrt,greenrt
    serialport=0
    serialspeed=115200
    status=installing
    statustime=09-28-2020 16:14:07
    updatestatus=failed
    updatestatustime=09-26-2020 20:23:09

Networks table:

[root@netsres-xcat ~]# tabdump networks
#netname,net,mask,mgtifname,gateway,dhcpserver,tftpserver,nameservers,ntpservers,logservers,dynamicrange,staticrange,staticrangeincrement,nodehostname,ddnsdomain,vlanid,domain,mtu,comments,disable
"blue","9.2.156.64","255.255.255.192","eth2","9.2.156.65",,"9.2.156.71","9.2.250.86",,,,,,,,,,"1500",,
"green","172.16.16.0","255.255.240.0","eth1","<xcatmaster>",,"172.16.16.1",,,,"172.16.28.1-172.16.31.254",,,,,,,"1500",,
"nickel","172.16.80.0","255.255.240.0","lo0","172.16.80.1",,,,,,,,,,,,,"9000",,
"purple","172.16.32.0","255.255.240.0","eth3","172.16.32.1",,"172.16.32.1",,,,"172.16.44.1-172.16.47.254",,,,,,,"1500",,
"red","172.16.0.0","255.255.240.0","eth0","172.16.0.1",,"172.16.0.1",,,,"172.16.12.1-172.16.15.254",,,,,,,"1500",,
"silver","172.16.96.0","255.255.240.0","lo0","172.16.96.1",,,,,,,,,,,,,"9000",,
"zinc","172.16.48.0","255.255.240.0","lo0","172.16.48.1",,,,,,,,,,,,,"9000",,
"cadmium","172.16.176.0","255.255.240.0","lo0","172.16.176.1",,,,,,,,,,,,,"9000",,
"copper","172.16.144.0","255.255.240.0","lo0","172.16.144.1",,,,,,,,,,,,,"9000",,
"chromium","172.16.160.0","255.255.240.0","lo0","172.16.160.1",,,,,,,,,,,,,"9000",,
"titanium","172.16.192.0","255.255.240.0","lo0","172.16.192.1",,,,,,,,,,,,,"9000",,
"tungsten","172.16.208.0","255.255.240.0","lo0","172.16.208.1",,,,,,,,,,,,,"9000",,
"tantalum","172.16.224.0","255.255.240.0","lo0","172.16.224.1",,,,,,,,,,,,,"9000",,
"gold","172.16.240.0","255.255.240.0","lo0","172.16.240.1",,,,,,,,,,,,,"9000",,
"platinum","172.17.16.0","255.255.240.0","lo0","172.17.16.1",,,,,,,,,,,,,"9000",,
"mercury","172.17.32.0","255.255.240.0","lo0","172.17.32.1",,,,,,,,,,,,,"9000",,
"iridium","172.17.0.0","255.255.240.0","lo0","172.17.0.1",,,,,,,,,,,,,"9000",,
"iron","172.16.64.0","255.255.240.0","lo0","172.16.64.1",,,,,,,,,,,,,"9000",,
"cobalt","172.16.112.0","255.255.240.0","lo0","172.16.112.1",,,,,,,,,,,,,"9000",,
"manganese","172.16.128.0","255.255.240.0","lo0","172.16.128.1",,,,,,,,,,,,,"9000",,
"554","9.2.154.128","255.255.255.192","eth2","9.2.154.130",,"9.2.154.140","9.2.250.86",,,,,,,,,,"1500",,
"192_168_122_0-255_255_255_0","192.168.122.0","255.255.255.0","virbr0","<xcatmaster>",,"<xcatmaster>",,,,,,,,,,,"1500",,
dombrowa commented 3 years ago

Why would RHEL 7.7 install properly, with all variables and postboot/postscripts included in /xcatpost/mypostscript after firstboot, but not RHEL8*?

cxhong commented 3 years ago

Can you show me the output of lsdef -t osimage rhels8.2.0-x86_64-install-netsres?

dombrowa commented 3 years ago
[root@netsres-xcat ~]# lsdef -t osimage rhels8.2.0-x86_64-install-netsres
Object name: rhels8.2.0-x86_64-install-netsres
    imagetype=linux
    osarch=x86_64
    osdistroname=rhels8.2.0-x86_64
    osname=Linux
    osvers=rhels8.2.0
    otherpkglist=/install/custom/rhels8.2.0-x86_64-install-netsres/pkglist-other
    pkgdir=/install/rhels8.2.0/x86_64
    pkglist=/install/custom/rhels8.2.0-x86_64-install-netsres/pkglist
    postbootscripts=custom/rhels8.2.0-x86_64-install-netsres/compute.postboot
    postscripts=custom/rhels8.2.0-x86_64-install-netsres/compute.postinstall
    profile=netsres
    provmethod=install
    synclists=/install/custom/rhels8.2.0-x86_64-install-netsres/synclist
    template=/install/custom/rhels8.2.0-x86_64-install-netsres/compute.rhels8.tmpl
cxhong commented 3 years ago

Everything looks fine to me. If post.xcat.ng does not have MASTER_IP set, then /opt/xcat/share/xcat/install/scripts/pre.rhels8 should not have it either.
Can you check the /install/autoinstall/<nodename> file? It is created after the rinstall command, and MASTER_IP should already be there.
Also, after issuing the rinstall command, run xcatprobe osdeploy -n <nodename>.

dombrowa commented 3 years ago

The file under /install/autoinst/ shows MASTER_IP. There is no /install/autoinstall directory on my xCAT server.

[root@netsres-xcat ~]# grep MASTER_IP /install/autoinst/netsres46|head
export MASTER_IP="172.16.16.1"
msgutil_r "$MASTER_IP" "info" "============deployment starting============" "/var/log/xcat/xcat.log" "$log_label"
msgutil_r "$MASTER_IP" "info" "Running Anaconda Pre-Installation script..." "/var/log/xcat/xcat.log" "$log_label"
msgutil_r "$MASTER_IP" "info" "Detecting install disk..." "/var/log/xcat/xcat.log" "$log_label"
msgutil_r "$MASTER_IP" "info" "Found $instdisk, generate partition file..." "/var/log/xcat/xcat.log" "$log_label"
msgutil_r "$MASTER_IP" "info" "Generate the repository for the installation" "/var/log/xcat/xcat.log" "$log_label"

I am capturing the output of xcatprobe osdeploy -n into a file.

dombrowa commented 3 years ago

I notice that the curl command below does not find the mypostscript. Note: precreatemypostscripts has not been set in the site table (which always worked for other osimages)

[root@netsres-xcat ~]# lsdef -t site -i precreatemypostscripts
Object name: clustersite
    precreatemypostscripts=0
  1. Which step provides /tftpboot/mypostscripts/mypostscript.$NODE ?
  2. Must I set precreatemypostscripts to yes|1 ?
curl --fail --retry 20 --max-time 60 "http://$MASTER_IP:${HTTPPORT}$TFTPDIR/mypostscripts/mypostscript.$NODE" -o "/xcatpost/\
mypostscript.$NODE" 2> /tmp/download.log

Error shows as:

[root@netsres46 ~]# cat /tmp/download.log 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found

The mypostscript.post after the first reboot shows that the section with the exported variables and the postscripts label section, which appear on other nodes when installing RHEL7.7:

# postscripts-start-here
......
# postscripts-end-here

are both missing. Therefore no postscripts were executed

This is what I see on the node

[root@netsres46 ~]# cat /xcatpost/mypostscript.post 

. /xcatpost/xcatlib.sh

# global value to store the running status of the postbootscripts,the value is non-zero if one postbootscript failed
return_value=0

# subroutine used to run postscripts
# $1 argument is the script type
# rest argument is the script name and arguments
run_ps () {
    local ret_local=0
    mkdir -p "/var/log/xcat"
    # On some Linux distro, the rsyslogd daemon write log files with permision
    # other than root:root. And in some case, the directory /var/log/xcat was
    # created by xCAT, and had root:root ownership. In this way, rsyslogd
    # did not have enough permission to write to log files under this directory.
    # As a dirty hack, change the ownership of directory /var/log/xcat to the
    # same ownership of directory /var/log.
    chown root:root "/var/log/xcat"
    local logfile="/var/log/xcat/xcat.log"
    local scriptype=$1
    shift;

    if [ -z "$scriptype" ]; then
        scriptype="postscript"
    fi
    log_label="xcat.deployment."$scriptype
    if [ -f $1 ]; then
        msgutil_r "$MASTER_IP" "info" "Running $scriptype: $1" "$logfile" "$log_label"
        if [ "$XCATDEBUGMODE" = "1" ] || [ "$XCATDEBUGMODE" = "2" ]; then
            local compt=$(file $1)
            local reg="shell script"
            if [[ "$compt" =~ $reg ]]; then
                bash -x ./$@ 2>&1
                ret_local=$?
            else
                ./$@ 2>&1 | logger -t xcat -p debug
                ret_local=${PIPESTATUS[0]}
            fi
        else
            ./$@ 2>&1
            ret_local=${PIPESTATUS[0]}
        fi

        if [ "$ret_local" -ne "0" ]; then
            return_value=$ret_local
        fi
        msgutil_r "$MASTER_IP" "info" "$scriptype $1 return with $ret_local" "$logfile" "$log_label"
    else
        msgutil_r "$MASTER_IP" "error" "$scriptype $1 does NOT exist." "$logfile" "$log_label"
        return_value=-1
    fi
    return 0
}
# subroutine end

echo xcat.deployment [xcatinstallpost] mypostscript.post MASTER_IP=$MASTER_IP XCATDEBUGMODE=0 MASTER=$MASTER >> /root/post.xcat.log
[ -f /opt/xcat/xcatinfo ] && grep 'POSTSCRIPTS_RC=1' /opt/xcat/xcatinfo >/dev/null 2>&1 && return_value=1
env > /root/env.mypostscript.post
set -x
if [ "$return_value" -eq "0" ]; then
    if [ "$XCATDEBUGMODE" = "1" ] || [ "$XCATDEBUGMODE" = "2" ]; then
        msgutil_r "$MASTER_IP" "debug" "node booted, reporting status..." "/var/log/xcat/xcat.log" "$log_label"
    fi
    updateflag.awk $MASTER 3002 "installstatus booted"
    rc=$?
    echo "xcat.deployment [xcatinstallpost] mypostscript.post updateflag.awk $MASTER 3002 \"installstatus booted\" return with $rc" >> /root/post.xcat.log
    msgutil_r $MASTER_IP "info" "provision completed.($NODE)" "/var/log/xcat/xcat.log" "$log_label"
else
    if [ "$XCATDEBUGMODE" = "1" ] || [ "$XCATDEBUGMODE" = "2" ]; then
        msgutil_r "$MASTER_IP" "debug" "node boot failed, reporting status..." "/var/log/xcat/xcat.log" "$log_label"
    fi
    updateflag.awk $MASTER 3002 "installstatus failed"
    rc=$?
    echo "xcat.deployment [xcatinstallpost] mypostscript.post updateflag.awk $MASTER 3002 \"installstatus failed\" return with $rc" >> /root/post.xcat.log
    msgutil_r $MASTER_IP "error" "provision completed with error.($NODE)" "/var/log/xcat/xcat.log" "$log_label"
fi
cxhong commented 3 years ago

@dombrowa, sorry, that was a typo: the file /install/autoinst/<nodename> is created by the nodeset/rinstall command. It contains the deployment flow for this node, and MASTER_IP was available there. As for the postscripts defined in the osimage, I think they should have /install in front of custom, right?

postbootscripts=custom/rhels8.2.0-x86_64-install-netsres/compute.postboot
    postscripts=custom/rhels8.2.0-x86_64-install-netsres/compute.postinstall
cxhong commented 3 years ago

If precreatemypostscripts is set to yes/1, it will regenerate /tftpboot/mypostscripts/mypostscript.<nodename>. Normally we do not set it, so this curl download will fail:

curl --fail --retry 20 --max-time 60 "http://$MASTER_IP:${HTTPPORT}$TFTPDIR/mypostscripts/mypostscript.$NODE" -o "/xcatpost/\
mypostscript.$NODE" 2> /tmp/download.log

but it will go on and use /xcatpost/getpostscript.awk to download the postscripts.

So check the file /install/autoinst/<nodename> to see whether MASTER is unset somewhere, and run xcatprobe osdeploy -n <nodename> after the rinstall command.

dombrowa commented 3 years ago

Never mind the typo autoinst[all]. It was obvious as I have other xCAT Management nodes to compare to.

As to the curl and awk download: I have added various taps into the code and have observed:

  1. curl downloads only a single script, /tftpboot/mypostscripts/mypostscript. if available. If it does not find it, it silently writes its error to /tmp/download.log and continues; xCAT does not report this as a failure.
  2. /xcatpost/getpostscript.awk kicks in if curl failed. I looked at the code, which opens a connection to the xcatmaster. The script returns rc=0 whether or not /tftpboot/mypostscripts/mypostscript. exists. In my case neither method downloads the file, because it has not been generated, but the awk command creates an empty file instead. Why is that file generated under /tftpboot for RHEL7 but not for RHEL8? I do not mind putting debug sensors into the xCAT Perl sources, but I have not found the code responsible for creating /tftpboot/mypostscripts/mypostscript. upon rinstall. I tried to locate it like this:
    [root@netsres-xcat work]# find /opt/xcat/ -iname "*.pm" -or -iname '*.pl' -exec grep mypostscript {} \;
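
A note on the find command above: the implicit "and" binds tighter than -or, so -exec applies only to the '*.pl' test and the '*.pm' files are never passed to grep. A minimal sketch with explicit grouping follows; `find_mypostscript_refs` is a hypothetical wrapper name, and `grep -l` is used so that the matching file names are printed rather than the matching lines.

```shell
# Hypothetical wrapper; the search root defaults to /opt/xcat as in the
# command above.
find_mypostscript_refs() {
    local search_root=${1:-/opt/xcat}
    # The escaped parentheses make -exec apply to both name tests.
    find "$search_root" \( -iname '*.pm' -o -iname '*.pl' \) \
        -exec grep -l 'mypostscript' {} +
}

# On the management node (not run here):
#   find_mypostscript_refs /opt/xcat/lib/perl
```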

The autoinst file does contain the MASTER_IP when I run rinstall RHEL8

[root@netsres-xcat ~]# grep MASTER /install/autoinst/netsres46
export MASTER_IP="172.16.16.1"

The problem remains that even with MASTER_IP, MASTER, etc. set, mypostscript.post is missing the export statements and the section that runs the postboot/postscripts.

The output file is always created, whatever awk returns, because of the '>' redirection:

   /xcatpost/getpostscript.awk | egrep '<data>' | sed -e 's/<[^>]*>//g' | egrep -v '^ *$' | sed -e 's/^ *//' | sed -e 's/&l\
t;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' -e 's/&quot;/"/g' -e "s/&apos;/'/g" >/xcatpost/mypostscript
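
The shell sets up the '>' target before the pipeline runs, which is why the file exists even when nothing is returned. One way to avoid leaving an empty file behind is to write to a temporary file first and install it only when it has content. The following is a sketch under that assumption; `save_if_nonempty` is a hypothetical helper and not part of xCAT.

```shell
# save_if_nonempty TARGET CMD [ARGS...]
# Runs CMD, captures its stdout in a temporary file, and moves that file
# to TARGET only when something was actually written.
save_if_nonempty() {
    local target=$1
    shift
    local tmp
    tmp=$(mktemp) || return 1
    "$@" > "$tmp"
    if [ -s "$tmp" ]; then
        mv "$tmp" "$target"
        return 0
    fi
    rm -f "$tmp"
    return 1   # the command produced no output; TARGET is left untouched
}
```

A caller could then treat a non-zero return as "download failed" instead of discovering an empty /xcatpost/mypostscript later.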

Instead, post.xcat.ng greps for MASTER= in /xcatpost/mypostscript.netsres46, which it never finds in any of the 10 iterations. I added some code that logs this behavior; in the output below, curl fails and awk tries 10 times to download:

xcat.deployment [post.xcat.ng] curl --fail --retry 20 --max-time 60 "http://172.16.16.1:80/tftpboot/mypostscripts/mypostscript.netsres46" -o "/xcatpost/mypostscript.netsres46" 2> /tmp/download.log return with 22
xcat.deployment [post.xcat.ng] precreated mypostscript not downloaded, see /tmp/download.log
xcat.deployment [post.xcat.ng] no pre-generated mypostscript.<nodename>, trying to get it with getpostscript.awk...
xcat.deployment [post.xcat.ng] /xcatpost/getpostscript.awk .. return with 0
xcat.deployment [post.xcat.ng] [1/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [2/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [3/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [4/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [5/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [6/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [7/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [8/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] [9/10] precreated mypostscript exists
xcat.deployment [post.xcat.ng] Missing MASTER in /xcatpost/mypostscript
xcat.deployment [post.xcat.ng] Missing tag "postscripts-start-here" in /xcatpost/mypostscript
xcat.deployment [post.xcat.ng] Missing tag "postscripts-end-here" in /xcatpost/mypostscript
xcat.deployment [post.xcat.ng] Missing tag "postbootscript-start-here" in /xcatpost/mypostscript
xcat.deployment [post.xcat.ng] Missing tag "postbootscript-end-here" in /xcatpost/mypostscript
xcat.deployment [post.xcat.ng] generate mypostscript.post file successfully
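
The checks reported in the log above can be sketched as a standalone verifier that could be run by hand on the node. `check_mypostscript` is a hypothetical function; the tag names are taken from the RHEL7.7 mypostscript shown earlier in this thread, and the assumption is that a complete file has a line starting with MASTER= plus all four section markers.

```shell
# check_mypostscript FILE
# Returns 0 when FILE is non-empty, defines MASTER, and carries the four
# section markers; otherwise prints what is missing and returns 1.
check_mypostscript() {
    local file=$1 missing=0 tag
    if [ ! -s "$file" ]; then
        echo "$file is missing or empty"
        return 1
    fi
    # '^MASTER=' does not match MASTER_IP=, so both are told apart.
    if ! grep -q '^MASTER=' "$file"; then
        echo "Missing MASTER in $file"
        missing=1
    fi
    for tag in postscripts-start-here postscripts-end-here \
               postbootscripts-start-here postbootscripts-end-here; do
        # Anchored so that e.g. "# osimage-postscripts-start-here" does not count.
        if ! grep -q "^# ${tag}\$" "$file"; then
            echo "Missing tag \"$tag\" in $file"
            missing=1
        fi
    done
    return "$missing"
}
```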

I will run another install using the full path for the post* scripts:

   postbootscripts=/install/postscripts/custom/rhels8.2.0-x86_64-install-netsres/compute.postboot
    postscripts=/install/postscripts/custom/rhels8.2.0-x86_64-install-netsres/compute.postinstall
cxhong commented 3 years ago

/xcatpost/getpostscript.awk will call /opt/xcat/lib/xcat/plugins/getpostscript.pm, which then calls makescript in /opt/xcat/lib/perl/xCAT/Postage.pm. Can you check /var/log/xcat/*log for error messages from makescript? In the site table there is no precreatemypostscripts attribute, right?

cxhong commented 3 years ago

Do you have a /install/postscripts/mypostscript.tmpl file on your system? If you do, can you get rid of it and set the precreatemypostscripts attribute to 0 in the site table?

dombrowa commented 3 years ago

MN=Management Node (in my case the MASTER or xCAT server, all as one node)

  1. /opt/xcat/lib/xcat/plugins/getpostscript.pm does not exist on the MN or on any of my cluster nodes; on the MN it exists as /opt/xcat/lib/perl/xCAT_plugin/getpostscript.pm.

  2. /install/postscripts/mypostscript.tmpl exists on my MN as /opt/xcat/share/xcat/mypostscript/mypostscript.tmpl, and

    [root@netsres-xcat ~]# lsdef -t site -i precreatemypostscripts
    Object name: clustersite
    precreatemypostscripts=0

    which should satisfy your requirements. This has not been changed.

With these settings the osdeploy log shows:

[root@netsres-xcat Downloads]# xcatprobe osdeploy -n netsres46 2>&1| tee ~/work/netsres46.osdeploy.2
....
[netsres46] 12:49:02 Via HTTP get /install/postscripts/xcatserver
[netsres46] 12:49:02 Via HTTP get /tftpboot/mypostscripts/mypostscript....
[netsres46] 12:51:32 Via HTTP get /tftpboot/xcat/xnba/nodes/netsres46
[netsres46] 13:02:43 Via HTTP get //install/rhels8.2.0/x86_64/AppStream...
[netsres46] 13:02:43 Via HTTP get //install/rhels8.2.0/x86_64/AppStream...
[netsres46] 13:02:43 Via HTTP get //install/rhels8.2.0/x86_64/AppStream...
[netsres46] 13:02:43 Via HTTP get //install/rhels8.2.0/x86_64/AppStream...
[netsres46] 13:02:43 Via HTTP get //install/rhels8.2.0/x86_64/AppStream...
[netsres46] 13:02:43 Via HTTP get //install/rhels8.2.0/x86_64/AppStream...
[netsres46] 13:02:44 Via HTTP get //install/rhels8.2.0/x86_64/BaseOS/re...
[netsres46] 13:02:44 Via HTTP get //install/rhels8.2.0/x86_64/BaseOS/re...
[netsres46] 13:02:44 Via HTTP get //install/rhels8.2.0/x86_64/BaseOS/re...
[netsres46] 13:02:44 Via HTTP get //install/rhels8.2.0/x86_64/BaseOS/re...
[netsres46] 13:02:44 Via HTTP get //install/rhels8.2.0/x86_64/BaseOS/re...
60 minutes have expired, stop monitoring                                  [INFO]
======================  Summary  =====================
There is 1 node provision failures
netsres46 : stop at stage 'start_to_install_os_package'                   [FAIL]

and syslog shows:

...
Sep 30 16:32:23 netsres46 xcat.deployment Generate the repository for the installation
Sep 30 12:37:53 netsres46 xcat.deployment [post.xcat.ng] Executing post.xcat to prepare for firstbooting ...
Sep 30 12:38:33 netsres46 xcat.deployment [post.xcat.ng] trying to download postscripts from 172.16.16.1...
Sep 30 12:38:35 netsres46 xcat.deployment [post.xcat.ng] postscripts downloaded successfully
Sep 30 12:38:35 netsres46 xcat.deployment [post.xcat.ng] trying to get mypostscript from 172.16.16.1...
Sep 30 12:38:35 netsres46 xcat.deployment [post.xcat.ng] failed to download precreated mypostscript
Sep 30 12:40:53 netsres46 xcat.deployment [post.xcat.ng] finished firstboot preparation, sending request to 172.16.16.1:3002 for changing status...
Sep 30 12:41:57 netsres46 xcat.deployment [xcatinstallpost] Running /xcatpost/mypostscript.post
Sep 30 12:41:57 netsres46 xcat provision completed.(netsres46)
Sep 30 12:41:57 netsres46 xcat.deployment [xcatinstallpost] /xcatpost/mypostscript.post return
Sep 30 12:41:57 netsres46 xcat.deployment [xcatinstallpost] =============deployment ending====================
cxhong commented 3 years ago

sorry, it is /opt/xcat/lib/perl/xCAT_plugin/getpostscript.pm and /opt/xcat/lib/perl/xCAT/Postage.pm

What is in /install/postscripts/mypostscript.tmpl? This file is created if precreatemypostscripts=1. What is its timestamp? Can you remove /install/postscripts/mypostscript.tmpl and then run rinstall again?

The timestamps of the xcatprobe output and the syslog differ, and syslog shows the deployment ending, yet osdeploy is stuck on the installation of packages?

dombrowa commented 3 years ago
  1. I see that getpostscript.awk submits <command>getpostscript</command>, upon which (I assume) the xCAT server runs /opt/xcat/lib/perl/xCAT_plugin/getpostscript.pm.

  2. As to the syntax for postscripts: when I run with the full path in the osimage table for the post* scripts, I see this error:

    netsres46: Wed Sep 30 14:22:39 EDT 2020 postscript /install/postscripts/custom/rhels8.2.0-x86_64-install-netsres/compute.postboot does NOT exist.

    Since this message is prefixed with <nodename> I believe the full path is incorrect and should remain relative

    custom/rhels8.2.0-x86_64-install-netsres/compute.postboot

    e.g. to /xcatpost on the node

  3. After setting precreatemypostscripts=1 and then running rinstall, this file appears: /tftpboot/mypostscripts/mypostscript.netsres46. Here are its contents: mypostscript.netsres46.gz. This file, however, is no longer present:

[root@netsres-xcat ~]# ls /install/postscripts/mypostscript.tmpl
ls: cannot access /install/postscripts/mypostscript.tmpl: No such file or directory

When I switch back to precreatemypostscripts=0, the file /tftpboot/mypostscripts/mypostscript.netsres46 disappears. So it is not clear what creates /install/postscripts/mypostscript.tmpl, and when.

I will have to check the timestamps, as both should come from the MN and be in sync, regardless of whether the node has a time offset due to incorrect NTP. Correct?
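
For point 2 above, the relative-path convention can be checked on the MN before starting an install. The sketch below assumes that postscripts live under /install/postscripts on the MN and end up under /xcatpost on the node, so every entry should be a path relative to that directory. `check_postscripts` is a hypothetical helper, and entries that carry arguments are not handled.

```shell
# check_postscripts BASE_DIR LIST
# LIST is a comma-separated postscripts value as shown by lsdef.
check_postscripts() {
    local base_dir=$1 list=$2 script rc=0
    local -a names
    IFS=',' read -r -a names <<< "$list"
    for script in "${names[@]}"; do
        case "$script" in
            /*)
                echo "absolute path, expected a relative one: $script"
                rc=1
                ;;
            *)
                if [ ! -f "$base_dir/$script" ]; then
                    echo "not found under $base_dir: $script"
                    rc=1
                fi
                ;;
        esac
    done
    return "$rc"
}

# On the management node (not run here):
#   check_postscripts /install/postscripts \
#       "custom/rhels8.2.0-x86_64-install-netsres/compute.postboot"
```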

cxhong commented 3 years ago

Both MASTER and MASTER_IP are defined in mypostscript.netsres46.gz.
I think, from the previous post, that MASTER is also available in /install/autoinst/netsres46. Do some postscripts unset the environment variables?

cxhong commented 3 years ago

@dombrowa, what is the OS of the xCAT MN?

dombrowa commented 3 years ago

The xCAT MN is a VM running:

[root@netsres-xcat ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="OpenShift Enterprise"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
jjohnson42 commented 3 years ago

FYI, I have encountered two items that cause similar behavior:

- Ubuntu install: if the installed image has a non-gawk awk, then it turns out like this.
- If the site table contains xcatsslversion or xcatsslciphers that disable newer ciphers by mistake, this happens.

Note that I delete the use of 'nice' as randbytes, as it is a bad idea.

In this case, running 'getpostscript.awk' is the most direct way of seeing what is going awry.
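
Before running getpostscript.awk by hand on the node, it may help to confirm which awk is installed, since a non-gawk awk is one of the two causes listed above. A minimal sketch, assuming that GNU awk identifies itself with the string "GNU Awk" in its --version output; `is_gnu_awk` is a hypothetical helper.

```shell
# is_gnu_awk [AWK_CMD]
# Succeeds when AWK_CMD (default: awk) reports itself as GNU awk.
is_gnu_awk() {
    local awk_cmd=${1:-awk}
    # stdin is closed off so an awk that ignores --version cannot block.
    "$awk_cmd" --version 2>/dev/null </dev/null | grep -q 'GNU Awk'
}

if is_gnu_awk; then
    echo "awk is GNU awk"
else
    echo "awk is not GNU awk; getpostscript.awk may not behave as expected"
fi
```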

thiell commented 3 years ago

Thanks @jjohnson42, I had the same problem with xCAT 2.16.1 when deploying CentOS 8.2 nodes, no variables were defined in /xcatpost/mypostscript including $MASTER_IP, and simply removing the xcatsslversion definition from site table fixed the problem.
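
To check whether a site table carries either of the two SSL attributes mentioned above, the `tabdump site` output can be filtered. This is a sketch that assumes the "key","value",, line format shown earlier in this thread; `find_ssl_overrides` is a hypothetical helper that reads the dump on stdin.

```shell
# Reads `tabdump site` output on stdin and prints any xcatsslversion or
# xcatsslciphers rows. Returns non-zero when neither is present.
find_ssl_overrides() {
    grep -i -E '^"xcatssl(version|ciphers)",'
}

# On the management node (not run here):
#   tabdump site | find_ssl_overrides
```

An empty result means neither attribute is set, in which case this particular cause can be ruled out.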