xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0
368 stars, 172 forks

(already) Provisioned nodes are failing to boot off disk #7030

Open techie879 opened 3 years ago

techie879 commented 3 years ago

Hi,

We need your help with a problem we have, please.

We have nodes that were provisioned with xCAT; they are running, and the OS is installed and working. The boot order is set to PXE first, SSD second.

Several days ago, when I rebooted one of the nodes, it went straight into PXE discovery mode and attempted an install. This node is already built, so it should have exited PXE boot and booted off the disk, but it never did.

I am not sure what's going on. It looks like xCAT has lost track of the node's status, i.e. whether it is installed or still needs provisioning.

Here is the 'lsdef' output of the node:

[root@mn]# lsdef -t node hpc3-14-03
Object name: hpc3-14-03
    arch=x86_64
    cpucount=40
    cputype=Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
    currchain=boot
    currstate=boot
    disksize=sda:224GB,sdb:224GB
    groups=centos78
    ip=10.240.58.16
    mac=08:f1:ea:e4:35:52
    memory=193122MB
    mtm=HPE:ProLiant XL170r Gen10
    netboot=xnba
    nichostnamesuffixes.ib0=-ib0
    nichostnamesuffixes.ipmi=-ipmi
    nicips.ib0=10.240.60.16
    nicips.ipmi=10.240.62.16
    os=centos7.7
    postbootscripts=otherpkgs,hpc3-postscripts/hpc3postbootscript
    postscripts=syslog,remoteshell,syncfiles,setupntp,hpc3-postscripts/hpc3postscript.1,confignetwork -s
    profile=compute
    provmethod=centos7.8-x86_64-install-compute
    serial=2M294204L9
    status=booted
    statustime=06-21-2021 16:41:45
    supportedarchs=x86,x86_64
[root@mn]# nodediscoverls | grep 14-03
  38363730-3535-324D-3239-343230344C39    hpc3-14-03          manual         HPE:ProLiant XL170r Gen10 2M294204L9
[root@mn]# lsdef -t network compute_net_1
Object name: compute_net_1
    domain=local
    dynamicrange=10.240.58.221-10.240.58.240
    gateway=10.240.58.1
    mask=255.255.254.0
    mgtifname=eno1
    mtu=1500
    nameservers=10.240.58.4,8.8.8.8,128.200.192.202
    net=10.240.58.0
    staticrange=10.240.58.4-10.240.59.220
    tftpserver=<xcatmaster>

Any idea what might be going on here? Why would a node that is already set up and installed go back into discovery mode (and want to be installed again)?

Can someone please shed some light?

thanks a lot!

gurevichmark commented 3 years ago

Check the contents of /tftpboot/xcat/xnba/nodes/<nodename>. It should contain something like:

#!gpxe
#boot
exit
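
The same check can be scripted when many nodes need to be inspected. Below is a minimal Python sketch, not part of xCAT. It assumes that a "boot from disk" node file is one whose only non-comment command is `exit`, which hands control back to the firmware so it can continue with the next boot device. The function names are hypothetical, and the directory is the one mentioned above.

```python
from pathlib import Path

# Assumption: the node script directory named in the comment above.
XNBA_NODE_DIR = Path("/tftpboot/xcat/xnba/nodes")


def boots_from_disk(script_text: str) -> bool:
    """Return True if the script starts with #!gpxe and its only command is 'exit'."""
    lines = [ln.strip() for ln in script_text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#!gpxe":
        return False
    # Lines such as "#boot" are comments; whatever is left are the commands.
    commands = [ln for ln in lines[1:] if not ln.startswith("#")]
    return commands == ["exit"]


def check_node(node: str, node_dir: Path = XNBA_NODE_DIR) -> bool:
    """Return True if the node's script file exists and just exits to the firmware."""
    path = node_dir / node
    return path.is_file() and boots_from_disk(path.read_text())
```

Calling `check_node("hpc3-14-03")` on the management node would then report whether that node's file has the expected form. A `True` result only rules out the node file as the cause; it says nothing about DHCP.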

techie879 commented 3 years ago

Hi Mark,

That entry exists.

@.# cat /tftpboot/xcat/xnba/nodes/
Display all 732 possibilities? (y or n)
@. dhcpd]# cat /tftpboot/xcat/xnba/nodes/hpc3-14-03
#!gpxe
#boot
exit

What I have noticed is that the dhcpd.leases file does not have a host entry with a "fixed-address" line for this node, like the following entry for another node:

host hpc3-gpu-16-03 {
    dynamic;
    hardware ethernet 20:67:7c:10:ba:86;
    uid 20:67:7c:10:ba:86;
    fixed-address 10.240.58.61;
    supersede server.ddns-hostname = "hpc3-gpu-16-03";
    supersede host-name = "hpc3-gpu-16-03";
    if option user-class-identifier = "xNBA" and option client-architecture = 00:00 {
        supersede server.filename = "http:// ${next-server}:80/tftpboot/xcat/xnba/nodes/hpc3-gpu-16-03";
    } elsif option client-architecture = 00:00 {
        supersede server.filename = "xcat/xnba.kpxe";
    } else {
        supersede server.filename = "";
    }
}

I am assuming the lease file got messed up somehow. What are your thoughts on reconstructing the file (programmatically) and using a modified file? Or is there another way from within xcat to add entries in the dhcpd leases file?
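
Before changing anything, it may help to list which nodes are actually missing a host entry. The following is a rough, read-only Python sketch, not an xCAT tool. It assumes host entries look like the one quoted above and that braces inside quoted strings (such as `${next-server}`) are balanced. The function names are hypothetical.

```python
import re

HOST_START = re.compile(r"\bhost\s+(\S+)\s*\{")
FIXED_ADDR = re.compile(r"\bfixed-address\s+([^;\s]+)\s*;")


def host_fixed_addresses(leases_text: str) -> dict:
    """Map each declared host name to its fixed-address, or None if it has none."""
    result = {}
    pos = 0
    while True:
        match = HOST_START.search(leases_text, pos)
        if match is None:
            break
        # Walk forward to the brace that closes this host block.
        depth = 1
        i = match.end()
        while i < len(leases_text) and depth > 0:
            if leases_text[i] == "{":
                depth += 1
            elif leases_text[i] == "}":
                depth -= 1
            i += 1
        body = leases_text[match.end():i]
        addr = FIXED_ADDR.search(body)
        # A later entry for the same host overwrites an earlier one.
        result[match.group(1)] = addr.group(1) if addr else None
        pos = i
    return result


def nodes_without_fixed_address(leases_text: str, nodes: list) -> list:
    """Return the nodes that have no host entry with a fixed-address."""
    found = host_fixed_addresses(leases_text)
    return [node for node in nodes if not found.get(node)]
```

The text would come from reading the lease file on the management node; its location depends on the distribution and is not stated in this thread. Passing a list such as `["hpc3-14-03", "hpc3-gpu-16-03"]` returns only the names that lack an entry.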

thanks

gurevichmark commented 3 years ago

You can run makedhcp <node> to add the node entry to the dhcpd.leases file. Then restart dhcpd.
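
For more than a handful of nodes, the two steps can be wrapped in a small script. This is only a sketch in Python. It assumes a systemd-based management node where the DHCP service unit is named `dhcpd`; the restart command is not stated in this thread, and the function names are hypothetical. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess


def repair_commands(node: str) -> list:
    """Build the commands for the fix described above, without running them."""
    return [
        ["makedhcp", node],
        # Assumption: systemd host with a service unit called "dhcpd".
        ["systemctl", "restart", "dhcpd"],
    ]


def repair(node: str, dry_run: bool = True) -> list:
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = repair_commands(node)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands
```

Running `repair("hpc3-14-03")` only prints the two commands; pass `dry_run=False` on the management node to execute them.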

techie879 commented 3 years ago

Thank you so much. It did not occur to me that makedhcp would add the node(s) back to the lease file. I usually get a bit confused when I read the documentation on the makedhcp and makedns commands.
