xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Centos 7: clients PXE Boot Aborted after 3.10.0-1127.18.2 upgrade #6815

Closed pcmc closed 4 years ago

pcmc commented 4 years ago

Having run xCAT 2.16 smoothly on a small cluster of 4 compute nodes for about 18 months, I have hit a problem with the latest CentOS 7 kernel update, from 3.10.0-1127.13.1 to 3.10.0-1127.18.2. After the update the clients no longer boot and show this error:

CLIENT MAC ADDR: 00 25 90 5A EB BA GUID: 00000000 0025905AEBBA
CLIENT IP: 130.246.32.141 MASK: 255.255.252.0 DHCP IP: 130.246.32.140
GATEWAY IP: 130.246.32.254

PXE Boot aborted. Booting to next device...
PXE-M0F: Exiting Intel Boot Agent

With previous CentOS 7 updates, there were no such problems restarting the clients with the updated kernels.

I have made a number of checks, described below, but have not been able to pinpoint the fault. I would appreciate it if anyone has any idea what the issue could be.

Many thanks. Peter Chiu

Below are some details on the systems:

Master node: main.bnsc.rl.ac.uk 130.246.32.140/22, gateway 130.246.32.254

Compute node 1: proc01.bnsc.rl.ac.uk 130.246.32.141/22, MAC 00:25:90:5a:eb:8a

Operating system (cat /etc/centos-release): CentOS Linux release 7.8.2003 (Core)

xCAT version:

    [root@main dhcp]# rpm -qf /opt/xcat/sbin/xcatd
    xCAT-server-2.16-snap202006161607.noarch

Checks:

  1. /var/log/messages for dhcp messages, no error.

The master server picked up the DHCP request and offered the address, but there was no further communication afterwards.

    Aug 4 15:00:27 main dhcpd: DHCPDISCOVER from 00:25:90:5a:eb:8a via bond0
    Aug 4 15:00:27 main dhcpd: DHCPOFFER on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
    Aug 4 15:00:29 main dhcpd: Dynamic and static leases present for 130.246.32.141.
    Aug 4 15:00:29 main dhcpd: Remove host declaration proc01 or remove 130.246.32.141
    Aug 4 15:00:29 main dhcpd: from the dynamic address pool for bond0
    Aug 4 15:00:29 main dhcpd: DHCPREQUEST for 130.246.32.141 (130.246.32.140) from 00:25:90:5a:eb:8a via bond0
    Aug 4 15:00:29 main dhcpd: DHCPACK on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0

  2. /var/log/xcat/cluster.log

No errors, just a record of a new image produced.

    Aug 4 14:24:54 main xcat[28101]: INFO xCAT: Allowing lsdef -t site -o clustersite -i installdir for root from localhost
    Aug 4 14:24:54 main xcat[28103]: INFO xCAT: Allowing genimage -i eth0 -n dca,ixgbe,igb,e1000e,e1000,tg3 -o centos7.6 -p compute --tempfile /tmp/xcat_genimage.28086 for root from localhost
    Aug 4 14:27:29 main xcat[25483]: INFO xCAT: Allowing packimage centos7.6-x86_64-netboot-compute for root from localhost
    Aug 4 14:27:30 main xcat[25499]: INFO xCAT: Allowing ilitefile centos7.6-x86_64-statelite-compute for root from localhost
    Aug 4 14:30:07 main xcat[26073]: INFO xCAT: Allowing nodeset to compute osimage=centos7.6-x86_64-netboot-compute for root from localhost
    Aug 4 14:34:33 main xcat[26958]: INFO xCAT: Allowing rpower to compute reset for root from localhost
    Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc03: changing status=powering-on
    Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc04: changing status=powering-on
    Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc01: changing status=powering-on
    Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc02: changing status=powering-on

  3. Check dhcp lease file for the files to be downloaded:

    less /var/lib/dhcpd/dhcpd.leases

    host proc01.bnsc.rl.ac.uk {
      deleted;
    }
    host proc04.bnsc.rl.ac.uk {
      deleted;
    }
    host proc01 {
      dynamic;
      hardware ethernet 00:25:90:5a:eb:8a;
      uid 00:25:90:5a:eb:8a;
      fixed-address 130.246.32.141;
      supersede server.ddns-hostname = "proc01";
      supersede host-name = "proc01";
      if option user-class-identifier = "xNBA" and option client-architecture = 00:00 {
        supersede server.always-broadcast = 01;
        supersede server.filename = "http://${next-server}:80/tftpboot/xcat/xnba/nodes/proc01";
      } elsif option user-class-identifier = "xNBA" and option client-architecture = 00:09 {
        supersede server.filename = "http://${next-server}:80/tftpboot/xcat/xnba/nodes/proc01.uefi";
      } elsif option client-architecture = 00:07 {
        supersede server.filename = "xcat/xnba.efi";
      } elsif option client-architecture = 00:00 {
        supersede server.filename = "xcat/xnba.kpxe";
      } else {
        supersede server.filename = "";
      }
    }
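For readers unfamiliar with this host entry, the if/elsif chain decides which boot file a client is told to fetch. The sketch below restates that chain as a shell function so the two-stage boot is easier to follow: a plain PXE ROM (architecture 00:00, no xNBA user class) is handed xcat/xnba.kpxe over TFTP, and once xNBA is running it asks again, identifies itself as xNBA, and is handed the HTTP URL of the node script. The function name and argument order are made up for illustration; only the branch logic comes from the lease entry above.

```shell
# Illustrative restatement of the if/elsif chain in the host entry.
# boot_filename is a hypothetical helper, not part of xCAT or dhcpd.
#   $1 = user class sent by the client ("xNBA" or empty)
#   $2 = client-architecture option, e.g. 00:00 (legacy BIOS), 00:07 or 00:09 (UEFI)
#   $3 = node name
#   $4 = address that ${next-server} expands to
boot_filename() {
    user_class=$1
    arch=$2
    node=$3
    server=$4
    if [ "$user_class" = "xNBA" ] && [ "$arch" = "00:00" ]; then
        echo "http://$server:80/tftpboot/xcat/xnba/nodes/$node"
    elif [ "$user_class" = "xNBA" ] && [ "$arch" = "00:09" ]; then
        echo "http://$server:80/tftpboot/xcat/xnba/nodes/$node.uefi"
    elif [ "$arch" = "00:07" ]; then
        echo "xcat/xnba.efi"
    elif [ "$arch" = "00:00" ]; then
        echo "xcat/xnba.kpxe"
    else
        echo ""
    fi
}

# Stage 1: the BIOS PXE ROM asks without a user class.
boot_filename "" 00:00 proc01 130.246.32.140
# Stage 2: xNBA asks again and identifies itself.
boot_filename xNBA 00:00 proc01 130.246.32.140
```

If a node stops before stage 2, the failure is somewhere between the DHCP answer and the TFTP fetch of xnba.kpxe, which is the stage the "PXE Boot aborted" message in this report belongs to.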

I then worked through this list, downloading each file from a separate CentOS server.

a. tftp 130.246.32.140

    [root@cds1 xcat]# tftp 130.246.32.140
    tftp> get xcat/xnba.kpxe
    tftp> get xcat/xnba.efi
    tftp> get yaboot
    tftp> get xcat/xnba/nets/130.246.32.0_22
    tftp> get xcat/xnba/nets/130.246.32.0_22.uefi
    tftp> quit
    [root@cds1 xcat]# ls
    130.246.32.0_22 130.246.32.0_22.uefi elilo.efi xnba.efi xnba.kpxe yaboot
    [root@cds1 xcat]# ls -ls
    total 536
    4 -rw-r--r-- 1 root root 252 Aug 4 09:46 130.246.32.0_22
    4 -rw-r--r-- 1 root root 116 Aug 4 09:46 130.246.32.0_22.uefi
    0 -rw-r--r-- 1 root root 0 Aug 4 09:45 elilo.efi
    140 -rw-r--r-- 1 root root 139169 Aug 4 09:45 xnba.efi
    80 -rw-r--r-- 1 root root 74786 Aug 4 09:45 xnba.kpxe
    308 -rw-r--r-- 1 root root 310187 Aug 4 09:46 yaboot

b. Use wget to download the node startup file:

    [root@cds1 xcat]# wget http://130.246.32.140:80/tftpboot/xcat/xnba/nodes/proc01
    --2020-08-04 11:57:18-- http://130.246.32.140/tftpboot/xcat/xnba/nodes/proc01
    Connecting to 130.246.32.140:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 528
    Saving to: `proc01'

    100%[======================================>] 528 --.-K/s in 0s

    2020-08-04 11:57:18 (85.2 MB/s) - `proc01' saved [528/528]

This file in turn contains the instructions to download the kernel and ramdisk:

    [root@cds1 xcat]# less proc01
    #!gpxe
    #netboot centos7.6-x86_64-compute
    imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/kernel
    imgload kernel
    imgargs kernel imgurl=http://130.246.32.140:80//install/netboot/centos7.6/x86_64/compute/rootimg.cpio.gz XCAT=130.246.32.140:3001 NODE=proc01 FC=yes XCATHTTPPORT=80 netdev=eth0 selinux=0 biosdevname=0 net.ifnames=0 BOOTIF=01-${netX/machyp}
    imgfetch http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/initrd-stateless.gz
    imgexec kernel

Both the kernel and ramdisk can also be downloaded using wget command.
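The manual wget checks above can be repeated for every URL in a node script in one pass. This is a minimal sketch, assuming the script has been saved locally (as proc01 in this report) and that ${next-server} should be replaced with the master's address; script_urls is a hypothetical helper name, not an xCAT command.

```shell
# Print every http(s) URL found in a saved xNBA node script,
# with the ${next-server} placeholder replaced by a real address.
# script_urls is a hypothetical helper written for this example.
#   $1 = path to the saved node script
#   $2 = address to substitute for ${next-server}
script_urls() {
    sed 's/\${next-server}/'"$2"'/g' "$1" | grep -oE 'https?://[^ ]+'
}

# Example use (needs network access to the master, so it is left commented out):
#   script_urls proc01 130.246.32.140 | while read -r url; do
#       if wget -q --spider "$url"; then echo "OK   $url"; else echo "FAIL $url"; fi
#   done
```

A pass on every URL only shows that the HTTP side is healthy. It says nothing about whether the client ever gets as far as requesting these files.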

The problem appears to lie in one of the images downloaded over PXE, but I am not sure which one. Attachment: PXE-Boot-Aborted-20200804.docx

jjohnson42 commented 4 years ago

What does nodeset stat show?

Also, is tftp still running and/or has a firewall rule blocking tftp come back?


pcmc commented 4 years ago

Hi,

  1. Nodeset stat

    [root@main netboot]# nodeset compute stat
    proc01: netboot centos7.6-x86_64-compute
    proc02: netboot centos7.6-x86_64-compute
    proc03: netboot centos7.6-x86_64-compute
    proc04: netboot centos7.6-x86_64-compute

  2. tftp is still running and accepting transfers

    [root@main netboot]# ps auxw | grep ftp
    root 27941 0.0 0.0 11004 152 ? Ss 14:24 0:00 /usr/sbin/in.tftpd -v -l -s /tftpboot -m /etc/tftpmapfile4xcat.conf

    I tested this earlier by fetching files using tftp from a separate host:

    [root@cds1 xcat]# tftp 130.246.32.140
    tftp> get xcat/xnba.kpxe
    …
    tftp> quit

    ls -ls          # to confirm the files were downloaded with tftp from the master:
    80 -rw-r--r-- 1 root root 74786 Aug 4 09:45 xnba.kpxe

  3. Firewall disabled

    [root@main netboot]# systemctl status firewalld
    • firewalld.service - firewalld - dynamic firewall daemon
    Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:firewalld(1)
    [root@main netboot]#

Any further thoughts?

Regards, Peter


jjohnson42 commented 4 years ago

Sorry about not looking at it... looks like the mac is wrong?

00:25:90:5a:eb:8a versus 00 25 90 5A EB BA
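Because the two addresses are written in different styles (colon-separated lowercase versus space-separated uppercase), it helps to normalise both before comparing them. A minimal sketch, with normalize_mac and same_mac as hypothetical helper names:

```shell
# Reduce a MAC address to 12 lowercase hex digits, whatever separators it uses.
# Both helpers are hypothetical and written only for this comparison.
normalize_mac() {
    printf '%s' "$1" | tr -cd '0-9A-Fa-f' | tr 'A-F' 'a-f'
}

# Succeed (exit status 0) when two MAC addresses are the same after normalising.
same_mac() {
    [ "$(normalize_mac "$1")" = "$(normalize_mac "$2")" ]
}

# The two values quoted in this thread:
if same_mac '00:25:90:5a:eb:8a' '00 25 90 5A EB BA'; then
    echo "match"
else
    echo "differ"
fi
```

For the values as posted, the last octet differs (8a versus BA), so the comparison reports a difference.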

pcmc commented 4 years ago

Thanks for your reply, but I don't think that is the cause.

The compute node does have the MAC 00:25:90:5a:eb:8a.

This is confirmed with tabdump mac:

    [root@main ~]# tabdump mac
    #node,interface,mac,comments,disable
    "proc01",,"00:25:90:5a:eb:8a",,
    "proc02",,"00:25:90:5a:eb:f2",,
    "proc03",,"00:25:90:5a:eb:a2",,
    "proc04",,"00:25:90:5a:eb:d0",,
    [root@main ~]#

/var/log/messages also records:

    Aug 26 14:41:07 main dhcpd: DHCPOFFER on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
    Aug 26 14:41:09 main dhcpd: Dynamic and static leases present for 130.246.32.141.
    Aug 26 14:41:09 main dhcpd: Remove host declaration proc01 or remove 130.246.32.141
    Aug 26 14:41:09 main dhcpd: DHCPREQUEST for 130.246.32.141 (130.246.32.140) from 00:25:90:5a:eb:8a via bond0
    Aug 26 14:41:09 main dhcpd: DHCPACK on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0

So the server received the request from 00:25:90:5a:eb:8a and offered the IP address 130.246.32.141.

But as you can see from the console snapshot, the node got “PXE Boot aborted”. It looks like something incorrect has crept in. The MAC address does look okay to me.

Peter


jjohnson42 commented 4 years ago

Do you understand why the firmware would have said:

    CLIENT MAC ADDR: 00 25 90 5A EB BA GUID: 00000000 0025905AEBBA
    CLIENT IP: 130.246.32.141 MASK: 255.255.252.0 DHCP IP: 130.246.32.140
    GATEWAY IP: 130.246.32.254

Looks like the dhcp server sees the mac address as 8a, but the firmware prints BA which is odd.

In any event, based on the dhcp message about multiple entries for the address, I would probably stop dhcpd, edit dhcpd.leases, delete the lease entry that describes 130.246.32.141, and leave only the host entry for the node.
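One way to do that edit without hand-editing is to filter the lease block out of a copy of the file. The sketch below assumes the usual dhcpd.leases layout, where a dynamic lease starts with a line of the form "lease <ip> {" and ends with an unindented "}". drop_lease is a hypothetical helper name, and the file names in the usage comment are examples only.

```shell
# Print a leases file with every "lease <ip> { ... }" block for one address removed.
# Host entries and leases for other addresses are passed through untouched.
# drop_lease is a hypothetical helper written for this example.
#   $1 = IP address whose dynamic lease blocks should be dropped
#   $2 = path to a copy of dhcpd.leases
drop_lease() {
    awk -v ip="$1" '
        $1 == "lease" && $2 == ip && $3 == "{" { skip = 1; next }  # start of a matching block
        skip && /^}/                           { skip = 0; next }  # its closing brace
        !skip                                                      # print everything else
    ' "$2"
}

# Example use, with dhcpd stopped first and the original kept as a backup:
#   cp /var/lib/dhcpd/dhcpd.leases /root/dhcpd.leases.bak
#   drop_lease 130.246.32.141 /root/dhcpd.leases.bak > /root/dhcpd.leases.new
```

Reviewing the filtered copy before putting it in place is worthwhile, since the filter has only been reasoned about for the simple block layout described above.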

pcmc commented 4 years ago

Hello Jarrod,

I wonder where you see the MAC address that says: 00 25 90 5A EB BA

In the screenshot that I posted in the last mail, I am sorry but I think it does say it ends with 8A, not BA. I have attached the latest screenshot here.

I have also attached a copy of /var/lib/dhcpd/dhcpd.leases. Which entry do you think I should remove? proc01 to proc04 are compute nodes. They all fail to boot with the same PXE Boot aborted error.

Peter


pcmc commented 4 years ago

Dear all,

Just an update on the issue I raised. The problem persisted even after I reverted the system to an older CentOS 7 release, so the PXE boot aborted error is not related to the CentOS 7 3.10.0-1127.19.1 update.

Instead, it turned out to be a conflict with a WDSNBP IP helper that our network admin introduced into the network switch fabric a few months back.

Once this WDSNBP DHCP service is stopped, the xCAT nodes boot up fine over PXE.
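A second responder of this kind can sometimes be spotted by listing which servers answer DHCP traffic on the boot network. The sketch below is an assumption-laden illustration, not something used in this thread: it expects tcpdump's verbose output to contain a "Server-ID" line per reply (the exact wording varies between tcpdump versions), and dhcp_servers is a hypothetical helper name. A capture on the master only sees replies that are broadcast or otherwise visible on that interface.

```shell
# Read verbose tcpdump output on stdin and print each distinct DHCP server
# identifier once. dhcp_servers is a hypothetical helper for this example.
dhcp_servers() {
    grep -oE 'Server-ID[^:]*: *[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | awk '{ print $NF }' | sort -u
}

# Example use while a node is power-cycled (bond0 is the interface named in the
# dhcpd log lines in this report):
#   tcpdump -i bond0 -n -v -c 20 'udp port 67 or udp port 68' 2>/dev/null | dhcp_servers
```

More than one address in the output would suggest that something other than the xCAT master is also answering the nodes.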

Peter

cxhong commented 4 years ago

Thanks @pcmc for the update, and good find.