xcat2 / xcat-core

Fail to deploy image over a tagged vlan #7160

Open urielrosen1981 opened 2 years ago

urielrosen1981 commented 2 years ago

Hello,

I am unable to install a new osimage over a tagged VLAN that was configured in the server's BIOS. PXE boot starts, and then I get a message that there are no more network devices. I have attached the node definitions. Please advise if you see a missing or wrong setting.

image

lsdef ai-slurm-g9
Object name: ai-slurm-g9
    arch=x86_64
    bmc=10.26.19.209
    bmcpassword=admin
    bmcport=0
    bmcusername=*****
    cmdmapping=115200
    cons=ipmi
    consoleenabled=1
    consoleondemand=hard
    currchain=boot
    currstate=install rocky8-x86_64-compute
    getmac=1
    installnic=eth0
    ip=10.26.36.109
    mac=04:3f:72:db:77:44
    mgt=ipmi
    netboot=xnba
    nfsserver=10.26.36.80
    nicdevices.bond0-port1=bond0
    nicdevices.bond0-port2=bond0
    nicips.bond0-port1=10.26.36.109
    nictypes.bond0-port1=vlan
    nictypes.ens7f0=ethernet
    nictypes.bond0-port2=vlan
    nictypes.ens1f0=ethernet
    os=rocky8
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    primarynic=eth0
    profile=compute
    provmethod=rocky8.5
    serialflow=115200
    serialspeed=1
    status=powering-on
    statustime=05-01-2022 10:29:15
    tftpserver=10.26.36.80
    updatestatus=synced
    updatestatustime=12-23-2020 16:21:29

gurevichmark commented 2 years ago

@urielrosen1981 Have you tried installing this image over non-tagged VLAN ?

urielrosen1981 commented 2 years ago

Hi,

Thanks for your reply. I just tested installing another node over a non-tagged VLAN and it was successful. Can you help me debug and solve this issue?

besawn commented 2 years ago

xNBA is based on iPXE. xCAT does not provide any special handling to enable iPXE VLAN features.

I think there are two problems here:

1.) xCAT does not provide support for tagged vlans in xNBA

2.) There is an open issue in iPXE related to tagged VLANs: https://github.com/ipxe/ipxe/issues/369

I think using an untagged VLAN is probably the easiest solution. Is that an option for your use case?

urielrosen1981 commented 2 years ago

Hi,

Sorry, but our network design currently uses only VLAN tagging. Is there any workaround you know of that I can use? Alternatively, do you have any estimate of when this bug will be fixed?

besawn commented 2 years ago

> do you have any estimate when this bug will be fixed ?

xNBA vlan tagging is not a priority for the xCAT core team, so any improvements related to this request will need to be driven by community members such as yourself.

> is there any workaround you know I can use

Possible workarounds you could attempt:

1.) You could try to manually modify the xNBA file that contains the boot commands located at /tftpboot/xcat/xnba/nodes/ai-slurm-g9 on your management node to add the necessary iPXE commands to create the tagged vlan. A simple example is described here: https://ipxe.org/scripting, I think the vcreate command is what you need.

2.) You could try to replace the current version of xnba-undi installed on your management node with the older 1.0.3 version available here: https://xcat.org/files/xcat/repos/yum/2.16/xcat-dep/xnba-undi-1.0.3-131028.noarch.rpm to see if it behaves the same way or not. I think this experiment is worth trying, but I am not sure if it will solve your problem.

3.) You may need to combine 1 and 2 to add the call to vcreate and get a version of xNBA that is not impacted by https://github.com/ipxe/ipxe/issues/369.

4.) You could try to build a custom version of xNBA that includes a custom script to configure the vlan for your environment.

If you can report the results of your investigation here, we can try to continue to assist with suggestions.

gurevichmark commented 2 years ago

Instructions for step 4) can be found here: https://github.com/xcat2/xcat-dep/blob/master/xnba/README

urielrosen1981 commented 2 years ago

Thanks for your suggestions. I just tried steps 1 and 2, but I have some questions.

I tried to modify the /tftpboot/xcat/xnba/nodes/ai-slurm-g9 file, but after each "rinstall", when I try to deploy the image, the file is rewritten to the default. Is this behavior normal, and should I just add my custom commands each time after I run "rinstall"?

In any case, I have attached the contents of the file below. So far, creating my tagged VLAN has not worked. Please tell me whether I have done this correctly or need to modify the file. I need to create VLAN 36 and use DHCP to start the installation.

Another issue: after reaching net 1, I lose the display and cannot see anything on the console monitor. I wanted to debug this by entering the xNBA shell with "ctrl + b", but that didn't work either. Waiting for your input. Thanks.

#!gpxe
# install rocky8-x86_64-compute
vcreate --tag 36 net0
autoboot net0-36
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/rocky8-x86_64-install-compute/vmlinuz
imgload kernel
imgargs kernel quiet inst.repo=http://10.26.36.80:80/install/rocky8/x86_64 inst.ks=http://10.26.36.80:80/install/autoinst/ai-slurm-g9 ip=ens1f0:dhcp inst.sshd inst.loglevel=debug inst.syslog=10.26.36.80 BOOTIF=01-${netX/machyp}
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/rocky8-x86_64-install-compute/initrd.img
imgexec kernel

besawn commented 2 years ago

> tried to modify the /tftpboot/xcat/xnba/nodes/ai-slurm-g9 file but I see that after each "rinstall" when I try to deploy the image the file is rewritten to the default file , is this behavior normal and should I just add my custom commands each time after I run "rinstall" ?

Every time rinstall or nodeset is run, the tftpboot files are regenerated; this is normal behavior. rinstall is a convenience command that combines a few other commands into a single operation. For the test you are attempting, I would recommend using nodeset instead of rinstall, so you can modify the boot file after the nodeset but before the install starts. Some more information here: https://xcat-docs.readthedocs.io/en/stable/guides/admin-guides/manage_clusters/ppc64le/diskful/deploy_os.html?highlight=nodeset

The process should be something like:

nodeset ai-slurm-g9 osimage=rocky8-x86_64-install-compute
Modify /tftpboot/xcat/xnba/nodes/ai-slurm-g9
rsetboot ai-slurm-g9 net
rpower ai-slurm-g9 reset
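
As a minimal sketch of the "modify the boot file" step, the helper below inserts extra iPXE commands right after the first line of the file that nodeset generates. The name insert_boot_cmds is hypothetical (it is not an xCAT command). The VLAN tag 36 and device net0 in the usage comment come from this thread, and the "dhcp net0-36" line is an untested assumption about how the VLAN interface would be configured. It also assumes an xNBA build that actually includes the vcreate command.

```shell
#!/bin/sh
# insert_boot_cmds FILE CMD...
# Hypothetical helper: insert each CMD as its own line directly after the
# first line (the "#!gpxe" / "#!ipxe" marker) of an xNBA node boot file.
insert_boot_cmds() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    {
        head -n 1 "$file"      # keep the marker line first
        printf '%s\n' "$@"     # one inserted command per argument
        tail -n +2 "$file"     # rest of the generated script, unchanged
    } > "$tmp" && cat "$tmp" > "$file"   # overwrite in place, keeping file mode
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Possible use between nodeset and the reboot (commands from this thread,
# arguments to insert_boot_cmds are assumptions):
#   nodeset ai-slurm-g9 osimage=rocky8-x86_64-install-compute
#   insert_boot_cmds /tftpboot/xcat/xnba/nodes/ai-slurm-g9 \
#       'vcreate --tag 36 net0' 'dhcp net0-36'
#   rsetboot ai-slurm-g9 net
#   rpower ai-slurm-g9 reset
```

Because nodeset regenerates the file, the helper would have to be rerun after every nodeset.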

> please tell me if I have done this correctly or I need to modify the file

I don't have any existing experience trying to boot iPXE/xNBA over tagged VLAN, so I don't have any specific advice on the actual commands. I was re-reading this issue: https://github.com/ipxe/ipxe/issues/369 and I noticed another problem: In your original screenshot above, there is a "Features" line that shows which iPXE features have been compiled into xNBA/iPXE. xNBA does not have the VLAN feature listed, so it most likely does not support the vcreate command.

You will need to rebuild xNBA using the instructions @gurevichmark pointed to above, but enable the VLAN feature using VLAN_CMD as described in the iPXE issue. However, you will also need to patch the code to avoid the problem described in the iPXE issue.
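
A rough sketch of the "enable VLAN_CMD" part, to be run on the unpacked source before building. The name enable_vlan_cmd is hypothetical. It assumes that src/config/general.h carries the option as a commented-out "//#define VLAN_CMD" line and that GNU sed is available. It does not apply the code patch needed for the iPXE issue.

```shell
#!/bin/sh
# enable_vlan_cmd HEADER
# Hypothetical helper: activate the VLAN_CMD build option in an iPXE
# config header by removing the leading "//" from its #define line.
enable_vlan_cmd() {
    hdr=$1
    # Already enabled: nothing to do.
    grep -Eq '^[[:space:]]*#define[[:space:]]+VLAN_CMD' "$hdr" && return 0
    # Turn "//#define VLAN_CMD ..." into "#define VLAN_CMD ..." (GNU sed).
    sed -i -E 's|^[[:space:]]*//[[:space:]]*(#define[[:space:]]+VLAN_CMD)|\1|' "$hdr"
    # Fail if the option is still not active, e.g. the line was not found.
    grep -Eq '^[[:space:]]*#define[[:space:]]+VLAN_CMD' "$hdr"
}

# Assumed usage, with the source unpacked as xnba-1.21.1:
#   enable_vlan_cmd xnba-1.21.1/src/config/general.h
```

A non-zero return means the header has no VLAN_CMD line in the assumed form and needs to be edited by hand.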

urielrosen1981 commented 2 years ago

Thanks for your reply,

I tried rebuilding ipxe with VLAN commands but have some difficulty and would appreciate your advice. First I will describe the steps I took and the error I encountered.

git clone https://git.ipxe.org/ipxe.git

mv ipxe xnba-1.21.1

in xnba-1.21.1/src/config/general.h file I added this line for VLAN support:

#define VLAN_CMD /* VLAN commands */

ran cd src; make

copied 5 patch files and xnba-1.21.1.tar.bz2 file to /root/rpmbuild/SOURCES/ then ran rpmbuild -ba xnba-undi.spec to rebuild the xnba rpm and got the below error.

-rw-r--r-- root/root 15247 2022-05-08 12:03 xnba-1.21.1/src/util/zbin.c
-rwxr-xr-x root/root 38040 2022-05-08 12:52 xnba-1.21.1/src/util/zbin
-rw-r--r-- root/root 0 2022-05-08 12:51 xnba-1.21.1/src/.echocheck

Do you know why I got this error? Is there perhaps a working RPM with VLAN support that you can provide for me to download?

gurevichmark commented 2 years ago

@urielrosen1981 What OS are you running rpmbuild -ba xnba-undi.spec command on ?

One thing you can try is add this line to xnba-undi.spec somewhere before %define lines there:

%global _default_patch_fuzz 3

urielrosen1981 commented 2 years ago

Thanks, now it ran for a couple of minutes before failing. The error is below.

(.text16.data+0x76): undefined reference to `_data16_memsz'
bin-x86_64-efi/blib.a(pxe_entry.o): In function `pxenv':
(.text16.data+0x82): undefined reference to `_data16_memsz'
bin-x86_64-efi/blib.a(pxe_entry.o): In function `pxenv':
(.text16.data+0x86): undefined reference to `_text16_memsz'
make: *** [bin-x86_64-efi/snponly.efi.tmp] Error 1
rm bin-x86_64-efi/version.snponly.efi.o
error: Bad exit status from /var/tmp/rpm-tmp.gHOO37 (%build)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.gHOO37 (%build)

gurevichmark commented 2 years ago

@urielrosen1981 What OS are you running on ? Have you tried without your changes to xnba-1.21.1/src/config/general.h file ?

urielrosen1981 commented 2 years ago

CentOS Linux release 7.9.2009 (Core)

Did not modify xnba-1.21.1/src/config/general.h file.

gurevichmark commented 2 years ago

Oh, I thought earlier you posted:

> in xnba-1.21.1/src/config/general.h file I added this line for VLAN support:
>
> #define VLAN_CMD /* VLAN commands */

urielrosen1981 commented 2 years ago

Yes, you are right, I forgot about this. Anyway, I was able to build the RPM and install it, but now I get the error below. Do you have any idea what is wrong now?

image

gurevichmark commented 2 years ago

@urielrosen1981 Verify that your management server has the /tftpboot/xcat/xnba.efi file matching the time you built it with rpmbuild command and with "read for all" permissions.

urielrosen1981 commented 2 years ago

The file has read for all

ls -ltr /tftpboot/xcat/xnba.efi
-rw-r--r-- 1 root root 139200 Oct 28 2013 /tftpboot/xcat/xnba.efi

not sure I understand what you mean "file matching the time you built it with rpmbuild command"

could you please explain how to check this?

gurevichmark commented 2 years ago

The rpmbuild -ba xnba-undi.spec command should have generated a new xnba-undi-1.21.1-1.noarch RPM. That RPM should contain updated files:

[root@c910f04x40 ~]# rpm -qll xnba-undi-1.21.1-1.noarch
/tftpboot/xcat/xnba.efi
/tftpboot/xcat/xnba.kpxe
[root@c910f04x40 ~]#

Uninstalling your existing xnba-undi and then installing the new one, should have replaced those 2 files under /tftpboot/xcat

It looks like your files are from 2013, so maybe you have not installed the xnba-undi RPM generated by rpmbuild ?

urielrosen1981 commented 2 years ago

I think you are correct, the install didn't finish correctly. How do you suggest I install the new RPM over the existing one so that it works on my system?

gurevichmark commented 2 years ago

Try rpm -U on the xnba-undi rpm file generated by rpmbuild command.

urielrosen1981 commented 2 years ago

rpm -U xnba-undi-1.21.1-1.noarch.rpm
package xnba-undi-1.21.1-1.noarch is already installed
file /tftpboot/xcat/xnba.efi from install of xnba-undi-1.21.1-1.noarch conflicts with file from package xnba-undi-1.21.1-1.noarch
file /tftpboot/xcat/xnba.kpxe from install of xnba-undi-1.21.1-1.noarch conflicts with file from package xnba-undi-1.21.1-1.noarch

I get this conflict. I tried to move these files aside, but that didn't help.

gurevichmark commented 2 years ago

Try to remove existing rpm with rpm -e and then install the new one.

urielrosen1981 commented 2 years ago

I cannot, because xCAT-2.16.3-snap202111100958.x86_64 depends on xnba-undi-1.21.1-1.noarch.

if I also remove xCAT-2.16.3-snap202111100958.x86_64 will this not harm the entire installation of xCAT?

gurevichmark commented 2 years ago

How about rpm -U --replacefiles --replacepkgs xnba-undi-1.21.1-1.noarch.rpm ?

If that fails, you can bump the version number in xnba-undi.spec up to 1.21.2 and rebuild the RPM with the rpmbuild command again. That should generate xnba-undi-1.21.2-1.noarch.rpm, which you can try to install with rpm -U xnba-undi-1.21.2-1.noarch.rpm
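
The version bump could be scripted roughly as follows. This is only a sketch: bump_spec_version is a hypothetical helper name, and it assumes the spec file sets the version with a plain "Version:" tag line (not through a macro) and that GNU sed is available.

```shell
#!/bin/sh
# bump_spec_version SPEC NEWVER
# Hypothetical helper: rewrite the "Version:" tag of an RPM spec file.
bump_spec_version() {
    spec=$1
    newver=$2
    # Replace whatever follows "Version:" with the new version string.
    sed -i -E "s|^(Version:[[:space:]]*).*|\1${newver}|" "$spec"
    # Succeed only if a Version: line with exactly that value now exists.
    awk -v v="$newver" '$1 == "Version:" && $2 == v { ok = 1 } END { exit !ok }' "$spec"
}

# Assumed usage, followed by the commands from this thread:
#   bump_spec_version xnba-undi.spec 1.21.2
#   rpmbuild -ba xnba-undi.spec
#   rpm -U xnba-undi-1.21.2-1.noarch.rpm
```

If the spec file refers to the source tarball by version, the tarball name would presumably need to match the new version as well.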

urielrosen1981 commented 2 years ago

Thanks, this worked, so I was able to run the vcreate command, but I got the error mentioned in the iPXE issue you pointed me to. I asked them how to get the commit that fixes this error, unless you know how to do this and can share it with me.

image

urielrosen1981 commented 2 years ago

I was able to find and install the correct iPXE version with VLAN support (https://github.com/ipxe/ipxe/commit/eecb75ba). Thanks a lot! I now face a few more issues:

  1. I am only able to continue the boot from the xNBA prompt. When I put the same commands that worked in the shell into the node script, it doesn't work:

cat /tftpboot/xcat/xnba/nodes/ai-slurm-g9.uefi

#!gpxe
vcreate --tag 36 -p 0 net0
autoboot
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/rocky8-x86_64-install-compute/vmlinuz
imgload kernel
imgargs kernel quiet inst.repo=http://10.26.36.80:80/install/rocky8/x86_64 inst.ks=http://10.26.36.80:80/install/autoinst/ai-slurm-g9 ip=ens1f0:dhcp inst.sshd inst.loglevel=debug inst.syslog=10.26.36.80 BOOTIF=01-${netX/mac:hexhyp} initrd=initrd
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/rocky8-x86_64-install-compute/initrd.img
imgexec kernel

What am I doing wrong here?

  2. After the install starts, I receive the following error and the install hangs:

image

any suggestions ?

gurevichmark commented 2 years ago

@urielrosen1981

  1. Have you tried changing #!gpxe to #!ipxe in your /tftpboot/xcat/xnba/nodes/ai-slurm-g9.uefi file before running rpower ?
  2. In the screen shot that you posted, which error concerns you ? Does the install hang on the last displayed line Started cancel waiting... ? And hitting Enter a few times does not advance the display ?

urielrosen1981 commented 2 years ago

Hi,

Changing to #!ipxe didn't make a difference here. Is there a way to debug this outside of the xNBA prompt? Regarding the screenshot, it doesn't advance past this with Enter. I am guessing the last line is the reason it hangs, but I am not sure.

gurevichmark commented 2 years ago

Perhaps you can try posting to xcat-user mailing list to see if anyone in the community had success with tagged VLAN on x86 ?

urielrosen1981 commented 2 years ago

Thanks for your suggestion , I sent the details to the mailing list, hope to find a solution. Thanks again for all your kind help.