xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Unable to PXE boot VMs with TripleO, out of date PXE on 8.21 #559

Open timolow opened 2 years ago

timolow commented 2 years ago

Linking the issue I raised in ipxe repo here: https://github.com/xcp-ng-rpms/ipxe/issues/1

I have converted my lab from VMware to XCP-ng 8.21 and am trying to get my OpenStack (TripleO) lab deployed again, but I am hitting a snag with the PXE version in the Xen guest. Is there any way to get PXE upgraded for the guest VMs? When the VMs attempt to PXE boot, they spam "Unrecognized Option --timeout" and refuse to boot. https://bugzilla.redhat.com/show_bug.cgi?id=1343649

TripleO/RDO is a deployment stack based on Ansible/Puppet/Podman that creates a fully running "Overcloud". PXE is used to inspect the hardware config, install the base OS, and clean the nodes for retirement or redeployment, so PXE booting is fundamental to managing the lifecycle of overcloud nodes. A quick primer on the use of PXE with OpenStack: https://tripleo-docs.readthedocs.io/en/latest/environments/baremetal.html

stormi commented 2 years ago

Here are the RPM sources for our ipxe and ipxe-efi packages: https://github.com/xcp-ng-rpms/ipxe and https://github.com/xcp-ng-rpms/ipxe-efi. They were inherited from XenServer when we forked.

Last time I discussed the matter with a developer at Citrix, they told me that things tended to break in subtle ways when they upgraded ipxe, which would explain why the version shipped is so old.

Maybe the issue could be fixed with a simple patch to ipxe to make it handle the --timeout option. Or maybe you could try to build a more recent version (https://xcp-ng.org/docs/develprocess.html#local-rpm-build), replace it and see how it goes?

We have different versions for BIOS and UEFI, by the way. Do both cause the issue you reported?

timolow commented 2 years ago

I got a bit further on this issue but hit a dead end. I did enable UEFI and got the introspection going, however when deploying the overcloud things fall apart.

It tries to PXE boot and gets this far before dropping me into the UEFI shell:

Start PXE over IPv4
Station IP address is 10.1.2.5
Server IP address is 10.1.2.2
NBP filename is undioly.kpxe
NBP filesize is 73125 Bytes

Download NBP file...

NBP file downloaded successfully.

Start PXE over IPv4.

Some time later it drops me into an EFI shell.

stormi commented 2 years ago

So this looks like a different issue now.

"NBP filename is undioly.kpxe": is this a typo in your comment? It should be undionly.kpxe, if I remember correctly.

timolow commented 2 years ago

Yes, it was a transcription error between the screen output and GitHub.

[screenshot: pxe-boot-issue]

timolow commented 2 years ago

I ended up spending a few hours troubleshooting this. I first tried editing the --timeout option out of my TripleO deployment to bypass the issue. That worked for the introspection of the nodes, but once the system tried to load the overcloud VM it would just time out or reboot the VM.

I then moved on and extracted the ROMs from ipxe-roms-qemu-20180825-3.git133f4c.el7.noarch.rpm and ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch.rpm (both of these versions work fine). I first tried "cat rtl8139.rom 8086100e.rom > /usr/share/ipxe/ipxe.bin": the Realtek Ethernet card PXE booted, but the e1000 ROM never loaded and I was greeted by a "no boot devices found, shutting down in 30 seconds" message. I spent some time on it and ended up finding that if you simply copy 8086100e.rom to /usr/share/ipxe/ipxe-1000e.bin, both the Realtek and e1000 cards fully PXE boot the TripleO environment.
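
For anyone following along, the ROM swap described above could be scripted roughly as follows. This is only a sketch, not a tested procedure for XCP-ng: the helper name install_rom and the .orig backup suffix are made up for illustration, and the location of the ROM files inside the extracted RPM (./usr/share/ipxe/) is an assumption to verify against the actual package contents.

```shell
#!/bin/sh
# Sketch only: replace an iPXE ROM file, keeping a one-time backup of the original.
# install_rom is a hypothetical helper, not part of XCP-ng or iPXE.
# Usage: install_rom <new_rom> <target_path>
install_rom() {
    new_rom=$1
    target=$2

    # Refuse to continue if the replacement ROM is missing.
    if [ ! -f "$new_rom" ]; then
        echo "missing ROM: $new_rom" >&2
        return 1
    fi

    # Back up the shipped file once, so repeated runs never overwrite the backup.
    if [ -f "$target" ] && [ ! -f "$target.orig" ]; then
        cp -p "$target" "$target.orig" || return 1
    fi

    cp "$new_rom" "$target"
}

# Example, to be run as root on the host (left commented out on purpose).
# The path inside the RPM is an assumption; check it after extraction.
#   mkdir roms && cd roms
#   rpm2cpio ../ipxe-roms-qemu-20180825-3.git133f4c.el7.noarch.rpm | cpio -idm
#   install_rom ./usr/share/ipxe/8086100e.rom /usr/share/ipxe/ipxe-1000e.bin
```

Keeping the .orig copy makes it easy to revert if a host update or a different guest NIC model turns out to depend on the shipped ROM.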

stormi commented 2 years ago

Thanks for the feedback. So as I understand it, you currently have a workaround which consists in:

Do you also still need to modify /usr/share/ipxe/ipxe.bin for the realtek card?