xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Diskless node deployment with different VLANs. TFTP file not found #7304

Open PabloMariaMera opened 1 year ago

PabloMariaMera commented 1 year ago

Hi everyone. We are trying to do a diskless deployment with xCAT 2.16. We have several racks, and each one has its own subnet. Our custom RHEL8 images are ready and worked in our test environment with a single VLAN. When we try to replicate that setup with several subnets, the nodes do not boot. We have tried modifying /etc/dhcpd.conf manually: nodes get their IP, but TFTP fails.

How can we fix this? We are trying to upgrade from RHEL7 Perceus to RHEL8 xCAT.

besawn commented 1 year ago

Are you using service nodes or only a management node?

Can you provide this output from your management node:

lsdef -t network -l
tabdump site | grep dhcpinterfaces
ip route

PabloMariaMera commented 1 year ago

Thanks for the quick response.

Our intention is to deploy all diskless servers with different custom images, using netboot and mounting shared folders.

Here is more information about our deployment:

NETWORKS TABDUMP DHCPINTERFACES IP ROUTE OSIMAGE NODE CENSURED

We don't really need dynamic range because we want static IPs. It wasn't working so we modified dhcpd.conf and added the host by hand:

/etc/dhcp/dhcpd.conf DHCPD CONF HOST CENSURED
/var/log/messages TFTP FILENAME NOT FOUND

Any suggestions or help would be very much appreciated, we are a bit blocked.

Thank you for your time.

PabloMariaMera commented 1 year ago

One more detail: when we specify 172.17.0.1, the switch forwards to the main server, in this case 172.17.31.1. We have tried both IPs just in case, but it does not seem to be a connectivity problem.

besawn commented 1 year ago

I would recommend not modifying the dhcpd.conf by hand.

> We don't really need dynamic range because we want static IPs.

You don't need a dynamic range to boot a node with a diskless image, but you do need to have the MAC address/IP address association configured for that node. During normal operation, xCAT adds the entry for the host to the DHCP configuration when you run makedhcp cons0201.
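As a rough sketch of checking that association before registering the node: the helper name `is_mac` is hypothetical, and it only accepts a single plain colon-separated MAC, so a mac attribute holding several entries would need to be split first. The `lsdef` call is skipped when xCAT is not installed.

```shell
#!/usr/bin/env bash
# is_mac VALUE: succeed if VALUE looks like one colon-separated MAC address.
# Hypothetical helper for illustration; not part of xCAT.
is_mac() {
    printf '%s\n' "$1" | grep -Eqi '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'
}

# On a management node, show the attributes that makedhcp will use.
if command -v lsdef >/dev/null 2>&1; then
    lsdef cons0201 -i mac,ip
fi
```

If the displayed mac value passes `is_mac` and the ip value is the address you expect, `makedhcp cons0201` should have what it needs to create the host entry.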

There is a discrepancy between your network table configuration and the actual network interface configuration on your management node. Specifically, the netmask for the management_21_network is 255.255.255.0 and your bond0 netmask is 255.255.0.0. You should correct your network table entries so they match your actual management node network interfaces.

I would suggest trying to get things working with a flat network configuration first. If you can get that to work, you can go back and implement a more sophisticated network scheme.

To correct your network table entries, you can use makenetworks, chdef, or tabedit networks. Whenever you make changes to your network table entries, you need to re-run makedhcp -n to add the updates to the DHCP server configuration on the management node.
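To illustrate why the two netmasks matter, here is a minimal sketch that computes the network address for one IP under each mask. The function name `network_of` is hypothetical, and pairing 172.17.31.1 with bond0 is an assumption made only for the example; substitute the values from your own `lsdef -t network -l` and `ip` output.

```shell
#!/usr/bin/env bash
# network_of IP MASK: print the network address (octet-wise AND of IP and MASK).
# Hypothetical helper for illustration; not part of xCAT.
network_of() {
    local ip=$1 mask=$2
    local IFS=.
    # Unquoted expansion splits both dotted quads into eight positional params.
    set -- $ip $mask
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

network_of 172.17.31.1 255.255.255.0   # mask from the networks table -> 172.17.31.0
network_of 172.17.31.1 255.255.0.0     # mask on bond0                -> 172.17.0.0

# On a management node, list both sides for comparison.
if command -v lsdef >/dev/null 2>&1; then
    lsdef -t network -l
    ip -o -4 addr show dev bond0
fi
```

The same address lands in two different networks depending on the mask, which is why the network table and the interface have to agree before makedhcp -n can generate a matching subnet declaration.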

If you are confident that the mac and ip attributes are correct for cons0201, I would suggest you try the following:

Start by saving your existing dhcpd.conf in case you need to preserve any of your manual changes.

# Regenerate a fresh DHCP configuration
makedhcp -n

# Add your node mac address / IP information to your DHCP configuration
makedhcp cons0201 

# Check that the configuration matches what you expect
makedhcp -q cons0201

# Try to boot the node with the diskless image
nodeset cons0201 osimage=rhels8.4.0-x86_64-cons-compute
rpower cons0201 boot

# Once the rpower starts, you can further debug by watching the boot process in the console
rcons cons0201 
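
The steps above, including the backup of dhcpd.conf, could be wrapped in a small script along these lines. This is only a sketch: `run`, `redeploy_node`, the `DRY_RUN` variable and the `.bak` file name are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: set DRY_RUN=0 to actually execute the commands.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# redeploy_node NODE OSIMAGE: back up dhcpd.conf, rebuild DHCP, then netboot.
redeploy_node() {
    local node=$1 image=$2
    local conf=/etc/dhcp/dhcpd.conf
    run cp -p "$conf" "$conf.bak" || return 1      # keep manual changes
    run makedhcp -n || return 1                    # fresh DHCP configuration
    run makedhcp "$node" || return 1               # add the node's MAC/IP entry
    run makedhcp -q "$node" || return 1            # show what was registered
    run nodeset "$node" "osimage=$image" || return 1
    run rpower "$node" boot
}

redeploy_node cons0201 rhels8.4.0-x86_64-cons-compute
```

Stopping at the first failing step keeps a broken DHCP configuration from being followed by a reboot of the node. Once the node is booting, rcons can still be started by hand to watch the console.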

If you are still experiencing problems, try xcatprobe to see if it detects any issues.

xcatprobe xcatmn will check for configuration problems on the management node. xcatprobe osimagecheck will check for issues with your osimages. xcatprobe osdeploy -n cons0201 will allow you to monitor the node install process while a node is booting to look for problems.