xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

confignetwork and setroute postscripts - subsequent runs on RHEL8 #6851

Open shawn174 opened 4 years ago

shawn174 commented 4 years ago

confignetwork works fine as a postscript for the initial OS build. A subsequent run with updatenode confignetwork produces an incorrect network config (especially for bond interfaces and bond+vlan interfaces). After a reboot the config still isn't correct, and another updatenode confignetwork doesn't fix it. This issue also affects the setroute postscript: subsequent runs of setroute leave many of the interfaces with the same default route, each with a different route metric. My first suspect is NetworkManager, but that's based only on past experience.
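One way to check for the duplicate default routes described above is to list every device that carries a default route on the node. The following is a minimal sketch, not part of xCAT: the function name `default_route_devs` is made up for illustration, and it assumes the usual `ip -4 route show default` line format (`default via <gw> dev <ifname> ... metric <n>`).

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): read `ip -4 route show default`
# style lines on stdin and print "<device> <metric>" for each default route.
# A route without an explicit metric is printed with "-" as its metric.
default_route_devs() {
    awk '$1 == "default" {
        dev = "-"; metric = "-"
        # Stop at NF - 1 so that $(i + 1) always refers to an existing field.
        for (i = 2; i < NF; i++) {
            if ($i == "dev")    dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1)
        }
        print dev, metric
    }'
}

# Example on a node:
#   ip -4 route show default | default_route_devs
```

More than one line of output, each with a different metric, would match the symptom reported here.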

The clients are running RHEL 8.1 with:

    NetworkManager-team-1.20.0-5.el8_1.x86_64
    NetworkManager-1.20.0-5.el8_1.x86_64
    NetworkManager-tui-1.20.0-5.el8_1.x86_64
    NetworkManager-libnm-1.20.0-5.el8_1.x86_64

The config is:

    eno2 (management interface)
    ib0-\
         bond0 - bond0 = IP for ib
    ib2-/
    ens2f0-\
            bond1 -> VLAN 3128 -> bond1.3128 = IP for extnet and default route
    ens2f1-/

Here's an lsdef of one of the nodes:

    lsdef dm3
    Object name: dm3
        arch=x86_64
        bmc=dm3-lom
        bmcport=0
        chain=runcmd=bmcsetup,shell
        currchain=boot
        currstate=boot
        groups=dssg-3-0,dssg,datamover
        hostnames=dm3
        interface=eno2
        ip=10.10.10.21
        mac=38:68:DD:2D:EA:D9
        mgt=ipmi
        netboot=xnba
        nfsdir=/install
        nfsserver=10.10.10.2
        nicaliases.bond0=dm3-ib.net1.net
        nicdevices.bond0=ib0|ib1
        nicdevices.bond1=ens2f0|ens2f1
        nicdevices.bond1.3128=bond1
        nicextraparams.bond0=BONDING_OPTS="mode=active-backup;primary=ib0;miimon=100" MTU=4092
        nicextraparams.bond1=BONDING_OPTS="mode=4;miimon=100" MTU=1500
        nichostnamesuffixes.bond1.3128=-ext
        nichostnamesuffixes.eno1=-dev
        nichostnamesuffixes.bond0=-ib
        nicips.eno1=172.16.140.65
        nicips.eno2=10.10.10.21
        nicips.bond0=172.16.138.26
        nicips.bond1.3128=10.231.16.47
        nicnetworks.eno1=dev
        nicnetworks.eno2=storage
        nicnetworks.bond0=ib
        nicnetworks.bond1.3128=ext
        nictypes.bond0=Bond
        nictypes.bond1.3128=vlan
        nictypes.ens2f1=Ethernet
        nictypes.eno2=Ethernet
        nictypes.ens2f0=Ethernet
        nictypes.eno1=Ethernet
        nictypes.bond1=Bond
        nictypes.ib1=Infiniband
        nictypes.ib0=Infiniband
        nodetype=osi
        ondiscover=nodediscover
        os=rhels8.1
        otherinterfaces=-lom:172.16.140.64,-dev:172.16.140.65,-ib:172.16.138.26,-ext:10.231.16.109
        postbootscripts=otherpkgs
        postscripts=syslog,remoteshell,syncfiles,confignetwork,setroute
        profile=dssgserver
        provmethod=datamover
        routenames=ext
        serialport=0
        serialspeed=115200
        status=powering-off
        statustime=10-06-2020 12:39:51
        updatestatus=synced
        updatestatustime=10-06-2020 12:47:02

    lsdef -t network ext
    Object name: ext
        domain=foo.net1.net
        gateway=10.231.16.1
        mask=255.255.254.0
        mtu=1500
        net=10.231.16.0

    lsdef -t network ib
    Object name: ib
        domain=dssg.local
        mask=255.255.254.0
        mtu=4092
        net=172.16.138.0
        nodehostname=-ib

    tabdump routes
    #routename,net,mask,gateway,ifname,comments,disable
    "ext","0.0.0.0","0.0.0.0","10.99.92.1","bond1.3128",,

shawn174 commented 4 years ago

A quick update: I upgraded the RHEL 8.1 node to the latest NetworkManager packages:

    NetworkManager-libnm-1.22.8-5.el8_2.x86_64
    NetworkManager-1.22.8-5.el8_2.x86_64
    NetworkManager-team-1.22.8-5.el8_2.x86_64
    NetworkManager-tui-1.22.8-5.el8_2.x86_64

updatenode confignetwork,setroute now works. With active bond interfaces it takes two runs for the config to be correct, but that's understandable since it has to take the bonds down and reconfigure them.

I'll keep stress-testing this to see if it's for-sure a fix.

viniciusferrao commented 3 years ago

I think I have a similar issue, and I was blaming InfiniBand since configib fails:

    configure nic and its device : ib0
    [I]: Call configib for IB nics: ib0, ports:
    [I]: NMCLI_USED=2 NIC_IBNICS=ib0 NIC_IBAPORTS= configib
    [E]:Error: configib failed.

But there's still this issue:

    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    configure nic and its device : bond0.123 bond0
    [I]: create_vlan_interface ifname=bond0 vlanid=123
    [I]: Pickup xcatnet, "ceph-sync", from NICNETWORKS for interface "bond0".
    [I]: ip link add link bond0 name bond0.123 type vlan id 123
    [I]: ip link set bond0.123 up
    [I]: State of "bond0.123" was "UNKNOWN" instead of expected "UP". Wait 0 of 200 with interval 1.
    [I]: create_persistent_ifcfg ifname=bond0.123 xcatnet=ceph-sync inattrs=ONBOOT=yes,USERCTL=no,VLAN=yes,MTU=1500
    ['ifcfg-bond0.123']
    [I]: >> ONBOOT="yes"
    [I]: >> USERCTL="no"
    [I]: >> VLAN="yes"
    [I]: >> MTU="1500"
    [I]: >> DEVICE="bond0.123"
    [I]: >> BOOTPROTO="static"
    [I]: >> IPADDR="192.168.168.22"
    [I]: >> NAME="bond0.123"
    [I]: >> NETMASK="255.255.255.0"
    Mon Jun 14 16:09:45 -03 2021 [info]: xcat.deployment.postscript: postscript confignetwork return with 1
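The last line of that log is the one that reports the failure (`postscript confignetwork return with 1`). When several postscripts run in one deployment, a small filter can pull out only the ones that returned non-zero. This is a rough sketch, not an xCAT tool: the function name `failed_postscripts` is hypothetical, the log path in the example is a placeholder, and it assumes every result line has the same `postscript <name> return with <code>` wording as the line above.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): read postscript log lines on stdin
# and print "<postscript> <exit code>" for every postscript whose reported
# return code is non-zero.
failed_postscripts() {
    awk '{
        # Look for the word sequence: postscript <name> return with <code>
        for (i = 1; i + 4 <= NF; i++)
            if ($i == "postscript" && $(i + 2) == "return" &&
                $(i + 3) == "with" && $(i + 4) + 0 != 0)
                print $(i + 1), $(i + 4)
    }'
}

# Example (placeholder path):
#   failed_postscripts < /path/to/postscript.log
```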

At the end of the deployment I end up with:

I sent a message to the users mailing list, but I'm still in the debugging phase right now.

kjhee43 commented 2 years ago

@viniciusferrao regarding this issue:


At the end of the deployment I end up with:

    Wrong hostname; it's using the hostname from ib0 interface instead of the management interface

This is a NetworkManager issue. We were able to prevent it by adding a file named 01-disable-name-change.conf in /etc/NetworkManager/conf.d with the following contents:

[main]
hostname-mode=none
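For anyone who wants to script this, here is a minimal sketch of creating that drop-in file. The function name `write_nm_hostname_dropin` is made up for illustration, and the optional directory argument is an assumption added so that the function can be tried against a scratch directory before touching /etc.

```shell
#!/bin/sh
# Hypothetical helper: write the drop-in file described above.
# $1 is the target directory; it defaults to /etc/NetworkManager/conf.d.
write_nm_hostname_dropin() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    mkdir -p "$conf_dir" || return 1
    # Quoted heredoc delimiter: the contents are written out literally.
    cat > "$conf_dir/01-disable-name-change.conf" <<'EOF'
[main]
hostname-mode=none
EOF
}

# Example, as root on the node:
#   write_nm_hostname_dropin
```

NetworkManager has to re-read its configuration afterwards (for example with `systemctl restart NetworkManager`) before the setting takes effect.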