shawn174 opened 4 years ago
A quick update: I upgraded the RHEL 8.1 node to the latest NetworkManager packages:
NetworkManager-libnm-1.22.8-5.el8_2.x86_64
NetworkManager-1.22.8-5.el8_2.x86_64
NetworkManager-team-1.22.8-5.el8_2.x86_64
NetworkManager-tui-1.22.8-5.el8_2.x86_64
updatenode
I'll keep stress-testing this to see whether it's definitely a fix.
I think I'm hitting a similar issue; I had been blaming InfiniBand, since configib fails:
configure nic and its device : ib0
[I]: Call configib for IB nics: ib0, ports:
[I]: NMCLI_USED=2 NIC_IBNICS=ib0 NIC_IBAPORTS= configib
[E]:Error: configib failed.
But still there's this issue:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
configure nic and its device : bond0.123 bond0
[I]: create_vlan_interface ifname=bond0 vlanid=123
[I]: Pickup xcatnet, "ceph-sync", from NICNETWORKS for interface "bond0".
[I]: ip link add link bond0 name bond0.123 type vlan id 123
[I]: ip link set bond0.123 up
[I]: State of "bond0.123" was "UNKNOWN" instead of expected "UP". Wait 0 of 200 with interval 1.
[I]: create_persistent_ifcfg ifname=bond0.123 xcatnet=ceph-sync inattrs=ONBOOT=yes,USERCTL=no,VLAN=yes,MTU=1500
['ifcfg-bond0.123']
[I]: >> ONBOOT="yes"
[I]: >> USERCTL="no"
[I]: >> VLAN="yes"
[I]: >> MTU="1500"
[I]: >> DEVICE="bond0.123"
[I]: >> BOOTPROTO="static"
[I]: >> IPADDR="192.168.168.22"
[I]: >> NETMASK="255.255.255.0"
[I]: >> NAME="bond0.123"
Mon Jun 14 16:09:45 -03 2021 [info]: xcat.deployment.postscript: postscript confignetwork return with 1
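The `State of "bond0.123" was "UNKNOWN" ... Wait 0 of 200 with interval 1` line suggests confignetwork polls the new VLAN interface until it reports UP. A minimal Python sketch of that kind of wait, assuming the state is read from /sys/class/net/<ifname>/operstate (a real sysfs attribute; whether confignetwork itself checks state this way is an assumption). The function names are hypothetical, and the reader and sleep functions are injectable so the loop can be exercised without a real interface:

```python
import time
from pathlib import Path
from typing import Callable


def read_operstate(ifname: str) -> str:
    """Return the kernel operstate of ifname from sysfs, upper-cased (e.g. 'UP', 'UNKNOWN')."""
    return Path(f"/sys/class/net/{ifname}/operstate").read_text().strip().upper()


def wait_for_state(
    ifname: str,
    expected: str = "UP",
    retries: int = 200,
    interval: int = 1,
    reader: Callable[[str], str] = read_operstate,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll ifname until it reports `expected`; return False after `retries` attempts."""
    for attempt in range(retries):
        state = reader(ifname)
        if state == expected:
            return True
        # Mirrors the wording of the confignetwork log line (hypothetical reimplementation).
        print(f'State of "{ifname}" was "{state}" instead of expected "{expected}". '
              f"Wait {attempt} of {retries} with interval {interval}.")
        sleep(interval)
    return False
```

Only a single "Wait 0" line appears in the log above, which would be consistent with the interface reaching UP on the next poll, though the log alone doesn't prove that.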
At the end of the deployment I end up with:
I sent a message to the user mailing list, but I'm still debugging this.
@viniciusferrao regarding this issue:
At the end of the deployment I end up with:
Wrong hostname; it's using the hostname from the ib0 interface instead of the management interface
This is a NetworkManager issue. We were able to prevent it by adding a file named 01-disable-name-change.conf in /etc/NetworkManager/conf.d with the following contents:
[main]
hostname-mode=none
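A minimal Python sketch of dropping that file in place from a script, for example as part of a custom postscript (that packaging is an assumption, not something described in this thread). It writes the same two lines; the directory is a parameter so the sketch can be tried against a scratch directory instead of /etc, and the function name is hypothetical:

```python
import configparser
from pathlib import Path

# File name and contents come from the workaround above.
DROPIN_NAME = "01-disable-name-change.conf"


def write_hostname_dropin(conf_dir: str = "/etc/NetworkManager/conf.d") -> Path:
    """Write a NetworkManager drop-in with [main] hostname-mode=none."""
    config = configparser.ConfigParser()
    config["main"] = {"hostname-mode": "none"}
    path = Path(conf_dir) / DROPIN_NAME
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as fh:
        # No spaces around '=' so the file matches the snippet above exactly.
        config.write(fh, space_around_delimiters=False)
    return path
```

NetworkManager reads conf.d at startup, so a restart of NetworkManager (or a reboot) is presumably needed before the setting takes effect.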
confignetwork works fine as a postscript for the initial OS build. A subsequent run with updatenode confignetwork, however, produces an incorrect network configuration (especially for bond and bond+VLAN interfaces). After a reboot the configuration is still wrong, and running updatenode confignetwork again does not fix it. This issue also affects the setroute postscript: subsequent runs of setroute give many of the interfaces the same default route with different route metrics. My first suspect is NetworkManager, but that's based only on past experience.
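To see what setroute has actually left behind, one option is to parse the output of `ip -4 route show default` on the node and list the default routes that are not on the interface named in the routes table (bond1.3128 in this setup). A minimal Python sketch; the helper names are hypothetical, and only the `via`, `dev` and `metric` keywords are read:

```python
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    dev: Optional[str]
    gateway: Optional[str]
    metric: int


def parse_default_routes(text: str) -> List[Route]:
    """Parse the text printed by `ip -4 route show default` into Route tuples."""
    routes = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        fields = {}
        # Pair each keyword with the token that follows it, e.g. ("dev", "bond1.3128").
        for key, value in zip(tokens, tokens[1:]):
            if key in ("via", "dev", "metric"):
                fields.setdefault(key, value)
        # ip omits the metric keyword when an IPv4 route's metric is 0.
        routes.append(Route(fields.get("dev"), fields.get("via"), int(fields.get("metric", 0))))
    return routes


def unexpected_defaults(routes: List[Route], expected_dev: str) -> List[Route]:
    """Return default routes that are not on the interface the routes table names."""
    return [r for r in routes if r.dev != expected_dev]
```

Usage would be along the lines of `unexpected_defaults(parse_default_routes(text), "bond1.3128")` on the captured command output; a non-empty result means default routes exist on other interfaces, which is the symptom described above.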
The clients are running RHEL 8.1 with:
NetworkManager-team-1.20.0-5.el8_1.x86_64
NetworkManager-1.20.0-5.el8_1.x86_64
NetworkManager-tui-1.20.0-5.el8_1.x86_64
NetworkManager-libnm-1.20.0-5.el8_1.x86_64
The config is:

eno2 (management interface)

ib0-\
     bond0 - bond0 = IP for ib
ib2-/

ens2f0-\
        bond1 -> VLAN 3128 -> bond1.3128 = IP for extnet and default route
ens2f1-/
Here's an lsdef of one of the nodes:
lsdef dm3
Object name: dm3
    arch=x86_64
    bmc=dm3-lom
    bmcport=0
    chain=runcmd=bmcsetup,shell
    currchain=boot
    currstate=boot
    groups=dssg-3-0,dssg,datamover
    hostnames=dm3
    interface=eno2
    ip=10.10.10.21
    mac=38:68:DD:2D:EA:D9
    mgt=ipmi
    netboot=xnba
    nfsdir=/install
    nfsserver=10.10.10.2
    nicaliases.bond0=dm3-ib.net1.net
    nicdevices.bond0=ib0|ib1
    nicdevices.bond1=ens2f0|ens2f1
    nicdevices.bond1.3128=bond1
    nicextraparams.bond0=BONDING_OPTS="mode=active-backup;primary=ib0;miimon=100" MTU=4092
    nicextraparams.bond1=BONDING_OPTS="mode=4;miimon=100" MTU=1500
    nichostnamesuffixes.bond1.3128=-ext
    nichostnamesuffixes.eno1=-dev
    nichostnamesuffixes.bond0=-ib
    nicips.eno1=172.16.140.65
    nicips.eno2=10.10.10.21
    nicips.bond0=172.16.138.26
    nicips.bond1.3128=10.231.16.47
    nicnetworks.eno1=dev
    nicnetworks.eno2=storage
    nicnetworks.bond0=ib
    nicnetworks.bond1.3128=ext
    nictypes.bond0=Bond
    nictypes.bond1.3128=vlan
    nictypes.ens2f1=Ethernet
    nictypes.eno2=Ethernet
    nictypes.ens2f0=Ethernet
    nictypes.eno1=Ethernet
    nictypes.bond1=Bond
    nictypes.ib1=Infiniband
    nictypes.ib0=Infiniband
    nodetype=osi
    ondiscover=nodediscover
    os=rhels8.1
    otherinterfaces=-lom:172.16.140.64,-dev:172.16.140.65,-ib:172.16.138.26,-ext:10.231.16.109
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles,confignetwork,setroute
    profile=dssgserver
    provmethod=datamover
    routenames=ext
    serialport=0
    serialspeed=115200
    status=powering-off
    statustime=10-06-2020 12:39:51
    updatestatus=synced
    updatestatustime=10-06-2020 12:47:02
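Since most of the network intent lives in the nic* attributes, it can help to regroup them per interface and spell out the bonding mode (mode=4 is the kernel's 802.3ad, i.e. LACP) before comparing against what the node actually configured, for example in /proc/net/bonding/bond1. A minimal Python sketch, assuming one attribute per line as lsdef prints it and ';' as the separator inside BONDING_OPTS, as in the values above; how xCAT itself translates those options is not shown here, and the function names are hypothetical:

```python
import shlex
from typing import Dict, Iterable

# Linux bonding driver mode numbers and their names.
BOND_MODES = {"0": "balance-rr", "1": "active-backup", "2": "balance-xor",
              "3": "broadcast", "4": "802.3ad", "5": "balance-tlb", "6": "balance-alb"}


def parse_nic_attrs(lsdef_lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group nic* attributes from `lsdef <node>` output by interface name."""
    nics: Dict[str, Dict[str, str]] = {}
    for line in lsdef_lines:
        key, sep, value = line.strip().partition("=")
        if not sep or not key.startswith("nic") or "." not in key:
            continue
        # Split at the first dot only: "nicips.bond1.3128" -> ("nicips", "bond1.3128").
        attr, _, ifname = key.partition(".")
        nics.setdefault(ifname, {})[attr] = value
    return nics


def bonding_opts(extraparams: str) -> Dict[str, str]:
    """Turn a nicextraparams value into a dict of bonding options, naming the mode."""
    # shlex drops the quotes: 'BONDING_OPTS="mode=4;miimon=100" MTU=1500'
    # -> ['BONDING_OPTS=mode=4;miimon=100', 'MTU=1500']
    params = dict(tok.partition("=")[::2] for tok in shlex.split(extraparams))
    opts = dict(item.partition("=")[::2]
                for item in params.get("BONDING_OPTS", "").split(";") if item)
    if opts.get("mode") in BOND_MODES:
        opts["mode"] = BOND_MODES[opts["mode"]]
    return opts
```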
lsdef -t network ext
Object name: ext
    domain=foo.net1.net
    gateway=10.231.16.1
    mask=255.255.254.0
    mtu=1500
    net=10.231.16.0

lsdef -t network ib
Object name: ib
    domain=dssg.local
    mask=255.255.254.0
    mtu=4092
    net=172.16.138.0
    nodehostname=-ib
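Given the network definitions above, a quick sanity check is whether an address (a node IP from nicips, or a gateway) actually falls inside a network's net/mask. A minimal Python sketch using the standard `ipaddress` module; the function name is hypothetical:

```python
import ipaddress


def gateway_on_subnet(net: str, mask: str, gateway: str) -> bool:
    """Return True if `gateway` lies inside the network `net`/`mask`."""
    # ip_network accepts a dotted netmask after the slash, e.g. 10.231.16.0/255.255.254.0.
    subnet = ipaddress.ip_network(f"{net}/{mask}")
    return ipaddress.ip_address(gateway) in subnet
```

By this arithmetic, the `ext` network is 10.231.16.0/23 (10.231.16.0 through 10.231.17.255): its own gateway 10.231.16.1 and nicips.bond1.3128 (10.231.16.47) fall inside it, while 10.99.92.1, the gateway listed for bond1.3128 in the routes table, does not. If those are the real values rather than sanitized ones, that mismatch may be worth ruling out alongside the NetworkManager behavior.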
tabdump routes
routename,net,mask,gateway,ifname,comments,disable
"ext","0.0.0.0","0.0.0.0","10.99.92.1","bond1.3128",,