xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Failed to update node status during provision when there are more than 1 NICs in the provision network #3517

Open whowutwut opened 7 years ago

whowutwut commented 7 years ago

Somewhat related to #3150.... but now that the servicenode postscript is added to the node definition, xCAT is not able to detect errors when running the servicenode postscript and bubble up the error code...

Recreate error...

  1. Provision a service node and make sure it comes up OK...
  2. copycds a new osimage
  3. Forget to add the xcat software into the otherpkgdir (intentionally)
  4. Provision the service node
  5. commands against the service node will fail....
    Error: Unable to dispatch hierarchical sub-command to sn02:3001.  Error: Connection failure: IO::Socket::INET: connect: Connection refused at /opt/xcat/lib/perl/xCAT/Client.pm line 248.
  6. Looking at the service node, there is no indication of an error...
    [root@fs3 consoles]# lsdef sn02 -i status
    Object name: sn02
    status=powering-on

    On the node itself.. /var/log/xcat/xcat.log... No error in the servicenode postscript...

/etc/systemd/system/xcatpostinit1.service generated
xcatpostinit1.service enabled
Wed Jul 19 09:37:19 EDT 2017 Running postscript: syslog
Wed Jul 19 09:37:19 EDT 2017 postscript syslog return with 0
Wed Jul 19 09:37:19 EDT 2017 Running postscript: remoteshell
Wed Jul 19 09:37:21 EDT 2017 postscript remoteshell return with 0
Wed Jul 19 09:37:21 EDT 2017 Running postscript: syncfiles
Wed Jul 19 09:37:21 EDT 2017 postscript syncfiles return with 0
Wed Jul 19 09:37:21 EDT 2017 Running postscript: mlnxofed_ib_install
Wed Jul 19 09:37:24 EDT 2017 postscript mlnxofed_ib_install return with 0
Wed Jul 19 09:37:24 EDT 2017 Running postscript: hardeths
Wed Jul 19 09:37:24 EDT 2017 postscript hardeths return with 0
Wed Jul 19 09:37:24 EDT 2017 Running postscript: confignics
Wed Jul 19 09:37:32 EDT 2017 postscript confignics return with 0
Wed Jul 19 09:37:32 EDT 2017 Running postscript: servicenode
Wed Jul 19 09:37:35 EDT 2017 postscript servicenode return with 0
running /xcatpost/mypostscript.post
Wed Jul 19 09:41:31 EDT 2017 Running postbootscript: otherpkgs
Wed Jul 19 09:41:35 EDT 2017 postbootscript otherpkgs return with 4
/xcatpost/mypostscript.post return

I don't know what postbootscript return code 4 means, and how is it bubbled up to the MN? xCAT should be the single point of control, so the status should be reflected somewhere in the node definition.
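Until the status is reflected in the node definition, the log above can at least be scanned for non-zero return codes. This is a minimal sketch, assuming the line format seen in the excerpt (`... postscript|postbootscript <name> return with <rc>`); the function name `failed_scripts` is hypothetical and not part of xCAT:

```python
import re

# Matches lines such as:
#   "Wed Jul 19 09:41:35 EDT 2017 postbootscript otherpkgs return with 4"
# The format is copied from the log excerpt in this issue; it is an
# assumption that other xCAT versions log the same way.
RETURN_LINE = re.compile(
    r"(postscript|postbootscript) (\S+) return with (-?\d+)\s*$"
)


def failed_scripts(log_lines):
    """Return (kind, name, rc) for every script that returned non-zero."""
    failures = []
    for line in log_lines:
        match = RETURN_LINE.search(line)
        if match and int(match.group(3)) != 0:
            failures.append(
                (match.group(1), match.group(2), int(match.group(3)))
            )
    return failures


sample = [
    "Wed Jul 19 09:37:35 EDT 2017 postscript servicenode return with 0",
    "running /xcatpost/mypostscript.post",
    "Wed Jul 19 09:41:35 EDT 2017 postbootscript otherpkgs return with 4",
    "/xcatpost/mypostscript.post return",
]
for kind, name, rc in failed_scripts(sample):
    print(f"{kind} {name} returned {rc}")
```

This only reports that a script failed; it does not say what return code 4 from otherpkgs means.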

immarvin commented 7 years ago

hi @whowutwut, the status "powering-on" after provisioning indicates that sn02 hit a problem while trying to report its status during provision. I think this is an issue in the scenario where the provisioned node has multiple NICs in the same subnet.

the journal log of xcatd on fs3 uncovered the real problem:

[root@fs3 ~]# systemctl status xcatd -l
● xcatd.service - LSB: xcatd
   Loaded: loaded (/etc/rc.d/init.d/xcatd; bad; vendor preset: disabled)
   Active: active (running) since Wed 2017-07-19 23:46:01 EDT; 1 day 1h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 56539 ExecStop=/etc/rc.d/init.d/xcatd stop (code=exited, status=0/SUCCESS)
  Process: 56607 ExecStart=/etc/rc.d/init.d/xcatd start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/xcatd.service
           ├─56641 /usr/sbin/in.tftpd -v -l -s /tftpboot -m /etc/tftpmapfile4xcat.conf
           ├─56642 xcatd: SSL listene
           ├─56643 xcatd: DB Acces
           ├─56645 xcatd: UDP listene
           ├─56646 xcatd: install monito
           ├─56647 xcatd: Discovery worke
           └─56648 xcatd: Command log write

Jul 21 01:06:11 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:11 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:21 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:21 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:31 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:31 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:41 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:41 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:52 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:52 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored

The output above shows that the IP address sn02 uses to connect to fs3 and update its status reverse-resolves to sn02-enP1p1s0f4, which is not the node object name "sn02" defined on the MN. According to the code logic, xcatd ignores such a connection, so the status of sn02 is never updated.
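The check described here can be sketched as follows. This is not the xcatd code (xcatd is written in Perl); it is an illustration of the behaviour described in this comment. The function name `identify_node`, the injected `reverse_lookup` callable, and the IP-to-hostname pairs in the demo are all assumptions made for the example:

```python
def identify_node(peer_ip, reverse_lookup, nodelist):
    """Return the node name for peer_ip, or None if the request is ignored.

    reverse_lookup is any callable mapping an IP string to a hostname
    (or None). It is injected so the sketch runs without real DNS.
    """
    hostname = reverse_lookup(peer_ip)
    if hostname is None or hostname not in nodelist:
        # Mirrors the logged behaviour: "can not be found in xCAT nodelist
        # table. The connection request will be ignored"
        return None
    return hostname


# Illustrative data only: which name each address resolves to is assumed.
fake_dns = {
    "172.21.254.2": "sn02",
    "172.21.254.22": "sn02-enP1p1s0f4",
}
nodelist = {"sn02"}

print(identify_node("172.21.254.2", fake_dns.get, nodelist))
print(identify_node("172.21.254.22", fake_dns.get, nodelist))
```

With this data the first call identifies sn02, while the second returns None because the per-NIC hostname is not a node object name.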

there are multiple nics on sn02:

[root@sn02 xcatpost]# ip -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
2: enP9p7s0f0    inet 172.21.254.2/16 brd 172.21.255.255 scope global enP9p7s0f0\       valid_lft forever preferred_lft forever
2: enP9p7s0f0    inet6 fe80::72e2:84ff:fe14:a71/64 scope link \       valid_lft forever preferred_lft forever
4: enP8p1s0f0    inet 172.20.254.2/16 brd 172.20.255.255 scope global enP8p1s0f0\       valid_lft forever preferred_lft forever
4: enP8p1s0f0    inet6 fe80::e61d:2dff:fefd:b884/64 scope link \       valid_lft forever preferred_lft forever
6: enP1p1s0f4    inet 172.21.254.22/16 brd 172.21.255.255 scope global enP1p1s0f4\       valid_lft forever preferred_lft forever
6: enP1p1s0f4    inet6 fe80::5ef3:fcff:fe32:b490/64 scope link \       valid_lft forever preferred_lft forever

while sn02 tries to connect to fs3(172.21.253.27) to update its status, the source ip address is determined by the route table.

[root@sn02 xcatpost]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    0      0        0 enP1p1s0f4
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 enP9p7s0f0
link-local      0.0.0.0         255.255.0.0     U     1004   0        0 enP8p1s0f0
link-local      0.0.0.0         255.255.0.0     U     1006   0        0 enP1p1s0f4
172.20.0.0      0.0.0.0         255.255.0.0     U     0      0        0 enP8p1s0f0
172.21.0.0      0.0.0.0         255.255.0.0     U     0      0        0 enP1p1s0f4
172.21.0.0      0.0.0.0         255.255.0.0     U     0      0        0 enP9p7s0f0
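To see which source address the kernel actually picks for a given destination, a small probe can be run on the node. This is a sketch using a common technique (calling `connect()` on a UDP socket, which sends no packets but makes the kernel select a route and a local address); the function name `source_ip_for` is hypothetical, and port 3001 is taken from the error message earlier in this issue:

```python
import socket


def source_ip_for(destination, port=3001):
    """Return the local IPv4 address the kernel would use to reach destination."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        # connect() on a UDP socket transmits nothing; it only records the
        # peer and lets the kernel choose the route and source address.
        probe.connect((destination, port))
        return probe.getsockname()[0]
```

Running `source_ip_for("172.21.253.27")` on sn02 would show which of the two addresses in 172.21.0.0/16 is used for traffic to fs3.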

immarvin commented 7 years ago

We might need to find a way to associate the hostnames or aliases of all the NICs on a node with that node.
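One possible shape for such an association, as a rough sketch only: keep a map from every NIC hostname back to the node object name and consult it when a status update arrives. Nothing here is existing xCAT code; `build_alias_map`, `resolve_node`, and the hostname `sn02-enP8p1s0f0` are hypothetical names used for illustration:

```python
def build_alias_map(node_nics):
    """Map each node name and each of its NIC hostnames to the node name."""
    alias_to_node = {}
    for node, nic_hostnames in node_nics.items():
        alias_to_node[node] = node
        for alias in nic_hostnames:
            alias_to_node[alias] = node
    return alias_to_node


def resolve_node(hostname, alias_to_node):
    """Return the owning node for a hostname, or None if it is unknown."""
    return alias_to_node.get(hostname)


# Illustrative input: the per-NIC hostnames are assumed, not read from xCAT.
aliases = build_alias_map({"sn02": ["sn02-enP1p1s0f4", "sn02-enP8p1s0f0"]})
print(resolve_node("sn02-enP1p1s0f4", aliases))
```

With a map like this, a connection that reverse-resolves to sn02-enP1p1s0f4 could still be attributed to sn02 instead of being ignored.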

immarvin commented 7 years ago

hi @whowutwut , since the network plan has been changed and there won't be 2 NICs on SN in the provision network anymore, can we change the priority to "LOW"?

whowutwut commented 7 years ago

@immarvin not sure why it was the network that caused the problem? Do you mean that the xCAT-sn software was successfully installed? or it did not install because of connection issues and the real problem is the connection issue?

In any case, whatever the reason that causes the rpms to not get installed on the node, shouldn't we have the servicenode postscript print either a success message or a failure message into the logs?

The servicenode postscript's job is to install/configure the SN; if it does not do that, it should never return 0....

whowutwut commented 7 years ago

If we do step 3

Forget to add the xcat software into the otherpkgdir (intentionally)

but we don't have complicated networking, will the servicenode postscript return non-zero?

immarvin commented 7 years ago

The xCATsn package is installed by otherpkgs, which is a postbootscript; servicenode is a postscript. Postscripts run at the end of OS provisioning but before the first reboot, and postbootscripts run at the end of the first reboot, so servicenode has no chance to check whether xCATsn was installed. In fact, the servicenode script obtains some SSL certs and ssh keys and finishes some configuration as preparation for the service node; then, once the xCATsn package is installed by otherpkgs, that configuration takes effect and the node becomes a service node.

immarvin commented 7 years ago

Regarding "not sure why it was the network that caused the problem?": the problem is updateflag. According to the xcatd code logic, when the CN reports its status with updateflag, a TCP socket is established to the MN, and the MN performs a reverse DNS lookup on the socket's source IP address to identify which node is reporting. If the CN has 2 NICs in the same subnet as the MN, the source IP of the socket depends on the CN's route table, and there is no guarantee which NIC's route entry will be matched first.