Open whowutwut opened 7 years ago
Hi @whowutwut, the status "powering-on" after provisioning indicates that sn02 hit a problem while trying to report its status during provisioning. I think this is an issue in the scenario where the provisioned node has multiple NICs within the same subnet.
The journal log of xcatd on fs3 uncovered the real problem:
[root@fs3 ~]# systemctl status xcatd -l
● xcatd.service - LSB: xcatd
Loaded: loaded (/etc/rc.d/init.d/xcatd; bad; vendor preset: disabled)
Active: active (running) since Wed 2017-07-19 23:46:01 EDT; 1 day 1h ago
Docs: man:systemd-sysv-generator(8)
Process: 56539 ExecStop=/etc/rc.d/init.d/xcatd stop (code=exited, status=0/SUCCESS)
Process: 56607 ExecStart=/etc/rc.d/init.d/xcatd start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/xcatd.service
├─56641 /usr/sbin/in.tftpd -v -l -s /tftpboot -m /etc/tftpmapfile4xcat.conf
├─56642 xcatd: SSL listene
├─56643 xcatd: DB Acces
├─56645 xcatd: UDP listene
├─56646 xcatd: install monito
├─56647 xcatd: Discovery worke
└─56648 xcatd: Command log write
Jul 21 01:06:11 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:11 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:21 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:21 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:31 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:31 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:41 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:41 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:52 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
Jul 21 01:06:52 fs3 xcat[56646]: xcatd received a connection request from sn02-enP1p1s0f4, which can not be found in xCAT nodelist table. The connection request will be ignored
The information above shows that the IP address sn02 uses when connecting to fs3 to update its status is reverse-resolved to sn02-enP1p1s0f4. This is not the node object name "sn02" defined on the MN, so according to the code logic, xcatd ignores this connection and the status of sn02 is not updated.
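The behavior described above can be sketched roughly as follows. This is only an illustration in Python, not xcatd's actual code; the function names `reverse_lookup` and `identify_node` are hypothetical, and the IP-to-name mapping for "sn02" in the usage note is an assumption made for the example.

```python
import socket
from typing import Callable, Optional, Set


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the short hostname the IP reverse-resolves to, or None."""
    try:
        fqdn, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return None
    return fqdn.split(".")[0]


def identify_node(
    source_ip: str,
    nodelist: Set[str],
    resolver: Callable[[str], Optional[str]] = reverse_lookup,
) -> Optional[str]:
    """Map a connection's source IP to a defined node name (hypothetical helper).

    Returns None when the reverse-resolved name is not a defined node,
    which corresponds to the "connection request will be ignored" case.
    """
    name = resolver(source_ip)
    return name if name in nodelist else None
```

With a stand-in resolver (the "sn02" entry is assumed for illustration), a request arriving from the second NIC's address is not matched to any node:

```python
fake_dns = {
    "172.21.254.2": "sn02",              # assumed mapping
    "172.21.254.22": "sn02-enP1p1s0f4",  # name seen in the xcatd log
}
nodes = {"sn02"}
print(identify_node("172.21.254.2", nodes, fake_dns.get))   # sn02
print(identify_node("172.21.254.22", nodes, fake_dns.get))  # None
```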
There are multiple NICs on sn02:
[root@sn02 xcatpost]# ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: enP9p7s0f0 inet 172.21.254.2/16 brd 172.21.255.255 scope global enP9p7s0f0\ valid_lft forever preferred_lft forever
2: enP9p7s0f0 inet6 fe80::72e2:84ff:fe14:a71/64 scope link \ valid_lft forever preferred_lft forever
4: enP8p1s0f0 inet 172.20.254.2/16 brd 172.20.255.255 scope global enP8p1s0f0\ valid_lft forever preferred_lft forever
4: enP8p1s0f0 inet6 fe80::e61d:2dff:fefd:b884/64 scope link \ valid_lft forever preferred_lft forever
6: enP1p1s0f4 inet 172.21.254.22/16 brd 172.21.255.255 scope global enP1p1s0f4\ valid_lft forever preferred_lft forever
6: enP1p1s0f4 inet6 fe80::5ef3:fcff:fe32:b490/64 scope link \ valid_lft forever preferred_lft forever
When sn02 tries to connect to fs3 (172.21.253.27) to update its status, the source IP address is determined by the routing table:
[root@sn02 xcatpost]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default gateway 0.0.0.0 UG 0 0 0 enP1p1s0f4
link-local 0.0.0.0 255.255.0.0 U 1002 0 0 enP9p7s0f0
link-local 0.0.0.0 255.255.0.0 U 1004 0 0 enP8p1s0f0
link-local 0.0.0.0 255.255.0.0 U 1006 0 0 enP1p1s0f4
172.20.0.0 0.0.0.0 255.255.0.0 U 0 0 0 enP8p1s0f0
172.21.0.0 0.0.0.0 255.255.0.0 U 0 0 0 enP1p1s0f4
172.21.0.0 0.0.0.0 255.255.0.0 U 0 0 0 enP9p7s0f0
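To make the ambiguity in this table concrete, here is a simplified model of route selection (longest prefix first, then lowest metric) applied to the entries above. It is a sketch only: `Route` and `best_routes` are hypothetical names, and the model does not reproduce how the kernel breaks a tie between equally good routes.

```python
import ipaddress
from typing import List, NamedTuple


class Route(NamedTuple):
    network: str
    metric: int
    iface: str


# Entries transcribed from the `route` output on sn02.
# "default" is 0.0.0.0/0 and "link-local" is 169.254.0.0/16.
ROUTES = [
    Route("0.0.0.0/0", 0, "enP1p1s0f4"),
    Route("169.254.0.0/16", 1002, "enP9p7s0f0"),
    Route("169.254.0.0/16", 1004, "enP8p1s0f0"),
    Route("169.254.0.0/16", 1006, "enP1p1s0f4"),
    Route("172.20.0.0/16", 0, "enP8p1s0f0"),
    Route("172.21.0.0/16", 0, "enP1p1s0f4"),
    Route("172.21.0.0/16", 0, "enP9p7s0f0"),
]


def best_routes(dest: str, routes: List[Route]) -> List[Route]:
    """Return every route tied for best match: longest prefix, then lowest metric."""
    addr = ipaddress.ip_address(dest)
    matching = [r for r in routes if addr in ipaddress.ip_network(r.network)]
    if not matching:
        return []

    def rank(r: Route):
        return (-ipaddress.ip_network(r.network).prefixlen, r.metric)

    best = min(rank(r) for r in matching)
    return [r for r in matching if rank(r) == best]


for r in best_routes("172.21.253.27", ROUTES):
    print(r.iface)
```

For the fs3 address, two interfaces (enP1p1s0f4 and enP9p7s0f0) tie on both prefix length and metric, so the table alone does not say which NIC, and therefore which source IP, will be used.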
We might need to find a way to associate the hostnames or aliases of all the NICs on a node with that node.
Hi @whowutwut, since the network plan has changed and there won't be 2 NICs on the SN in the provision network anymore, can we change the priority to "LOW"?
@immarvin not sure why it was the network that caused the problem? Do you mean that the xCAT-sn software was successfully installed? Or did it not install because of connection issues, and the real problem is the connection issue?
In any case, whatever the reason that causes the rpms to not get installed on the node, shouldn't we have the servicenode postscript print either a success message or a failure message into the logs?
The servicenode postscript's job is to install/configure the SN; if it does not do that, it should never return 0....
If we do step 3 (intentionally forget to add the xCAT software into the otherpkgdir) but we don't have complicated networking, will the servicenode postscript return non-zero?
The xCATsn package is installed by otherpkgs, which is a postbootscript; servicenode is a postscript. Postscripts run at the end of OS provisioning but before the first reboot, while postbootscripts run at the end of the first reboot, so there is no chance for servicenode to check the installation of xCATsn. In fact, the servicenode script obtains some SSL certs and ssh keys and finishes some configuration as preparation for the service node; then, when the xCATsn package is installed by otherpkgs, all of this configuration takes effect and the node becomes a service node.
not sure why it was the network that caused the problem?
The problem is updateflag. According to the xcatd code logic, when the CN reports its status with updateflag, a TCP socket is established to the MN, and the MN then performs a reverse DNS lookup on the source IP address of the socket to identify which node is reporting its status. If there are 2 NICs on the CN inside the same subnet as the MN, the source IP of the socket depends on the routing table on the CN, and there is no guarantee which NIC's route entry will be matched first.
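One way to check which source address a node would use toward the MN is to ask the kernel directly. The sketch below uses a connected UDP socket, which performs the route lookup without sending any packet; `source_ip_for` is a hypothetical helper name and the port number is arbitrary. It is a diagnostic aid under those assumptions, not part of xCAT.

```python
import socket


def source_ip_for(dest_ip: str, port: int = 9) -> str:
    """Return the local IP the kernel would use as the source toward dest_ip.

    connect() on a UDP socket only selects a route and binds a local
    address; no traffic is sent to the destination.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_ip, port))
        return s.getsockname()[0]


if __name__ == "__main__":
    # On sn02 this could be run against the fs3 address from the thread.
    print(source_ip_for("127.0.0.1"))
```

Running it on the CN with the MN's address as `dest_ip` and then reverse-resolving the result would show whether the MN will see the node object name or a per-NIC hostname.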
Somewhat related to #3150... but now that the servicenode postscript is added to the node definition, xCAT is not able to detect errors when running the servicenode postscript and bubble up the error code... Recreate error...
On the node itself..
/var/log/xcat/xcat.log
... No error in the servicenode postscript... I don't know what postbootscript return code 4 is, and how is this bubbled up to the MN? xCAT should be a single point of control, so the status should be reflected in the node definition somewhere.