xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0
356 stars 170 forks source link

[fvt] xcatd failed to start in rhels7.5-snap2 diskless service node #4929

Closed hu-weihua closed 6 years ago

hu-weihua commented 6 years ago

xCAT build : Version 2.13.11 (git commit d0ed517ee20e271e05ba6ffd06f4e3deb0c240c0, built Fri Mar 9 07:15:44 EST 2018)

MN OS: Red Hat Enterprise Linux Server release 7.4 (Maipo) SN OS: rhels7.5-snap2-ppc64le

For xCAT 20180309 build, after installing diskless SN, the xcatd in SN failed to start.

This issue also happened on rhels7.4-alt-rc1 xCAT 20180308 build, no matter with xCAT-openbmc-py or without xCAT-openbmc-py.

[root@sn02 ~]# xcatd -f
Subroutine unsupported redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 102.
Subroutine handled_commands redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 120.
Subroutine preprocess_request redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 721.
Subroutine retry_after redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 837.
Subroutine retry_check_times redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 853.
Subroutine process_request redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 895.
Subroutine parse_args redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 1120.
Subroutine parse_command_status redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 1422.
Subroutine fork_process_login redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2117.
Subroutine get_functional_software_ids redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2148.
Subroutine parse_node_info redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2175.
Subroutine gen_send_request redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2256.
Subroutine deal_with_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2329.
Subroutine process_debug_info redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2459.
Subroutine login_request redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2482.
Subroutine login_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2518.
Subroutine rpower_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2544.
Subroutine rinv_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2741.
Subroutine getopenbmccons redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2901.
Subroutine rsetboot_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2932.
Subroutine rbeacon_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 2991.
Subroutine reventlog_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3030.
Subroutine is_callout_event_data redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3145.
Subroutine parse_event_data redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3171.
Subroutine rspconfig_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3250.
Subroutine rspconfig_api_config_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3577.
Subroutine rspconfig_sshcfg_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3683.
Subroutine sshcfg_process redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3722.
Subroutine rspconfig_dump_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3772.
Subroutine dump_download_process redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3907.
Subroutine dump_download_all_process redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 3990.
Subroutine rvitals_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 4012.
Subroutine rflash_response redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 4145.
Subroutine rflash_upload redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 4487.
Subroutine is_valid_config_api redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 4607.
Subroutine build_config_api_usage redefined at /opt/xcat/lib/perl/xCAT_plugin/openbmc.pm line 4639.
Starting conserver (via systemctl):  Job for conserver.service failed because the control process exited with error code. See "systemctl status conserver.service" and "journalctl -xe" for details.
                                                           [FAILED]
 dns server has been enabled on boot.
Unable to log  dns server has been enabled on boot. to syslog because of no connection to syslog available
    - unix dgram connect: Connection refused
    - stream can't open /dev/log: No such device or address at /opt/xcat/lib/perl/xCAT/MsgUtils.pm line 479.

Renamed existing dhcp configuration file to  /etc/dhcp/dhcpd.conf.xcatbak

xxWarning: The ip address 172.21.253.104 of node f6pdu15 overlaps with the DHCP dynamic range specified in networks table, will not add this ip address into dhcpd.leases file.
Warning: The ip address 172.21.253.105 of node f6pdu16 overlaps with the DHCP dynamic range specified in networks table, will not add this ip address into dhcpd.leases file.
ERROR: no connection to syslog available
    - unix dgram connect: Connection refused
    - stream can't open /dev/log: No such device or address at /opt/xcat/lib/perl/xCAT/MsgUtils.pm line 809.

The related definitions are:

[root@boston02 ppc64le]# lsdef -t osimage  rhels7.5-snap2-ppc64le-netboot-service
Object name: rhels7.5-snap2-ppc64le-netboot-service
    exlist=/opt/xcat/share/xcat/netboot/rh/service.rhels7.ppc64le.exlist
    imagetype=linux
    osarch=ppc64le
    osdistroname=rhels7.5-snap2-ppc64le
    osname=Linux
    osvers=rhels7.5-snap2
    otherpkgdir=/install/software/xcat/2.13.9
    otherpkglist=/install/software/xcat/2.13.9/service.rhels7.ppc64le.otherpkgs.pkglist
    permission=755
    pkgdir=/install/rhels7.5-snap2/ppc64le,/install/xcat-dep-python/epel/ppc64le,/install/xcat-dep-python/extras/ppc64le,/install/xcat-dep-python/fedora28/ppc64le
    pkglist=/opt/xcat/share/xcat/netboot/rh/service.rhels7.ppc64le.pkglist
    postinstall=/opt/xcat/share/xcat/netboot/rh/service.rhels7.ppc64le.postinstall
    postscripts=servicenode
    profile=service
    provmethod=netboot
    rootimgdir=/install/netboot/rhels7.5-snap2-ppc64le/ppc64le/service

[root@boston02 ppc64le]# lsdef sn02
Object name: sn02
    arch=ppc64le
    bmc=172.21.254.102
    chain=runcmd=bmcsetup,osimage=rhels7.5-snap2-ppc64le-install-service
    cons=openbmc
    consoleenabled=1
    cpucount=176
    cputype=POWER9 (raw), altivec supported
    currchain=boot
    currstate=netboot rhels7.5-snap2-ppc64le-service
    disksize=sda:932GB,sdb:932GB
    groups=all,service,coral_sn_regex
    ip=172.20.254.2
    mac=70:e2:84:14:24:75!*NOIP*|24:8a:07:a0:ae:e4
    memory=523462MB
    mgt=openbmc
    netboot=petitboot
    nicips.enP5p1s0f0=172.21.254.2
    nicnetworks.enP5p1s0f0=172_21_0_0-255_255_0_0
    nictypes.enP5p1s0f0=ethernet
    nodetype=mp
    os=rhels7.5-snap2
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles,servicenode,confignetwork
    profile=service
    provmethod=rhels7.5-snap2-ppc64le-netboot-service
    setupconserver=1
    setupdhcp=1
    setupnameserver=1
    setupnfs=1
    setuptftp=1
    status=powering-on
    statustime=03-14-2018 02:41:59
    supportedarchs=ppc64le
    switch=mid08
    switchport=52
    syslog=ignore
    updatestatus=failed
    updatestatustime=03-13-2018 22:23:20
    xcatmaster=172.20.253.31
robin2008 commented 6 years ago

Found that it caused by the syslog, and it seems the /dev/syslog issue has also reported many times.

https://unix.stackexchange.com/questions/317064/how-do-i-restore-dev-log-in-systemdrsyslog-host

Refer to the steps, we can workaround it

systemctl restart systemd-journald.socket && systemctl restart rsyslog

Not sure why xcatd need to exit when it cannot connect syslog, we can see below codes in xcatd.

unless (xCAT::Utils->isLinux()) { # messes up the output of the service cmd on linux
    eval {
        xCAT::MsgUtils->message("S", "xcatd: service starting");
    };
}
if ($@) {
    print "ERROR: $@";
    xexit;             <=======  here exit directly when cannot log to syslog
}
hu-weihua commented 6 years ago

After using the workaround @robin2008 mentioned above, the python version hardware control commands can work in hierarchical cluster.

robin2008 commented 6 years ago

setup an new environment try to recreate this issue, and not found. The osimage is created by following http://xcat-docs.readthedocs.io/en/stable/advanced/hierarchy/provision/diskless_sn.html

lsdef sn02 -i provmethod
Object name: sn02
    provmethod=rhels7.5-rc-ppc64le-netboot-service

[root@sn02 ~]# systemctl status xcatd
● xcatd.service - LSB: xcatd
   Loaded: loaded (/etc/rc.d/init.d/xcatd; bad; vendor preset: disabled)
   Active: active (running) since Thu 2018-03-15 03:04:50 EDT; 4min 35s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 5083 ExecStart=/etc/rc.d/init.d/xcatd start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/xcatd.service
           ├─5335 xcatd: SSL listener
           ├─5336 xcatd: DB Access
           ├─5337 xcatd: UDP listener
           ├─5338 xcatd: install monitor
           ├─5339 xcatd: Discovery worker
           └─5340 xcatd: Command log writer

Mar 15 03:04:48 sn02.pok.stglabs.ibm.com systemd[1]: Starting LSB: xcatd...
Mar 15 03:04:49 sn02.pok.stglabs.ibm.com xcatd[5083]: Starting xcatd  dns server has been enabled on boot.
Mar 15 03:04:49 sn02.pok.stglabs.ibm.com xcat[5092]:  dns server has been enabled on boot.
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com xcatd[5083]: Job for httpd.service failed because the control process exited with error code. See "systemc...details.
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com xcat[5340]: INFO xcatd: command log process 5340 start
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com xcatd[5083]: [  OK  ]
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com systemd[1]: Started LSB: xcatd.
Hint: Some lines were ellipsized, use -l to show in full.

lsxcatd -a
Version 2.13.11 (git commit 6b51474e72575f0432973016d2114f07083e9356, built Tue Mar 13 07:45:52 EDT 2018)
This is a Service Node
cfgloc=Pg:dbname=xcatdb;host=172.10.253.27|xcatadm
robin2008 commented 6 years ago

Plan to export the image from briggs01 and import it to boston02 and see what's going on.

robin2008 commented 6 years ago

Using the valid stateless SN image on sn02/boston02, the rsyslog issue is still there. The issue might be related to environment. Need to try more P9 server.

robin2008 commented 6 years ago

Found tricky things there by @cxhong , when enabling xcatdebugmode, the service node cannot work as expected.

robin2008 commented 6 years ago

workaround the MsgUtil.pm first to make sure xcatd will not crash when /dev/log is not ready. it is addressed in https://github.com/xcat2/xcat-core/pull/4974

Later we will investigate why /dev/log is not working when debug mode is enabled.

tingtli commented 6 years ago

I hit this issue again. Here is the version information MN OS: Red Hat Enterprise Linux Server 7.4 (Maipo) SN OS: Red Hat Enterprise Linux Server 7.5 (Maipo) xCAT Version

[root@sn02 ~]# lsxcatd -v
Version 2.14 (git commit f2958a9ef534a144231ae437206dbe1f3fabaeaf, built Mon Apr  2 06:15:46 EDT 2018)
robin2008 commented 6 years ago

Yes, with debugmode=1, we will still hit such issue, and python agent will try to log to syslog, so it exit with exception.

Traceback (most recent call last):
  File "/opt/xcat/lib/python/agent/xcatagent/server.py", line 87, in _handle
    req['module'])
  File "/opt/xcat/lib/python/agent/xcatagent/base.py", line 16, in get_manager_func
    __import__(module_name)
  File "/usr/lib64/python2.7/site-packages/gevent/builtins.py", line 93, in __import__
    result = _import(*args, **kwargs)
  File "/opt/xcat/lib/python/agent/xcatagent/openbmc.py", line 37, in <module>
    utils.enableSyslog('xcat.agent')
  File "/opt/xcat/lib/python/agent/common/utils.py", line 26, in enableSyslog
    h = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_LOCAL4)
  File "/usr/lib64/python2.7/logging/handlers.py", line 761, in __init__
    self._connect_unixsocket(address)
  File "/usr/lib64/python2.7/logging/handlers.py", line 789, in _connect_unixsocket
    self.socket.connect(address)
  File "/usr/lib64/python2.7/site-packages/gevent/_socket2.py", line 238, in connect
    raise error(result, strerror(result))
error: [Errno 111] Connection refused
robin2008 commented 6 years ago

And I think it is an general issue, not only for diskless service node

[root@mid05tor12cn16 ~]# cat /proc/cmdline
imgurl=http://172.12.253.27:80//install/ansible/custom/rhels7.5-alt-rc1-ppc64le-ANSIBLE-netboot-compute/rootimgdir/rootimg.cpio.gz XCAT=172.12.253.27:3001 NODE=mid05tor12cn16 FC=0  LOGSERVER=172.12.253.27  syslog.server=172.12.253.27 syslog.type=rsyslogd syslog.filter=*.*  xcatdebugmode=1 BOOTIF=70:e2:84:14:73:30  selinux=0

[root@mid05tor12cn16 ~]# python
Python 2.7.5 (default, Feb 20 2018, 09:20:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> from logging.handlers import SysLogHandler
>>> h = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_LOCAL4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/logging/handlers.py", line 761, in __init__
    self._connect_unixsocket(address)
  File "/usr/lib64/python2.7/logging/handlers.py", line 789, in _connect_unixsocket
    self.socket.connect(address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
>>>
[root@mid05tor12cn16 ~]# systemctl status rsyslog
● rsyslog.service - System Logging Service
   Loaded: loaded (/usr/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2018-03-30 03:31:28 EDT; 3 days ago
     Docs: man:rsyslogd(8)
           http://www.rsyslog.com/doc/
 Main PID: 4020 (rsyslogd)
   CGroup: /system.slice/rsyslog.service
           └─4020 /usr/sbin/rsyslogd -n

Mar 30 03:31:28 mid05tor12cn16.pok.stglabs.ibm.com systemd[1]: Stopping System Logging Service...
Mar 30 03:31:28 mid05tor12cn16.pok.stglabs.ibm.com systemd[1]: Starting System Logging Service...
Mar 30 03:31:28 mid05tor12cn16.pok.stglabs.ibm.com systemd[1]: Started System Logging Service.
[root@mid05tor12cn16 ~]# ls -l /dev/log
srw-rw-rw- 1 root root 0 Mar 30 03:30 /dev/log

[root@mid05tor12cn16 ~]# systemctl status systemd-journald.socket
● systemd-journald.socket - Journal Socket
   Loaded: loaded (/usr/lib/systemd/system/systemd-journald.socket; static; vendor preset: disabled)
   Active: active (running) since Fri 2018-03-30 03:30:30 EDT; 3 days ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
   Listen: /run/systemd/journal/stdout (Stream)
           /run/systemd/journal/socket (Datagram)
           /dev/log (Datagram)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

This is a diskless compute node under debug mode.

robin2008 commented 6 years ago

After some investigation: if we stop "systemd-journnald.socket" service in before switch_root, then the /dev/log will be okay.

How to do that: 1, add "rd.break=cleanup" in append line for the petitboot conf file of the diskless node, like /tftpboot/petitboot/mid05tor12cn16 2, reboot the node and run rcons for it 3, wait the node enter shell mode 4, run systemctl stop systemd-journnald.socket and then exit, exit 5, wait the node boot up, and try to open /dev/log, now it is working

[root@mid05tor12cn16 ~]# python
Python 2.7.5 (default, Feb 20 2018, 09:20:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> from logging.handlers import SysLogHandler
>>> h = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_LOCAL4)
>>>
robin2008 commented 6 years ago

It is required to create a cleanup hook for dracut.

robin2008 commented 6 years ago

Modify rsyslog configuration in dracut to load imjournal, and then rsyslog will not close /dev/log.

robin2008 commented 6 years ago

@hu-weihua could you help to verify it? The fix will impact RHEL7 on both ppc64le and x86_64. Note: only for debugmode

hu-weihua commented 6 years ago

Have been verified by 20180412.0715 build. thanks @robin2008