Closed hu-weihua closed 6 years ago
Found that it caused by the syslog, and it seems the /dev/syslog issue has also reported many times.
https://unix.stackexchange.com/questions/317064/how-do-i-restore-dev-log-in-systemdrsyslog-host
Refer to the steps, we can workaround it
systemctl restart systemd-journald.socket && systemctl restart rsyslog
Not sure why xcatd need to exit when it cannot connect syslog, we can see below codes in xcatd.
unless (xCAT::Utils->isLinux()) { # messes up the output of the service cmd on linux
eval {
xCAT::MsgUtils->message("S", "xcatd: service starting");
};
}
if ($@) {
print "ERROR: $@";
xexit; <======= here exit directly when cannot log to syslog
}
After using the workaround @robin2008 mentioned above, the python version hardware control commands can work in hierarchical cluster.
setup an new environment try to recreate this issue, and not found. The osimage is created by following http://xcat-docs.readthedocs.io/en/stable/advanced/hierarchy/provision/diskless_sn.html
lsdef sn02 -i provmethod
Object name: sn02
provmethod=rhels7.5-rc-ppc64le-netboot-service
[root@sn02 ~]# systemctl status xcatd
● xcatd.service - LSB: xcatd
Loaded: loaded (/etc/rc.d/init.d/xcatd; bad; vendor preset: disabled)
Active: active (running) since Thu 2018-03-15 03:04:50 EDT; 4min 35s ago
Docs: man:systemd-sysv-generator(8)
Process: 5083 ExecStart=/etc/rc.d/init.d/xcatd start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/xcatd.service
├─5335 xcatd: SSL listener
├─5336 xcatd: DB Access
├─5337 xcatd: UDP listener
├─5338 xcatd: install monitor
├─5339 xcatd: Discovery worker
└─5340 xcatd: Command log writer
Mar 15 03:04:48 sn02.pok.stglabs.ibm.com systemd[1]: Starting LSB: xcatd...
Mar 15 03:04:49 sn02.pok.stglabs.ibm.com xcatd[5083]: Starting xcatd dns server has been enabled on boot.
Mar 15 03:04:49 sn02.pok.stglabs.ibm.com xcat[5092]: dns server has been enabled on boot.
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com xcatd[5083]: Job for httpd.service failed because the control process exited with error code. See "systemc...details.
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com xcat[5340]: INFO xcatd: command log process 5340 start
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com xcatd[5083]: [ OK ]
Mar 15 03:04:50 sn02.pok.stglabs.ibm.com systemd[1]: Started LSB: xcatd.
Hint: Some lines were ellipsized, use -l to show in full.
lsxcatd -a
Version 2.13.11 (git commit 6b51474e72575f0432973016d2114f07083e9356, built Tue Mar 13 07:45:52 EDT 2018)
This is a Service Node
cfgloc=Pg:dbname=xcatdb;host=172.10.253.27|xcatadm
Plan to export the image from briggs01 and import it to boston02 and see what's going on.
Using the valid stateless SN image on sn02/boston02, the rsyslog issue is still there. The issue might be related to environment. Need to try more P9 server.
Found tricky things there by @cxhong , when enabling xcatdebugmode, the service node cannot work as expected.
workaround the MsgUtil.pm first to make sure xcatd will not crash when /dev/log is not ready. it is addressed in https://github.com/xcat2/xcat-core/pull/4974
Later we will investigate why /dev/log is not working when debug mode is enabled.
I hit this issue again. Here is the version information MN OS: Red Hat Enterprise Linux Server 7.4 (Maipo) SN OS: Red Hat Enterprise Linux Server 7.5 (Maipo) xCAT Version
[root@sn02 ~]# lsxcatd -v
Version 2.14 (git commit f2958a9ef534a144231ae437206dbe1f3fabaeaf, built Mon Apr 2 06:15:46 EDT 2018)
Yes, with debugmode=1, we will still hit such issue, and python agent will try to log to syslog, so it exit with exception.
Traceback (most recent call last):
File "/opt/xcat/lib/python/agent/xcatagent/server.py", line 87, in _handle
req['module'])
File "/opt/xcat/lib/python/agent/xcatagent/base.py", line 16, in get_manager_func
__import__(module_name)
File "/usr/lib64/python2.7/site-packages/gevent/builtins.py", line 93, in __import__
result = _import(*args, **kwargs)
File "/opt/xcat/lib/python/agent/xcatagent/openbmc.py", line 37, in <module>
utils.enableSyslog('xcat.agent')
File "/opt/xcat/lib/python/agent/common/utils.py", line 26, in enableSyslog
h = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_LOCAL4)
File "/usr/lib64/python2.7/logging/handlers.py", line 761, in __init__
self._connect_unixsocket(address)
File "/usr/lib64/python2.7/logging/handlers.py", line 789, in _connect_unixsocket
self.socket.connect(address)
File "/usr/lib64/python2.7/site-packages/gevent/_socket2.py", line 238, in connect
raise error(result, strerror(result))
error: [Errno 111] Connection refused
And I think it is an general issue, not only for diskless service node
[root@mid05tor12cn16 ~]# cat /proc/cmdline
imgurl=http://172.12.253.27:80//install/ansible/custom/rhels7.5-alt-rc1-ppc64le-ANSIBLE-netboot-compute/rootimgdir/rootimg.cpio.gz XCAT=172.12.253.27:3001 NODE=mid05tor12cn16 FC=0 LOGSERVER=172.12.253.27 syslog.server=172.12.253.27 syslog.type=rsyslogd syslog.filter=*.* xcatdebugmode=1 BOOTIF=70:e2:84:14:73:30 selinux=0
[root@mid05tor12cn16 ~]# python
Python 2.7.5 (default, Feb 20 2018, 09:20:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> from logging.handlers import SysLogHandler
>>> h = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_LOCAL4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/logging/handlers.py", line 761, in __init__
self._connect_unixsocket(address)
File "/usr/lib64/python2.7/logging/handlers.py", line 789, in _connect_unixsocket
self.socket.connect(address)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
>>>
[root@mid05tor12cn16 ~]# systemctl status rsyslog
● rsyslog.service - System Logging Service
Loaded: loaded (/usr/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-30 03:31:28 EDT; 3 days ago
Docs: man:rsyslogd(8)
http://www.rsyslog.com/doc/
Main PID: 4020 (rsyslogd)
CGroup: /system.slice/rsyslog.service
└─4020 /usr/sbin/rsyslogd -n
Mar 30 03:31:28 mid05tor12cn16.pok.stglabs.ibm.com systemd[1]: Stopping System Logging Service...
Mar 30 03:31:28 mid05tor12cn16.pok.stglabs.ibm.com systemd[1]: Starting System Logging Service...
Mar 30 03:31:28 mid05tor12cn16.pok.stglabs.ibm.com systemd[1]: Started System Logging Service.
[root@mid05tor12cn16 ~]# ls -l /dev/log
srw-rw-rw- 1 root root 0 Mar 30 03:30 /dev/log
[root@mid05tor12cn16 ~]# systemctl status systemd-journald.socket
● systemd-journald.socket - Journal Socket
Loaded: loaded (/usr/lib/systemd/system/systemd-journald.socket; static; vendor preset: disabled)
Active: active (running) since Fri 2018-03-30 03:30:30 EDT; 3 days ago
Docs: man:systemd-journald.service(8)
man:journald.conf(5)
Listen: /run/systemd/journal/stdout (Stream)
/run/systemd/journal/socket (Datagram)
/dev/log (Datagram)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
This is a diskless compute node under debug mode.
After some investigation: if we stop "systemd-journnald.socket" service in before switch_root, then the /dev/log will be okay.
How to do that:
1, add "rd.break=cleanup" in append line for the petitboot conf file of the diskless node, like /tftpboot/petitboot/mid05tor12cn16
2, reboot the node and run rcons
for it
3, wait the node enter shell mode
4, run systemctl stop systemd-journnald.socket
and then exit
, exit
5, wait the node boot up, and try to open /dev/log
, now it is working
[root@mid05tor12cn16 ~]# python
Python 2.7.5 (default, Feb 20 2018, 09:20:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> from logging.handlers import SysLogHandler
>>> h = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_LOCAL4)
>>>
It is required to create a cleanup
hook for dracut.
Modify rsyslog configuration in dracut to load imjournal, and then rsyslog will not close /dev/log.
@hu-weihua could you help to verify it? The fix will impact RHEL7 on both ppc64le and x86_64. Note: only for debugmode
Have been verified by 20180412.0715 build. thanks @robin2008
xCAT build :
Version 2.13.11 (git commit d0ed517ee20e271e05ba6ffd06f4e3deb0c240c0, built Fri Mar 9 07:15:44 EST 2018)
MN OS:
Red Hat Enterprise Linux Server release 7.4 (Maipo)
SN OS:rhels7.5-snap2-ppc64le
For xCAT 20180309 build, after installing diskless SN, the xcatd in SN failed to start.
This issue also happened on
rhels7.4-alt-rc1
xCAT 20180308 build, no matter withxCAT-openbmc-py
or withoutxCAT-openbmc-py
.The related definitions are: