xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

`xcatd` service start failure on SN #6711

Open kcgthb opened 4 years ago

kcgthb commented 4 years ago

Hi,

I noticed that on freshly deployed Service Nodes, the xcatd service often fails to start with the following errors:

# journalctl -u xcatd
-- Logs begin at Thu 2020-05-14 15:03:57 PDT, end at Thu 2020-05-14 16:08:33 PDT. --
May 14 15:04:21 sh03-sn03.SUNet systemd[1]: Starting xCAT management service...
May 14 15:04:22 sh03-sn03.SUNet xcat[14853]: xcatd is going to start...
May 14 15:04:22 sh03-sn03.SUNet xcat[15277]: Error loading module /opt/xcat/lib/perl/xCAT_plugin/goconserver.pm  ...skipping
May 14 15:04:22 sh03-sn03.SUNet xcatd[14834]: Starting xcatd Error loading module /opt/xcat/lib/perl/xCAT_plugin/goconserver.pm  ...skipping
May 14 15:04:30 sh03-sn03.SUNet xcatd[14834]: Attempt to reload xCAT/Goconserver.pm aborted.
May 14 15:04:30 sh03-sn03.SUNet xcatd[14834]: Compilation failed in require at /opt/xcat/lib/perl/xCAT_plugin/AAsn.pm line 560.
May 14 15:04:30 sh03-sn03.SUNet xcatd[14834]: Magic number checking on storable file failed at /usr/lib64/perl5/vendor_perl/Storable.pm line 399, at /opt/xcat/sbin/xcatd line 977.
May 14 15:04:30 sh03-sn03.SUNet xcatd[14834]: [FAILED]
May 14 15:04:30 sh03-sn03.SUNet systemd[1]: PID file /var/run/xcatd.pid not readable (yet?) after start.

The Service Nodes are running CentOS 7, and the xCAT version is:

# rpm -q xCATsn
xCATsn-2.15.1-snap202003041643.x86_64

What would be the best way to debug this? Thanks!

cxhong commented 4 years ago

Is there no problem running xcatd on the MN?
Can you run

perl -c /opt/xcat/lib/perl/xCAT_plugin/goconserver.pm

Both the MN and the SN should have the same version of xcatd, right?

rpm -qa | grep xCAT
rpm -qa | grep gocon

kcgthb commented 4 years ago

Hi @cxhong

No problem running xcatd on the MN. And even more interestingly, restarting the xcatd service on the SN after the initial failure usually works. So it's just the first start, when the SN boots, that fails.

On the SN:

[root@sh03-sn03 ~]# perl -c /opt/xcat/lib/perl/xCAT_plugin/goconserver.pm
/opt/xcat/lib/perl/xCAT_plugin/goconserver.pm syntax OK

[root@sh03-sn03 ~]# rpm -qa | grep xCAT | sort
perl-xCAT-2.15.1-snap202003041643.noarch
xCAT-client-2.15.1-snap202003041643.noarch
xCAT-genesis-base-ppc64-2.14.5-snap201811160710.noarch
xCAT-genesis-base-x86_64-2.14.5-snap201811190037.noarch
xCAT-genesis-scripts-ppc64-2.15.1-snap202003041643.noarch
xCAT-genesis-scripts-x86_64-2.15.1-snap202003041643.noarch
xCAT-probe-2.15.1-snap202003041643.noarch
xCAT-server-2.15.1-snap202003041643.noarch
xCATsn-2.15.1-snap202003041643.x86_64

[root@sh03-sn03 ~]# rpm -qa | grep gocon
goconserver-0.3.2-snap201909171103.x86_64

On the MN:

[root@sh02-hn01 ~]# rpm -qa | grep xCAT | sort
perl-xCAT-2.15.1-snap202003041643.noarch
xCAT-2.15.1-snap202003041643.x86_64
xCAT-buildkit-2.15.1-snap202003041643.noarch
xCAT-client-2.15.1-snap202003041643.noarch
xCAT-genesis-base-ppc64-2.14.5-snap201811160710.noarch
xCAT-genesis-base-x86_64-2.14.5-snap201811190037.noarch
xCAT-genesis-scripts-ppc64-2.15.1-snap202003041643.noarch
xCAT-genesis-scripts-x86_64-2.15.1-snap202003041643.noarch
xCAT-probe-2.15.1-snap202003041643.noarch
xCAT-server-2.15.1-snap202003041643.noarch

[root@sh02-hn01 ~]# rpm -qa | grep gocons
goconserver-0.3.2-snap201909171103.x86_64

So yes, same versions everywhere.

Thanks!

cxhong commented 4 years ago

Is goconserver running after you restart xcatd on the SN? Did you define the setupconserver attribute for the SN? Can you run lsdef on the SN's osimage?

kcgthb commented 4 years ago

> is goconserver running after restart xcatd on SN?

Yes, and it's enabled to start at boot:

[root@sh03-sn03 ~]# systemctl status goconserver
Redirecting to /bin/systemctl status goconserver.service
● goconserver.service - goconserver console daemon
   Loaded: loaded (/usr/lib/systemd/system/goconserver.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-05-14 16:22:11 PDT; 18h ago
     Docs: https://github.com/xcat2/goconserver
 Main PID: 20664 (goconserver)
   CGroup: /system.slice/goconserver.service
           └─20664 /usr/bin/goconserver

Maybe the xcatd service needs a dependency on goconserver?

> did u define setupconserver attribute for SN?

Yes:

# lsdef sh03-sn03 | grep " setup"
    setupconserver=2
    setupdhcp=1
    setupipforward=1
    setupnameserver=2
    setupnfs=1
    setupntp=1
    setuptftp=1

> can u lsdef the SN's osimage?

[root@sh03-sn03 ~]# lsdef -c sh03-sn03 -i provmethod
sh03-sn03: provmethod=sherlock-service

[root@sh03-sn03 ~]# lsdef -t osimage sherlock-service
Object name: sherlock-service
    addkcmdline=
    description=Sherlock OS image for service nodes
    groups=sherlock
    imagetype=linux
    netdrivers=allupdate
    osarch=x86_64
    osdistroname=centos7.6-x86_64
    osname=Linux
    osupdatename=osupdate
    osvers=centos7.6
    otherpkgdir=/install/repos
    otherpkglist=/install/custom/sherlock/lists/svc/service/otherpkglist
    partitionfile=s:/install/custom/sherlock/part_data.sh
    pkgdir=/install/centos7.6/x86_64,/install/updates
    pkglist=/install/custom/sherlock/lists/svc/service/pkglist
    profile=service
    provmethod=install
    synclists=/install/custom/sherlock/lists/_common/synclist
    template=/install/custom/sherlock/sherlock.centos7.tmpl

Thanks!

cxhong commented 4 years ago

Maybe add the goconserver package to the otherpkglist, as is done for RHEL 8, although it should already be pulled in when xCATsn is installed.

# cat /opt/xcat/share/xcat/install/rh/service.rhels8.x86_64.otherpkgs.pkglist
xcat/xcat-core/xCATsn
xcat/xcat-dep/rh8/x86_64/goconserver

kcgthb commented 4 years ago

The goconserver package is correctly installed on the SN and present when it boots.

The issue is that the first xcatd start fails. Subsequent starts of xcatd succeed.

cxhong commented 4 years ago

I couldn't reproduce this issue. Maybe xcatd is starting before the goconserver module has finished loading. Try reinstalling the service node, then run xcatprobe osdeploy -n <servicenode name> and check whether the output shows anything unusual.
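
One way to check this timing hypothesis might be to view the journal entries of both units for the current boot, interleaved, and pick out the failure lines. The sketch below is only an illustration: `xcatd_start_errors` is a hypothetical helper name, and the patterns are simply the three error messages quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print only the
# xcatd start-failure lines that were reported in this thread.
xcatd_start_errors() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'Error loading module|Compilation failed in require|Magic number checking on storable file' || true
}

# Intended use on a service node (needs systemd, so it is not run here):
#   journalctl -b -u goconserver -u xcatd -o short-precise | xcatd_start_errors
```

Running the `journalctl` command without the filter shows where the goconserver start messages fall relative to these errors.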

kcgthb commented 4 years ago

Adding a service dependency on goconserver seems to help:

# systemctl status xcatd
● xcatd.service - xCAT management service
   Loaded: loaded (/usr/lib/systemd/system/xcatd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/xcatd.service.d
           └─deps.conf
[...]
# cat /etc/systemd/system/xcatd.service.d/deps.conf
[Unit]
After=goconserver.service

I don't know if that's entirely sufficient, but it seems to help in the few tries I did.
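
For anyone wanting to try the same workaround, the drop-in shown above could be created roughly like this. This is a sketch, not an official xCAT procedure: the function name and its optional directory argument are hypothetical, added so it can be tried in a scratch directory first. Note that `After=` only orders the two units; it does not start goconserver, which here is already enabled at boot. It also waits only until systemd considers goconserver started, which, depending on the unit's `Type=`, may be before it is fully ready.

```shell
#!/bin/sh
# Hypothetical helper: write the ordering drop-in from this thread.
# $1 (optional) overrides the drop-in directory, e.g. for a dry run.
write_xcatd_deps() {
    dropin_dir="${1:-/etc/systemd/system/xcatd.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # Same two lines as the deps.conf shown above.
    printf '[Unit]\nAfter=goconserver.service\n' > "$dropin_dir/deps.conf"
}

# On a real service node, as root:
#   write_xcatd_deps && systemctl daemon-reload
```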

cxhong commented 4 years ago

@kcgthb, thanks, that's a good finding. We may need to find a way to start goconserver first when setupconserver=2, and then start xcatd on the service node.
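
As a rough illustration of the condition described here, a script could check a node's setupconserver value before deciding whether the ordering matters. This is only a sketch: `uses_goconserver` is a hypothetical helper, it assumes the `node: attr=value` output format that `lsdef -c ... -i ...` produced earlier in this thread, and it assumes, as the thread implies, that the value 2 selects goconserver.

```shell
#!/bin/sh
# Hypothetical helper: read `lsdef -c <node> -i setupconserver` output on
# stdin and succeed only if the attribute value is exactly 2.
uses_goconserver() {
    grep -Eq '(^|[[:space:]])setupconserver=2[[:space:]]*$'
}

# Usage sketch wherever lsdef is available (node name taken from this thread):
#   if lsdef -c sh03-sn03 -i setupconserver | uses_goconserver; then
#       echo "xcatd should be ordered after goconserver on this SN"
#   fi
```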