Open whowutwut opened 6 years ago
Hi @immarvin, it seems issue #4876 has been reproduced. Please help take a look. Thanks!
Hi @whowutwut, this kind of issue is what xcatdebugmode was meant to solve. For Red Hat 7.x, we will set "inst.loglevel=debug inst.syslog=/var/log/xcat/computes.log
on the MN? If yes, maybe we can consider whether it makes sense to take xcatdebug=1/2 as the default value, to avoid needing another provision just to debug with xcatdebugmode.
Since xCAT leverages the OS-shipped installer (anaconda for Red Hat) to perform diskful provisioning, all customizations are applied via the hooks (%pre, %post, ...) and configurable sections (%packages, ...) exposed by the installer. I do not think we can take over the installer to bubble up customized messages on the console.
Cobbler's solution is to run a background Python daemon named Anamon inside the installer that uploads the installation log files (anaconda.log boot.log dmesg install.log ks.cfg lvmout.log messages sys.log) to the management node. The admin can then look through these files to find the real cause.
Cobbler's Anamon link: http://cobbler.github.io/manuals/2.8.0/Appendix/E_-_Anaconda_Monitoring.html
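To make the idea concrete, here is a minimal sketch of the polling half of such a daemon: it remembers how far it has read in each watched log file and hands only the newly appended bytes to a callback. This is not Anamon's actual code and not anything xCAT ships. The names read_new_data, poll_once and send are hypothetical, and the upload to the management node is left to whatever send does.

```python
import os


def read_new_data(path, offset):
    """Return (new_bytes, new_offset) for data appended to path since offset."""
    try:
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            if size < offset:
                offset = 0  # file was truncated or replaced; start over
            f.seek(offset)
            data = f.read()
    except OSError:
        # File does not exist yet or cannot be read; try again on the next poll.
        return b"", offset
    return data, offset + len(data)


def poll_once(paths, offsets, send):
    """Check every watched file once and pass any new data to send(path, data).

    offsets is a dict mapping path -> bytes already consumed; it is updated
    in place. send is a hypothetical callback that would do the upload.
    """
    for path in paths:
        data, offsets[path] = read_new_data(path, offsets.get(path, 0))
        if data:
            send(path, data)
```

A real daemon would call poll_once in a loop with a short sleep, and send would forward each chunk to the management node.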
I think xCAT can start a monitoring process in the initrd too. It can monitor the status of the whole anaconda run and report a message for any error.
But how would anaconda determine that this is a failed disk, and how would xCAT then get that information from anaconda?
The problem is recreated on mid08tor03cn26 on boston02. I changed xcatdebugmode to 2, and /var/log/xcat/computes.log keeps looping this message:
Nov 29 11:24:08 mid08tor03cn26 systemd: INFO anaconda-shell@hvc1.service has no holdoff time, scheduling restart.
Nov 29 11:24:08 mid08tor03cn26 systemd: WARNING Cannot add dependency job for unit rhel-autorelabel.service, ignoring: Unit is masked.
Nov 29 11:24:08 mid08tor03cn26 systemd: WARNING Cannot add dependency job for unit rhel-loadmodules.service, ignoring: Unit is masked.
Nov 29 11:24:08 mid08tor03cn26 systemd: WARNING Cannot add dependency job for unit systemd-tmpfiles-clean.timer, ignoring: Unit is masked.
Nov 29 11:24:08 mid08tor03cn26 systemd: INFO Started Shell on hvc1.
Nov 29 11:24:08 mid08tor03cn26 systemd: INFO Starting Shell on hvc1...
Nov 29 11:24:08 mid08tor03cn26 agetty[4462]: ERR /dev/hvc1: cannot open as standard input: No such device
I will save this log to /var/log/xcat/computes.log.cn26
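A loop like this could in principle be flagged automatically on the MN. The sketch below is only an illustration, assuming the syslog line layout shown above (timestamp, host, program with optional PID, message): it drops the timestamp and PID and counts how often each message repeats. The function name repeated_messages and the threshold value are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   Nov 29 11:24:08 mid08tor03cn26 agetty[4462]: ERR /dev/hvc1: cannot open ...
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s+"
    r"(?P<prog>[^\s:\[]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def repeated_messages(lines, threshold=3):
    """Return {(host, program, message): count} for messages seen >= threshold times.

    Timestamps and PIDs are ignored so that a restart loop, which produces the
    same message with a new PID each time, collapses to a single key.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m.group("host"), m.group("prog"), m.group("msg"))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

Feeding it the lines of computes.log would surface the repeated agetty error on /dev/hvc1. It says nothing about why the loop started, only that the node is stuck.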
Found that during installation of RHEL on a Power9 server, we hit this problem:
After some investigation, it was determined that the drive targeted for installation had some problem. When we removed that drive, the install completed successfully on the 2nd drive in the server.
So while this is outside of xCAT, we do have disk selection logic. I'm not sure if there's a way to detect an error on the disk ahead of time and avoid installing onto that bad disk, or even bubble up a better message? Just seeing what we could do to avoid this downtime when debugging this kind of issue.
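One cheap check that disk selection could run first is a read probe: try to read the beginning of each candidate device and skip any that returns an I/O error. The sketch below is an assumption-laden illustration, not xCAT's disk selection logic. The thread does not say what was actually wrong with the drive, and a read probe misses many failure modes (write errors, bad sectors further into the disk). The names probe_readable and first_readable are hypothetical.

```python
def probe_readable(device, nbytes=1024 * 1024, chunk=64 * 1024):
    """Try to read the first nbytes of device; return (ok, detail)."""
    read = 0
    try:
        # buffering=0 gives unbuffered reads through Python; the kernel page
        # cache may still serve the data.
        with open(device, "rb", buffering=0) as f:
            while read < nbytes:
                data = f.read(min(chunk, nbytes - read))
                if not data:
                    break  # device or file is shorter than nbytes
                read += len(data)
    except OSError as e:
        return False, f"{device}: {e.strerror} after {read} bytes"
    return True, f"{device}: read {read} bytes without error"


def first_readable(devices):
    """Return the first device in the list that passes the read probe, else None."""
    for dev in devices:
        ok, _detail = probe_readable(dev)
        if ok:
            return dev
    return None
```

The detail string from probe_readable is the kind of message that could be sent back to the MN, so the admin sees which disk was skipped and why.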