xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

When physical hard drive is bad, anaconda fails in bare metal install but no good reason passed up. #5509

Open whowutwut opened 6 years ago

whowutwut commented 6 years ago

During installation of RHEL on a Power9 server, we hit this problem:

    raise ValueError("name already in use")
  File "/usr/lib64/python2.7/site-packages/pyanaconda/kickstart.py", line 1961, in execute
    peSize=pesize)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/kickstart.py", line 1896, in execute
    v.execute(storage, ksdata, instClass)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/kickstart.py", line 2486, in doKickstartStorage
    ksdata.volgroup.execute(storage, ksdata, instClass)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/ui/tui/spokes/storage.py", line 419, in execute
    doKickstartStorage(self.storage, self.data, self.instclass)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/ui/tui/hubs/summary.py", line 64, in setup
    spoke.execute()
  File "/usr/lib64/python2.7/site-packages/pyanaconda/ui/tui/__init__.py", line 171, in setup
    should_schedule = obj.setup(self.ENVIRONMENT)
  File "/sbin/anaconda", line 1370, in <module>
    anaconda._intf.setup(ksdata)
ValueError: name already in use

What do you want to do now?
1) Report Bug
2) Debug
3) Quit

Please make your choice from above:

After some investigation, we determined that the drive targeted for installation had a problem. Once we removed that drive, the install succeeded on the second drive in the server.

So while this is outside of xCAT, we do have disk selection logic. I'm not sure whether there's a way to detect an error on the disk ahead of time and avoid installing onto the bad disk, or at least bubble up a better message. Just seeing what we could do to avoid the downtime spent debugging this kind of issue.
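One way to picture the "detect an error ahead of time" idea is a read probe run before a disk is chosen. The sketch below is only an illustration: `probe_disk` and `pick_install_disk` are hypothetical names, not part of xCAT or anaconda, and the probe only catches drives that fail on open or on reading their first megabyte. The thread does not say what was actually wrong with the drive, so this check may not have caught it.

```python
import os


def probe_disk(path, nbytes=1024 * 1024, chunk=64 * 1024):
    """Try to read the first nbytes of path; return (ok, detail).

    Hypothetical helper: it only detects open/read failures, not every
    kind of disk problem.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return False, "open failed: %s" % exc
    read_total = 0
    try:
        while read_total < nbytes:
            data = os.read(fd, min(chunk, nbytes - read_total))
            if not data:              # end of device or file reached early
                break
            read_total += len(data)
    except OSError as exc:            # for example EIO from a failing drive
        return False, "read failed after %d bytes: %s" % (read_total, exc)
    finally:
        os.close(fd)
    if read_total == 0:
        return False, "no data readable"
    return True, "read %d bytes" % read_total


def pick_install_disk(candidates):
    """Return the first candidate that passes the probe, or None."""
    for path in candidates:
        ok, detail = probe_disk(path)
        print("%s: %s" % (path, detail))   # a message an admin could act on
        if ok:
            return path
    return None
```

Called from something like a %pre script with the candidate device paths in preference order, this would skip a drive that cannot be read and print why, instead of leaving the failure to surface later inside the installer.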

zet809 commented 6 years ago

Hi @immarvin, it seems issue #4876 has been reproduced. Please help take a look. Thanks!

immarvin commented 6 years ago

Hi @whowutwut, this is the kind of issue that xcatdebugmode was meant to address. For Red Hat 7.x, we set "inst.loglevel=debug inst.syslog=" to tell anaconda to forward all debug-level installation logs to the syslog server named in inst.syslog. Can you find any instructive messages in /var/log/xcat/computes.log on the MN? If so, maybe we can consider whether it makes sense to take xcatdebug=1/2 as the default value, to avoid a second provision just to debug with xcatdebugmode.

Since xCAT leverages the installer shipped with the OS (anaconda for Red Hat) to perform diskful provisioning, all customization is applied through the hooks (%pre, %post, ...) and configurable sections (%packages, ...) that the installer exposes. I do not think we can take over the installer to bubble up a customized message on the console.

Cobbler's solution is to run a background Python daemon named Anamon that uploads the installer's log files (anaconda.log boot.log dmesg install.log ks.cfg lvmout.log messages sys.log) to the management node, so the admin can look through these files to find the real cause.
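The core of that approach, following a set of log files and shipping whatever is new, can be sketched in a few lines. This is not Anamon's code: `LogFollower` and the `send` callback are hypothetical names, and the transport to the management node is deliberately left to the caller.

```python
import os


class LogFollower(object):
    """Track read offsets for a set of files and pass new data to a callback."""

    def __init__(self, paths, send):
        self.offsets = dict((path, 0) for path in paths)
        self.send = send              # hypothetical uploader: send(path, data)

    def poll_once(self):
        """Send whatever each file has gained since the previous poll."""
        for path in sorted(self.offsets):
            try:
                size = os.path.getsize(path)
            except OSError:
                continue              # file does not exist yet
            offset = self.offsets[path]
            if size < offset:         # file was truncated or replaced
                offset = 0
            if size > offset:
                with open(path, "rb") as handle:
                    handle.seek(offset)
                    data = handle.read()
                offset += len(data)
                self.send(path, data)
            self.offsets[path] = offset
```

A daemon would call `poll_once()` in a loop with a short sleep between polls, with `send` posting the data to whatever collector the management node exposes.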

immarvin commented 6 years ago

Cobbler's Anamon documentation: http://cobbler.github.io/manuals/2.8.0/Appendix/E_-_Anaconda_Monitoring.html

robin2008 commented 6 years ago

I think xCAT could start a monitoring process in the initrd too. It could monitor the status of the whole anaconda run and report a message on any error.

robin2008 commented 6 years ago

But how would anaconda determine that this is a failed disk, and how would xCAT then get that information from anaconda?

cxhong commented 5 years ago

The problem was recreated on mid08tor03cn26 on boston02. I changed xcatdebugmode to 2, and /var/log/xcat/computes.log keeps looping through these messages:

Nov 29 11:24:08 mid08tor03cn26 systemd: INFO  anaconda-shell@hvc1.service has no holdoff time, scheduling restart.
Nov 29 11:24:08 mid08tor03cn26 systemd: WARNING  Cannot add dependency job for unit rhel-autorelabel.service, ignoring: Unit is masked.
Nov 29 11:24:08 mid08tor03cn26 systemd: WARNING  Cannot add dependency job for unit rhel-loadmodules.service, ignoring: Unit is masked.
Nov 29 11:24:08 mid08tor03cn26 systemd: WARNING  Cannot add dependency job for unit systemd-tmpfiles-clean.timer, ignoring: Unit is masked.
Nov 29 11:24:08 mid08tor03cn26 systemd: INFO  Started Shell on hvc1.
Nov 29 11:24:08 mid08tor03cn26 systemd: INFO  Starting Shell on hvc1...
Nov 29 11:24:08 mid08tor03cn26 agetty[4462]: ERR  /dev/hvc1: cannot open as standard input: No such device

I will save this log to /var/log/xcat/computes.log.cn26
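A loop like this is easier to spot when repeated lines are counted instead of read one by one. The sketch below assumes the line layout seen in the excerpt above (timestamp, node name, program with an optional PID, then the message); `repeated_messages` is a hypothetical helper, not an xCAT tool.

```python
import re
from collections import Counter

LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+"    # timestamp, e.g. "Nov 29 11:24:08"
    r"(?P<host>\S+)\s+"                        # node name
    r"(?P<prog>[^\s:\[]+)(?:\[\d+\])?:\s+"     # program, optional [pid]
    r"(?P<msg>.*)$"                            # severity tag and message text
)


def repeated_messages(lines, node, min_count=3):
    """Return [((program, message), count)] for messages from node that
    occur at least min_count times, most frequent first.

    Timestamps and PIDs are ignored so restarts of the same unit collapse
    into one entry.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("host") == node:
            counts[(match.group("prog"), match.group("msg"))] += 1
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Run over the lines of computes.log for one node, this would surface the agetty error on /dev/hvc1 and the systemd restart messages as the repeating entries.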