xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Install RAID Creation Fails: sgdisk Failing Successfully #638

Open lethedata opened 9 months ago

lethedata commented 9 months ago

During RAID creation in the installer there were no failure messages, but the RAID device never appeared and the selected disks were still listed as individual disks. Dropping to a console, I was able to create the RAID manually, but after wiping it, rebooting, and letting the installer handle it, installation still failed.
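
For anyone hitting the same symptoms, checks along these lines from the installer console should show it (assuming the array was meant to be /dev/md0; names are examples):

    cat /proc/mdstat         # no array listed even though creation "succeeded"
    lsblk                    # member disks still show up individually
    mdadm --detail /dev/md0  # errors out if the array was never created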

Looking at the host-installer, specifically create_raid in diskutil.py, I ran each command manually. Doing this I was able to see that sgdisk --zap-all was "failing successfully": the program completed without reporting an error, but it didn't properly wipe the GPT tables, so the system was auto-recovering them. I'm not exactly clear why this impacted the rest of the RAID creation, but after manually wiping with gdisk's zap I had no further issues.
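
To make the "failing successfully" visible, something like this should work (a sketch; /dev/sdX is the disk, and the GPT signature is the ASCII string "EFI PART" in the header at LBA 1 and again in the backup header on the last sector):

    sgdisk --zap-all /dev/sdX ; echo "exit: $?"   # exits 0, looks fine
    # the primary GPT header at LBA 1 should be gone after a real wipe
    dd if=/dev/sdX bs=512 skip=1 count=1 2>/dev/null | strings | grep "EFI PART"
    # the backup GPT header lives on the last 512-byte sector
    LAST=$(( $(blockdev --getsz /dev/sdX) - 1 ))
    dd if=/dev/sdX bs=512 skip=$LAST count=1 2>/dev/null | strings | grep "EFI PART"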

Looking online I found Ubuntu gdisk Bug 1303903, which suggests this can be caused by how sgdisk handles disks it incorrectly detects as MBR. Following their workaround, adding the --mbrtogpt --clear flags might prevent this from happening. I was unable to reproduce the problem after my gdisk wipe, so I can't verify the fix.
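
For reference, the workaround from that bug report would look something like this (untested on my side; /dev/sdX is the disk to wipe):

    # force any misdetected MBR to GPT and clear the partition data before zapping
    sgdisk --mbrtogpt --clear /dev/sdX
    sgdisk --zap-all /dev/sdX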

olivierlambert commented 9 months ago

Hi! Thanks for the report.

@ydirson is it related to the PR we made a while ago that Citrix/XS never wanted to merge?

ydirson commented 9 months ago

@olivierlambert Could be. This failing command is in our RAID-creation code that XS will not merge, more specifically in https://github.com/xcp-ng/host-installer/pull/7, which added the sgdisk --zap-all call. It could be that https://github.com/xenserver/host-installer/pull/38 would help, notably by stopping the OS from auto-assembling pre-existing RAID volumes.
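
For context, stopping auto-assembly by hand looks roughly like this (not necessarily how that PR does it; the array and member names are examples):

    mdadm --stop /dev/md127                      # stop the auto-assembled array
    mdadm --zero-superblock /dev/sdX /dev/sdY    # erase RAID metadata on the members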

@lethedata I'm interested in the logs for this "successful failing" here if you still have them, maybe we can improve the behavior here.

lethedata commented 9 months ago

> @lethedata I'm interested in the logs for this "successful failing" here if you still have them, maybe we can improve the behavior here.

@ydirson Unfortunately I didn't think to grab any output until after I fixed things. For future reference, where does the installer ISO write its logs when the installation doesn't complete?

I messed around trying to reproduce the error but was only able to get sgdisk to push a GPT restore once. The issue is that I don't know what was written in the original backup GPT sectors that led sgdisk to constantly detect MBR after the restore. The process below (a consolidated script follows the list) also doesn't seem to impact mdadm through the installer.

  1. Clean the disk completely
  2. Create an MBR disk: fdisk /dev/DISK (options: o, n, p, default, default, w)
  3. Back up the MBR: dd if=/dev/DISK of=/PATH/MBR.backup bs=512 count=1
  4. Create a GPT table: fdisk /dev/DISK (options: g, w)
  5. Delete the front GPT: dd if=/dev/zero of=/dev/DISK bs=512 count=34
  6. Restore the MBR: dd if=/PATH/MBR.backup of=/dev/DISK bs=512 count=1
  7. Zap the drive: sgdisk --zap-all /dev/DISK
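
The same steps as a single destructive script, in case it helps someone reproduce (assumes /dev/sdX is a scratch disk; the fdisk prompt sequence may vary between versions, so double-check it interactively first):

    DISK=/dev/sdX
    wipefs --all "$DISK"                                # 1. one way to clean the disk
    printf 'o\nn\np\n\n\n\nw\n' | fdisk "$DISK"         # 2. MBR label + one primary partition
    dd if="$DISK" of=/root/MBR.backup bs=512 count=1    # 3. back up the MBR
    printf 'g\nw\n' | fdisk "$DISK"                     # 4. replace it with a GPT
    dd if=/dev/zero of="$DISK" bs=512 count=34          # 5. delete the front GPT
    dd if=/root/MBR.backup of="$DISK" bs=512 count=1    # 6. restore the old MBR
    sgdisk --zap-all "$DISK"                            # 7. zap and see what survives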

My hunch is that whatever was written at the back of the drive was a "perfect sequence" that led sgdisk not to wipe it and mdadm to fail, but that's only a guess; no matter what I tried I couldn't reproduce it. Now I know that when it comes to odd disk issues it's a good idea to at least back up the partition tables before wiping things.
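
Backing the tables up first is cheap; something like this (assuming /dev/sdX):

    sgdisk --backup=/root/sdX-gpt.bak /dev/sdX            # saves GPT headers + protective MBR
    dd if=/dev/sdX of=/root/sdX-front.img bs=512 count=34 # raw copy of the front sectors
    # restore later with: sgdisk --load-backup=/root/sdX-gpt.bak /dev/sdX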

ydirson commented 9 months ago

> For future reference, where does the installer ISO write its logs when the installation doesn't complete?

During the installation it logs essentially to /tmp/install-log. Afterwards you'll find the installer logs on the installed host in /var/log/installer/.
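
So in practice, from the installer shell (e.g. on another virtual terminal):

    less /tmp/install-log    # while the installer is still running
    # after a completed install, on the host itself:
    ls /var/log/installer/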