xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Fresh install fails creating boot, log and swap partitions on some servers #149

Closed stormi closed 2 years ago

stormi commented 5 years ago

Update: bug report renamed to reflect the root issue.

We noticed that there are two log rotation schemes for /var/log/xensource.log.

Some servers just have xensource.log and xensource.log.1.gz whereas other servers have xensource.log, xensource.log.1, xensource.log.2.gz, xensource.log.3.gz, etc.
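To tell quickly which scheme a given host follows, one can check whether the first rotated file is compressed. This is only a minimal sketch based on the file names listed above; the function name `rotation_scheme` is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: guess the rotation scheme from the files in a log directory.
# Assumption: "legacy" hosts compress the first rotation (xensource.log.1.gz),
# while "modern" hosts keep it uncompressed (xensource.log.1).
rotation_scheme() {
    local dir="${1:-/var/log}"
    if [ -e "$dir/xensource.log.1" ]; then
        echo modern
    elif [ -e "$dir/xensource.log.1.gz" ]; then
        echo legacy
    else
        echo unknown
    fi
}

rotation_scheme /var/log
```

A host that has not rotated its logs yet reports "unknown", so the check is only meaningful after at least one rotation.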

The first ones (legacy setup) have:

And the others ("modern" setup) have:

What governs the difference is apparently this script: /etc/firstboot.d/95-legacy-logrotate which contains:

#!/bin/bash
##
# Switch to custom logrotate cronjob for legacy partition layout

set -e

. ${XENSOURCE_INVENTORY}

disable_job () {
    [ -e "$1" ] && mv "$1" "${1}"~
}

start() {
    if ! grep -q /var/log /proc/mounts; then
        if [ -e /etc/cron.d/logrotate.cron.rpmsave ]; then
            mv /etc/cron.d/logrotate.cron.rpmsave /etc/cron.d/logrotate.cron
            disable_job /etc/cron.daily/logrotate
            disable_job /etc/cron.d/xapi-logrotate.cron
            sed -ne '\@/var/log/xensource.log@,$ p' /etc/xensource/xapi-logrotate.conf >/etc/logrotate.d/xapi
        fi
    fi
}

case $1 in
    start)  start ;;
esac

It says it switches to the "legacy" setup only if you've got a legacy partition layout, but we found this to be false. A server that has been upgraded from XS 7.4 (or 7.5) to XS 7.6 and then to XCP-ng 7.6 shows the legacy setup.
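The condition the script actually tests is whether /var/log appears in /proc/mounts. A rough sketch of the same check follows, slightly stricter than the `grep` because it requires the mount point field to be exactly /var/log. The function name `has_log_partition` is hypothetical, and the messages only describe what the firstboot script is likely to do, since it also depends on /etc/cron.d/logrotate.cron.rpmsave being present.

```bash
#!/bin/bash
# Report whether a mounts table lists /var/log as its own mount point.
# Argument: path to a file in /proc/mounts format (defaults to /proc/mounts).
has_log_partition() {
    local mounts="${1:-/proc/mounts}"
    # Field 2 of /proc/mounts is the mount point; require an exact match.
    awk '$2 == "/var/log" { found = 1 } END { exit !found }' "$mounts"
}

if has_log_partition; then
    echo "dedicated /var/log: the firstboot script leaves the standard logrotate job alone"
else
    echo "no dedicated /var/log: the firstboot script may switch to the legacy cron job"
fi
```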

We need to find out why and clean up this mess. We don't want two competing logrotate setups.

bplessis commented 5 years ago

In my setup (three pools: one upgraded from an old XenServer 6, one reinstalled around XenServer 7 and then upgraded, and one where the hosts were just reinstalled with XCP-ng 7.6) I got:

logrotate -d /etc/logrotate.conf shows that logrotate wants to rotate most of the files. /etc/logrotate.d/xapi is present only on the latest one.

It says it switches to the "legacy" setup only if you've got a legacy partition layout

From what I can see, the "legacy" setup is one without a dedicated "/var/log" partition, which is the case for all my installs.

stormi commented 5 years ago

I don't think /etc/logrotate.conf is used at all if the custom logrotate cron job is used instead of the standard one.

bplessis commented 5 years ago

/opt/xensource/bin/logrotate-xenserver does weird things indeed, but it does call logrotate with the same "include /etc/logrotate.d" directive. However, there is a manual cleanup pass that could explain the weirdness.

I think there is an installation issue with the most recent XCP-ng 7.6 (or at least with the way I did it) which prevents the creation of the dedicated /var/log partition in some cases.

bplessis commented 5 years ago

Hum, I just found out that my reinstallation with XCP-ng 7.6 did not recreate the partition table using GPT. It did create the two 20 GB root volumes, but using an msdos partition table, maybe because of the fat16 diagnostic partition preinstalled by Dell?
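For anyone who wants to check their own hosts, here is a small sketch that reads the text printed by `parted <disk> print` and reports the partition table type and whether a partition carries the `diag` flag. The function names are hypothetical, and the sample data is a shortened copy of the kind of output discussed in this thread.

```bash
#!/bin/bash
# Hypothetical helpers that read the output of `parted <disk> print` on stdin.

# Print the partition table type (for example "msdos" or "gpt").
table_type() {
    awk -F': *' '/^Partition Table:/ { print $2; exit }'
}

# Succeed if any partition row carries the "diag" flag.
has_diag_partition() {
    awk '$1 ~ /^[0-9]+$/ && / diag(,| *$)/ { found = 1 } END { exit !found }'
}

# Shortened sample in the format parted prints.
sample='Partition Table: msdos
Number  Start   End     Size    Type     File system  Flags
 1      32.3kB  41.1MB  41.1MB  primary  fat16        diag
 2      23.7GB  43.0GB  19.3GB  primary  ext3         boot'

printf '%s\n' "$sample" | table_type
printf '%s\n' "$sample" | has_diag_partition && echo "utility partition present"
```

On a real host the input would come from something like `parted -s /dev/sda print`, run as root.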

stormi commented 5 years ago

Here too, the hosts that use the "legacy" custom logrotate configuration are Dell servers without a dedicated /var/log partition.

stormi commented 5 years ago

Similar hardware + fresh install of XS 7.6 two weeks ago = same issue.

bplessis commented 5 years ago

Does parted also show an "msdos" partition table, with a fat16 "diag" partition?

Example:

Model: DELL PERC H730 Mini (scsi)
Disk /dev/sda: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      32,3kB  41,1MB  41,1MB  primary  fat16        diag
 3      4336MB  23,7GB  19,3GB  primary  ext3
 2      23,7GB  43,0GB  19,3GB  primary  ext3         boot
 4      44,1GB  250GB   205GB   primary               lvm

All my other servers (also Dell) that are OK have no trace of the diag partition; I think I deleted it earlier one way or another:

Model: DELL PERC H310 (scsi)
Disk /dev/sda: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system     Name  Flags
 5      1049kB  4296MB  4295MB  ext3
 2      4296MB  23,6GB  19,3GB  ext3
 1      23,6GB  43,0GB  19,3GB  ext3
 4      43,0GB  43,5GB  537MB                         bios_grub, legacy_boot
 6      43,5GB  44,6GB  1074MB  linux-swap(v1)
 3      44,6GB  250GB   205GB                         lvm

stormi commented 5 years ago

Yes

Model: ATA TOSHIBA DT01ACA1 (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      32.3kB  41.1MB  41.1MB  primary  fat16        diag
 3      4336MB  23.7GB  19.3GB  primary  ext3
 2      23.7GB  43.0GB  19.3GB  primary  ext3         boot
 4      44.1GB  1000GB  956GB   primary               lvm

stormi commented 5 years ago

So there's no swap either apparently.
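One way to confirm what kind of swap a host uses is to read /proc/swaps, which has the same columns as `swapon -s`. The sketch below counts swap areas by type; the function name `swap_types` is made up for illustration.

```bash
#!/bin/bash
# Count swap areas by type from a table in /proc/swaps format
# (the same columns that `swapon -s` prints).
swap_types() {
    local table="${1:-/proc/swaps}"
    # Skip the header line; column 2 is "partition" or "file".
    awk 'NR > 1 { count[$2]++ } END { for (t in count) print t, count[t] }' "$table"
}

if [ -r /proc/swaps ]; then
    swap_types
fi
```

A host with the expected layout should list a "partition" entry, while one with only a swap file lists "file".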

bplessis commented 5 years ago

Yes, here also. Well no swap "partition" but there is a swap file:

[root@xs01 ~]# swapon -s
Filename                Type        Size    Used    Priority
/var/swap/swap.001      file        524284  76920   -1

stormi commented 5 years ago

Maybe we can find useful information in /var/log/installer/install-log

stormi commented 5 years ago

I think I've got the relevant part (this is on a server that already had that broken partition scheme. It looks like it wanted to create a better one but the result was not as expected):

INFO     [2019-02-12 15:53:31] ran ['/sbin/sfdisk', '-LluS', '/dev/sda']; rc 0
STANDARD OUT:

Disk /dev/sda: 121601 cylinders, 255 heads, 63 sectors/track
Units: sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1            63     80324      80262  de  Dell Utility
/dev/sda2   *  46217669  83966404   37748736  83  Linux
/dev/sda3       8468933  46217668   37748736  83  Linux
/dev/sda4      86063557 1953520064 1867456508  8e  Linux LVM

INFO     [2019-02-12 15:53:31] ran ['/sbin/sfdisk', '-Ld', '/dev/sda']; rc 0
STANDARD OUT:
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=       63, size=    80262, Id=de
/dev/sda2 : start= 46217669, size= 37748736, Id=83, bootable
/dev/sda3 : start=  8468933, size= 37748736, Id=83
/dev/sda4 : start= 86063557, size=1867456508, Id=8e

INFO     [2019-02-12 15:53:31] Input to sfdisk:
unit: sectors

/dev/sda1 : start=63, size=80262, Id=de
/dev/sda2 : start=46217669, size=37748736, Id=83
/dev/sda3 : start=8468933, size=37748736, Id=83
/dev/sda4 : start=86063557, size=1867456508, Id=8e
/dev/sda5 : start=0, size=0, Id=0
/dev/sda6 : start=80325, size=8388608, Id=83
/dev/sda7 : start=83966405, size=2097152, Id=82

INFO     [2019-02-12 15:53:31] sfdisk command: /sbin/sfdisk -LuS --no-reread -f /dev/sda
INFO     [2019-02-12 15:53:31] ran ['/sbin/udevadm', 'settle', '--timeout=30']; rc 0
INFO     [2019-02-12 15:53:32] Output from sfdisk:

Disk /dev/sda: 121601 cylinders, 255 heads, 63 sectors/track
Old situation:
Units: sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1            63     80324      80262  de  Dell Utility
/dev/sda2   *  46217669  83966404   37748736  83  Linux
/dev/sda3       8468933  46217668   37748736  83  Linux
/dev/sda4      86063557 1953520064 1867456508  8e  Linux LVM
New situation:
Units: sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1            63     80324      80262  de  Dell Utility
/dev/sda2      46217669  83966404   37748736  83  Linux
/dev/sda3       8468933  46217668   37748736  83  Linux
/dev/sda4      86063557 1953520064 1867456508  8e  Linux LVM
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

stormi commented 5 years ago

Very suspicious: /dev/sda5 : start=0, size=0, Id=0, don't you think?
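Entries like that are easy to spot mechanically. The following is a minimal sketch that reads an sfdisk-style dump and prints every device whose entry has size=0. The function name `empty_entries` is hypothetical, and the input is the "Input to sfdisk" block from the installer log above.

```bash
#!/bin/bash
# Read an sfdisk-style dump on stdin and print the devices whose entry has size=0.
empty_entries() {
    awk -F'[:,]' '/^\/dev\// {
        dev = $1; gsub(/[ \t]/, "", dev)
        for (i = 2; i <= NF; i++) {
            field = $i; gsub(/[ \t]/, "", field)
            if (field == "size=0") print dev
        }
    }'
}

empty_entries <<'EOF'
unit: sectors

/dev/sda1 : start=63, size=80262, Id=de
/dev/sda2 : start=46217669, size=37748736, Id=83
/dev/sda3 : start=8468933, size=37748736, Id=83
/dev/sda4 : start=86063557, size=1867456508, Id=8e
/dev/sda5 : start=0, size=0, Id=0
/dev/sda6 : start=80325, size=8388608, Id=83
/dev/sda7 : start=83966405, size=2097152, Id=82
EOF
```

With this input only /dev/sda5 is reported.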

bplessis commented 5 years ago

that is the least we can say XD

stormi commented 5 years ago

According to the logs that empty partition is meant to be the boot partition.

stormi commented 5 years ago

Bug reported to XenServer team too: https://bugs.xenserver.org/browse/XSO-940

stormi commented 5 years ago

And... according to Citrix, this is... by design. I don't know what to think.

stormi commented 5 years ago

So, apparently the reason is that removing the Dell partition voids the warranty, and we can't safely use extended partitions on DOS-style partitions in XS/XCP-ng, so we're limited to 4 partitions.

That input to sfdisk with 7 partitions and output with only 4 is still kind of troubling, but it does work in the end if that's the expected result.
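To make the gap concrete, this sketch lists the non-empty entries numbered above 4 in an sfdisk-style dump, with sizes converted from 512-byte sectors to MiB. The function name `entries_beyond_primary` is hypothetical. Reading /dev/sda6 and /dev/sda7 as the intended log and swap partitions is an inference from their sizes (4096 MiB and 1024 MiB, matching the 4295MB and 1074MB partitions of the GPT layout posted earlier), not something the installer log states.

```bash
#!/bin/bash
# Read an sfdisk-style dump on stdin (sizes in 512-byte sectors) and list the
# non-empty entries numbered above 4, with their size in MiB.
entries_beyond_primary() {
    awk -F'[:,]' '/^\/dev\// {
        dev = $1; gsub(/[ \t]/, "", dev)
        num = dev; sub(/^.*[^0-9]/, "", num)   # keep the trailing partition number
        size = 0
        for (i = 2; i <= NF; i++) {
            field = $i; gsub(/[ \t]/, "", field)
            if (field ~ /^size=/) { sub(/^size=/, "", field); size = field + 0 }
        }
        if (num + 0 > 4 && size > 0)
            printf "%s %d MiB\n", dev, size * 512 / 1048576
    }'
}

entries_beyond_primary <<'EOF'
/dev/sda1 : start=63, size=80262, Id=de
/dev/sda2 : start=46217669, size=37748736, Id=83
/dev/sda3 : start=8468933, size=37748736, Id=83
/dev/sda4 : start=86063557, size=1867456508, Id=8e
/dev/sda5 : start=0, size=0, Id=0
/dev/sda6 : start=80325, size=8388608, Id=83
/dev/sda7 : start=83966405, size=2097152, Id=82
EOF
```

These are the entries that are requested in the input but never show up in the "New situation" table.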

That in turn triggers the custom logrotate configuration, which is safer when there's no log partition, but it's too bad that we have to keep it. I would have preferred to keep only one supported configuration for that.

I think the installer should warn when it meets such a situation and let you wipe the utility partition altogether if you so wish, rather than decide for you.

bplessis commented 5 years ago

And... according to Citrix, this is... by design. I don't know what to think.

Well, I'm not very surprised ....

removing the Dell partition voids the warranty

that's a new one, the same tools that are on the utility partition are available on a downloadable CD on the website, and there is/was even a way to rebuild it from the integrated server management system.

we can't safely use extended partitions on DOS-style partitions in XS/XCP-ng, so we're limited to 4 partitions.

I'm tempted to ask why. And also, haven't they heard of a 'new' disk handling system called LVM?

stormi commented 5 years ago

Turning this into an enhancement request: have the installer offer to remove the Dell partition and use a proper layout rather than this silent and surprising half-partitioning.

jradmacher commented 5 years ago

removing the Dell partition voids the warranty

that's a new one, the same tools that are on the utility partition are available on a downloadable CD on the website, and there is/was even a way to rebuild it from the integrated server management system.

TL;DR: If your server isn't ancient, you can press F10 at boot for those tools and more.

The tools have been integrated into the Lifecycle Controller, which is part of iDrac now. Even the low budget servers like a T140 have this today. Starting with the Rx20 (R520,620,etc.) Series iDrac is soldered on and "iDrac Basic" or higher is included. Remote KVM requires a license, everything else does not. So, why did you need this partition? Dell requires hardware logs to process your warranty request, which are generated by running those tools. (or for some errors automatically by iDrac.)

rjt commented 5 years ago

This would not be the first time this came up in the last twenty years. How do other installers work around this?

Seems heavy handed to delete a partition that someone took the time to install. Will a warning be popped up?

Would deleting the utility partition on an older machine delete the license keys? Break remote access?

rjt commented 5 years ago

Could the installer move the partition?

borzel commented 5 years ago

Hmmm... if I own or rent a server, I want to use it the way I need. Why should deleting a Dell utility partition void any warranty? If the hard drive is replaced, you no longer have this partition either. So no big deal here.

Your remote access is not driven from that partition. So no problem.

License keys -> can you explain?

jradmacher commented 5 years ago

Your remote access is not driven from that partition. So no problem.

The important part for this discussion is that hardware diagnostics are driven by iDrac. You as a user do not lose any functionality if the partition is deleted. So if the user specifies "clean install" and acknowledges that this deletes "all data", the installer should really wipe the HD.

License keys -> can you explain?

In the "old" days iDrac was an expansion card, which evolved to a very small module. This card "included" the license. This was last used in the Rx10 series. Anything newer has this "module" soldered to the mainboard. Hardware tests are included in this module and are free. Only remote access requires an iDrac Enterprise license.

borzel commented 5 years ago

I think there is a consensus on wiping all the data on the disk.

stormi commented 2 years ago

I haven't heard that the issue still happens on a fresh install of XCP-ng 8.2, so I'm closing this issue.

Feel free to reopen if you reproduce.