xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Ubuntu 18.04.3 diskful installation hangs #6431

Open MasterGroosha opened 4 years ago

MasterGroosha commented 4 years ago

Hello. I'm trying to install Ubuntu 18.04.3 (without modifications) on some of our Supermicro servers. I downloaded the ISO and ran the copycds command on it.

root@gate:~# lsdef -t osimage ubuntu18.04.3-x86_64-install-compute

Object name: ubuntu18.04.3-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osname=Linux
    osvers=ubuntu18.04.3
    otherpkgdir=/install/post/otherpkgs/ubuntu18.04.3/x86_64
    pkgdir=/install/ubuntu18.04.3/x86_64
    pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.ubuntu18.04.x86_64.pkglist
    profile=compute
    provmethod=install
    template=/opt/xcat/share/xcat/install/ubuntu/compute.tmpl

Compute node description:

root@gate:~# lsdef mynode

Object name: mynode
    arch=x86_64
    bmc=10.0.1.1
    bmcpassword=ADMIN
    bmcusername=ADMIN
    cons=ipmi
    currchain=boot
    currstate=install ubuntu-18.04-test-x86_64-compute
    getmac=ipmi
    groups=storage
    hostnames=mynode
    installnic=mac
    ip=10.1.1.1
    mac=yy:yy:yy:xx:xx:xx
    mgt=ipmi
    netboot=xnba
    os=ubuntu-18.04-test
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    primarynic=mac
    profile=compute
    provmethod=ubuntu-18.04-test-x86_64-install-compute
    serialport=0
    serialspeed=115200
    status=installing
    statustime=09-25-2019 17:56:37
    usercomment=the system X node definition

Then I make that node boot from the network:

nodeset mynode osimage=ubuntu-18.04-test-x86_64-install-compute
rsetboot mynode net
rpower mynode boot

I tried this on several different Supermicro servers, and every time the servers get stuck here:

[ 7.416015 ] hid-generic....
[ 8.378447 ] mlx4_core 0000:07:00:0 Old device ETS support detected
[ 8.384764 ] mlx4_core 0000:07:00:0 Consider upgrading device FW
[ 8.193779 ] mlx4_core 0000:07:00:0 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)

I recorded a video. When booted "normally" (the servers already had Ubuntu 18.04.3 installed from the same ISO), the boot process prints something about raid6 after the mlx4_core lines and then proceeds to boot Ubuntu itself. So I decided to blacklist both the mlx4_* and raid modules:

chdef mynode -m addkcmdline="modprobe.blacklist=mlx4_core,mlx4_en,mlx4_ib"

That didn't help, except that I no longer see the mlx4_* lines during boot before it freezes (since I blacklisted them, obviously). So something else is wrong with provisioning.

Could you please help me? Being able to install ISO remotely is critical for me at the moment and xCAT seems to be the best solution. The only thing I need is to make it work!

cxhong commented 4 years ago

Which version of xcatd are you running?

Can you run xcatprobe osdeploy -n mynode to capture the output of the node deployment?

MasterGroosha commented 4 years ago

@cxhong xCAT 2.14.6


root@gate:/home/user# xcatprobe osdeploy -n mynode
The install NIC in current server is enp2s0f0                                                                     [INFO]
All nodes to be deployed are valid                                                                                [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[mynode] 09:58:24 Receive DHCPDISCOVER via enp2s0f0
[mynode] 09:58:24 Send DHCPOFFER on 10.1.1.1 back to 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:58:26 DHCPREQUEST for 10.1.1.1 (10.1.1.2 (ip of MN)) from 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:58:26 Send DHCPACK on 10.1.1.1 back to 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:58:26 Via TFTP download xcat/xnba.kpxe
[mynode] 09:58:26 Via TFTP download xcat/xnba.kpxe
[mynode] 09:58:26 Receive DHCPDISCOVER via enp2s0f0
[mynode] 09:58:26 Send DHCPOFFER on 10.1.1.1 back to 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:58:26 DHCPREQUEST for 10.1.1.1 (10.1.1.2 (ip of MN)) from 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:58:26 Via HTTP get /tftpboot/xcat/xnba/nodes/mynode
[mynode] 09:58:26 Via HTTP get /tftpboot/xcat/osimage/ubuntu-18.04-test-x86_64-install-compute/vmlinuz
[mynode] 09:58:27 Via HTTP get /tftpboot/xcat/osimage/ubuntu-18.04-test-x86_64-install-compute/initrd.img
[mynode] 09:58:26 Send DHCPACK on 10.1.1.1 back to 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:59:03 Receive DHCPDISCOVER via enp2s0f0
[mynode] 09:59:03 Send DHCPOFFER on 10.1.1.1 back to 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:59:06 Via HTTP get /install/autoinst/mynode
[mynode] 09:59:07 Via HTTP get /install/autoinst/mynode.pre
[mynode] 09:59:03 DHCPREQUEST for 10.1.1.1 (10.1.1.2 (ip of MN)) from 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 09:59:03 Send DHCPACK on 10.1.1.1 back to 00:25:90:xx:xx:xx via enp2s0f0
[mynode] 10:00:11 ============deployment starting============
[mynode] 10:00:11 Running preseeding early_command Installation script...
[mynode] 10:00:11 Generate partition file...
[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic/Release
[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic/main/binary-amd64/Release
[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic/Release
[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic/main/debian-installer/binary...
[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic/restricted/debian-installer/...
[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic-updates/Release
[mynode] 10:20:57 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/a/apt-setup/apt-mirror-setup_0....
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/a/apt-setup/apt-setup-udeb_0.10...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/a/attr/attr-udeb_2.4.47-2build1...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/b/base-installer/base-installer...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/block-modules-5.0.0...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/b/btrfs-progs/btrfs-progs-udeb_...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/c/clock-setup/clock-setup_0.131...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/c/cryptsetup/cryptsetup-udeb_2....
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/d/debian-installer-utils/di-uti...
[mynode] 10:20:58 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/h/hw-detect/disk-detect_1.117ub...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/lvm2/dmsetup-udeb_1.02.145-4....
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/d/dosfstools/dosfstools-udeb_4....
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/h/hw-detect/driver-injection-di...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/e/e2fsprogs/e2fsprogs-udeb_1.44...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/f/finish-install/finish-install...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/firewire-core-modul...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/fs-core-modules-5.0...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/fs-secondary-module...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/f/fuse/fuse-udeb_2.9.7-1ubuntu1...
[mynode] 10:20:59 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/g/grub-installer/grub-installer...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/g/grub2/grub-mount-udeb_2.02-2u...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/ipmi-modules-5.0.0-...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/j/jfsutils/jfsutils-udeb_1.1.15...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/a/argon2/libargon2-0-udeb_0~201...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/a/attr/libattr1-udeb_2.4.47-2bu...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/libb/libbsd/libbsd0-udeb_0.8.7-...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/c/cryptsetup/libcryptsetup12-ud...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/lvm2/libdevmapper1.02.1-udeb_...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/f/fuse/libfuse2-udeb_2.9.7-1ubu...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/o/open-isns/libisns-nocrypto0-u...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/j/json-c/libjson-c3-udeb_0.12.1...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/lzo2/liblzo2-2-udeb_2.08-1.2_...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/parted/libparted-fs-resize0-u...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/parted/libparted2-udeb_3.2-20...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/popt/libpopt0-udeb_1.16-11_am...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/libu/libusb-1.0/libusb-1.0-0-ud...
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/libz/libzstd/libzstd1-udeb_1.3....
[mynode] 10:21:00 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/live-installer/live-installer...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/lvm2/lvm2-udeb_2.02.176-4.1ub...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/md-modules-5.0.0-23...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/m/mdadm/mdadm-udeb_4.1~rc1-3~ub...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/message-modules-5.0...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/r/reiserfsprogs/mkreiserfs-udeb...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/n/nobootloader/nobootloader_1.5...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/n/ntfs-3g/ntfs-3g-udeb_2017.3.2...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/o/open-iscsi/open-iscsi-udeb_2....
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/o/os-prober/os-prober-udeb_1.74...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/parport-modules-5.0...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partconf/partconf-mkfstab_1.5...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-auto/partman-auto_134...
[mynode] 10:21:01 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-auto-crypto/partman-a...
[mynode] 10:21:02 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-auto-loop/partman-aut...
[mynode] 10:21:02 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-auto-lvm/partman-auto...
[mynode] 10:21:02 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-auto-raid/partman-aut...
[mynode] 10:21:02 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-base/partman-base_192...
[mynode] 10:21:03 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-basicfilesystems/part...
[mynode] 10:21:03 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-basicmethods/partman-...
[mynode] 10:21:03 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-btrfs/partman-btrfs_2...
[mynode] 10:21:03 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-crypto/partman-crypto...
[mynode] 10:21:04 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-crypto/partman-crypto...
[mynode] 10:21:04 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-efi/partman-efi_71ubu...
[mynode] 10:21:04 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-ext3/partman-ext3_86u...
[mynode] 10:21:04 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-iscsi/partman-iscsi_4...
[mynode] 10:21:05 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-jfs/partman-jfs_52_al...
[mynode] 10:21:05 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-lvm/partman-lvm_123_a...
[mynode] 10:21:06 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-md/partman-md_86_all....
[mynode] 10:21:06 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-partitioning/partman-...
[mynode] 10:21:06 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-swapfile/partman-swap...
[mynode] 10:21:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-target/partman-target...
[mynode] 10:21:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-base/partman-utils_19...
[mynode] 10:21:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/partman-xfs/partman-xfs_63_al...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/pata-modules-5.0.0-...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/pcmcia-storage-modu...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/p/pkgsel/pkgsel_0.43ubuntu2_all...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/plip-modules-5.0.0-...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/ppp-modules-5.0.0-2...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/r/rdate/rdate-udeb_1.2-6_amd64....
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/sata-modules-5.0.0-...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-firmware/scsi-firmware_...
[mynode] 10:21:08 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/l/linux-hwe/scsi-modules-5.0.0-...
[mynode] 10:21:09 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/t/tzsetup/tzsetup-udeb_0.94ubun...
[mynode] 10:21:09 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/u/usbutils/usbutils-udeb_007-4b...
[mynode] 10:21:09 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/u/user-setup/user-setup-udeb_1....
[mynode] 10:21:09 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/x/xfsprogs/xfsprogs-udeb_4.9.0+...
60 minutes have expired, stop monitoring                                                                          [INFO]
======================  Summary  =====================
There is 1 node provision failures
mynode : stop at stage 'download_kickstart'     

On the display attached to that server I see the same lines as in my first post (with the mlx4_core line as the last one).
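The probe output above is not printed in strict time order, which makes it hard to see where the install actually stalls. A minimal sketch, assuming the `[node] HH:MM:SS message` line shape shown above and a capture that does not cross midnight, sorts the events by timestamp and reports the largest gap. The function name `largest_gap` and the short `SAMPLE` excerpt are illustrative only and are not part of xCAT.

```python
import re
from datetime import datetime

# Matches lines such as: [mynode] 10:00:11 Generate partition file...
LINE_RE = re.compile(r"^\[(?P<node>[^\]]+)\]\s+(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<msg>.*)$")


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the longest pause."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            events.append((datetime.strptime(m["time"], "%H:%M:%S"), m["msg"]))
    # Stable sort: lines with equal timestamps keep their original order.
    events.sort(key=lambda e: e[0])
    best = None
    for (t0, msg0), (t1, msg1) in zip(events, events[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


# A few lines copied from the probe output above.
SAMPLE = [
    "[mynode] 10:00:11 ============deployment starting============",
    "[mynode] 10:00:11 Generate partition file...",
    "[mynode] 09:59:07 Via HTTP get /install/ubuntu-18.04-test/x86_64/dists/bionic-updates/Release",
    "[mynode] 10:20:57 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/a/apt-setup/apt-mirror-setup_0....",
    "[mynode] 10:21:09 Via HTTP get /install/ubuntu-18.04-test/x86_64/pool/main/x/xfsprogs/xfsprogs-udeb_4.9.0+...",
]

gap, before, after = largest_gap(SAMPLE)
print(f"longest pause: {gap:.0f} s, between '{before}' and '{after}'")
```

On this excerpt the longest pause falls between the partition-file step and the first package download, which is only a hint about where to look, not a diagnosis.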

cxhong commented 4 years ago

The image you showed me is ubuntu18.04.3-x86_64-install-compute, but the node definition uses the image ubuntu-18.04-test-x86_64-install-compute. I just want to make sure those are the same image.

We don't have x86_64 Supermicro servers. Do you know whether the system works with an older version of Ubuntu?

MasterGroosha commented 4 years ago

I haven't tried Ubuntu 16.04 yet (since I really need exactly 18.04), but these servers are already running Ubuntu 18.04.3, installed manually, without any issues. Only re-installing via xCAT produces this problem. Also, ubuntu-18.04-test is just my name for the image; basically it uses this ISO from the official Ubuntu website:

root@gate:~# lsdef -t osimage ubuntu-18.04-test-x86_64-install-compute
Object name: ubuntu-18.04-test-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osname=Linux
    osvers=ubuntu-18.04-test
    otherpkgdir=/install/post/otherpkgs/ubuntu-18.04-test/x86_64
    pkgdir=/install/ubuntu-18.04-test/x86_64
    pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.pkglist
    profile=compute
    provmethod=install
    template=/opt/xcat/share/xcat/install/ubuntu/compute.tmpl

cxhong commented 4 years ago

There should be files for this compute node under /tftpboot/xcat/xnba/nodes. Can you cat them?

cxhong commented 4 years ago

I just looked at your osimage; this will be a problem:

pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.pkglist

needs to be

pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.ubuntu18.04.x86_64.pkglist

ubuntu18.04.3-x86_64-install-compute has the correct pkglist; ubuntu-18.04-test-x86_64-install-compute has a different pkglist, and it's incorrect.
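The comparison above can be made mechanical. The following is a minimal sketch that parses two `lsdef -t osimage` outputs (pasted here as strings, copied from this thread) and lists the attributes that differ. The helper names `parse_lsdef` and `diff_attrs` are hypothetical and are not xCAT commands or APIs.

```python
def parse_lsdef(text):
    """Collect 'key=value' attribute lines from lsdef-style output."""
    attrs = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition("=")
        # Blank lines and the "Object name: ..." header contain no '=' and are skipped.
        if sep:
            attrs[key] = value
    return attrs


def diff_attrs(a, b):
    """Return {attribute: (value_in_a, value_in_b)} for attributes that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Image definition generated for the stock ISO (from the first post).
stock = parse_lsdef("""
Object name: ubuntu18.04.3-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osname=Linux
    osvers=ubuntu18.04.3
    otherpkgdir=/install/post/otherpkgs/ubuntu18.04.3/x86_64
    pkgdir=/install/ubuntu18.04.3/x86_64
    pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.ubuntu18.04.x86_64.pkglist
    profile=compute
    provmethod=install
    template=/opt/xcat/share/xcat/install/ubuntu/compute.tmpl
""")

# Image definition the node is actually provisioned with.
custom = parse_lsdef("""
Object name: ubuntu-18.04-test-x86_64-install-compute
    imagetype=linux
    osarch=x86_64
    osname=Linux
    osvers=ubuntu-18.04-test
    otherpkgdir=/install/post/otherpkgs/ubuntu-18.04-test/x86_64
    pkgdir=/install/ubuntu-18.04-test/x86_64
    pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.pkglist
    profile=compute
    provmethod=install
    template=/opt/xcat/share/xcat/install/ubuntu/compute.tmpl
""")

diff = diff_attrs(stock, custom)
for key, (left, right) in diff.items():
    print(f"{key}:\n    stock:  {left}\n    custom: {right}")
```

The directory attributes differ simply because the image was given a custom name; the pkglist difference is the one called out above.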

MasterGroosha commented 4 years ago

root@gate:/tftpboot/xcat/xnba/nodes# cat mynode
#!gpxe
#install ubuntu-18.04-test-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/ubuntu-18.04-test-x86_64-install-compute/vmlinuz
imgload kernel
imgargs kernel nofb utf8 auto url=http://${next-server}:80/install/autoinst/mynode xcatd=${next-server} mirror/http/hostname=${next-server}:80 netcfg/choose_interface=00:25:90:xx:xx:xx console=tty0 console=ttyS0,115200 locale=en_US priority=critical hostname=mynode live-installer/net-image=http://${next-server}:80/install/ubuntu-18.04-test/x86_64/install/filesystem.squashfs BOOTIF=01-${netX/machyp}
imgfetch http://${next-server}:80/tftpboot/xcat/osimage/ubuntu-18.04-test-x86_64-install-compute/initrd.img
imgexec kernel

root@gate:/tftpboot/xcat/xnba/nodes# cat mynode.elilo
default="xCAT"
delay=0

image=/tftpboot/xcat/osimage/ubuntu-18.04-test-x86_64-install-compute/vmlinuz
   label="xCAT"
   initrd=/tftpboot/xcat/osimage/ubuntu-18.04-test-x86_64-install-compute/initrd.img
   append="nofb utf8 auto url=http://%N:80/install/autoinst/mynode xcatd=%N mirror/http/hostname=%N:80 netcfg/choose_interface=00:25:90:xx:xx:xx console=tty0 console=ttyS0,115200 locale=en_US priority=critical hostname=mynode live-installer/net-image=http://%N:80/install/ubuntu-18.04-test/x86_64/install/filesystem.squashfs BOOTIF=%B"

root@gate:/tftpboot/xcat/xnba/nodes# cat mynode.uefi
#!gpxe
chain http://${next-server}:80/tftpboot/xcat/elilo-x64.efi -C /tftpboot/xcat/xnba/nodes/mynode.elilo

@cxhong

"I just looked at your osimage; this will be a problem:"

Hmm... I didn't edit any pkglist variables. Should it be edited manually? Anyway, I'll try to edit it just like you said. Where should I edit it?

MasterGroosha commented 4 years ago

I did chdef -t osimage ubuntu-18.04-test-x86_64-install-compute pkglist=/opt/xcat/share/xcat/install/ubuntu/compute.ubuntu18.04.x86_64.pkglist, hope this is the right one

MasterGroosha commented 4 years ago

btw, that didn't help. Same error as before.

cxhong commented 4 years ago

@MasterGroosha, any luck here? Did you try an older Ubuntu release yet?

MasterGroosha commented 4 years ago

@cxhong Sorry for the long delay! Recently I was able to do some experiments on servers, three different ones from Supermicro.

In one of the three cases, installation of both 16.04.6 and 18.04.3 finished correctly. In the other two cases, however, both the 16.04.6 and the 18.04.3 installation hung completely.

I monitored all installations with the xcatprobe osdeploy -n {nodename} command. The output was trimmed a bit (osdeploy doesn't support the -w argument), but in all "failure" cases the installation hung after these lines:

Via HTTP get /install/ubuntu18.04.3/x86_64/pool/main/g/grub-gfxpayload-lists/grub-gfxpayloa...
Via HTTP get /install/ubuntu16.04.6/x86_64/pool/main/g/grub2/grub-pc_2.02%7ebeta2-36ubuntu3...

As you may have noticed, I copied these lines from different terminals and from different OS installations. I guess something is wrong with GRUB, so that it cannot be installed. Any ideas?

cxhong commented 4 years ago

I wonder if there is a configuration issue or more than one DHCP server. Can you run:

xcatprobe xcatmn -i enp2s0f0
xcatprobe detect_dhcpd -i enp2s0f0 -m 00:25:90:xx:xx:xx
tabdump networks
nslookup mynode

MasterGroosha commented 4 years ago

root@gate:~# xcatprobe xcatmn -i enp2s0f0
[mn]: Checking all xCAT daemons are running...  [ OK ]
[mn]: Checking xcatd can receive command request...   [ OK ]
[mn]: Checking 'site' table is configured...  [ OK ]
[mn]: Checking provision network is configured...  [FAIL]
[mn]: The IP 10.1.3.101
 10.1.3.201   10.1.1.1 of enp2s0f0 doesn't equal the value of 'master' in 'site' table                                         
[mn]: IP 10.1.3.101
 10.1.3.201   10.1.1.1 of enp2s0f0 doesn't belong to any network defined in 'networks' table
=================================== SUMMARY ====================================
[MN]: Checking on MN...   [FAIL]
    Checking provision network is configured...  [FAIL]
        The IP 10.1.3.101
        IP 10.1.3.101

That's a strange "error", since enp2s0f0 has all three IPs (10.1.1.1, 10.1.3.101 and 10.1.3.201) and the current server is reachable on all of them.

root@gate:~# xcatprobe detect_dhcpd -i enp2s0f0 -m 00:25:90:xx:xx:xx
Start to detect DHCP, please wait 10 seconds                                                                      [INFO]
++++++++++++++++++++++++++++++++++                                                                                [INFO]
There are 1 servers replied to dhcp discover.                                                                     [INFO]
    Server:10.1.3.101 assign IP [10.1.3.2]. The next server is [10.1.3.101]!                                      [INFO]
++++++++++++++++++++++++++++++++++   

Looks ok, doesn't it?

root@gate:~# tabdump networks
#netname,net,mask,mgtifname,gateway,dhcpserver,tftpserver,nameservers,ntpservers,logservers,dynamicrange,staticrange,staticrangeincrement,nodehostname,ddnsdomain,vlanid,domain,mtu,comments,disable
"10_0_0_0-255_0_0_0","10.0.0.0","255.0.0.0","enp2s0f0","<xcatmaster>",,"<xcatmaster>",,,,,,,,,,,"1500",,
"192_168_1_0-255_255_255_0","192.168.1.0","255.255.255.0","enp2s0f1","192.168.1.1",,"<xcatmaster>",,,,,,,,,,,"1500",,

Also looks ok.

root@gate:~# nslookup mynode
Server:         10.1.1.1
Address:        10.1.1.1#53

Name:   mynode
Address: 10.1.3.2

MasterGroosha commented 4 years ago

There's another "strange" thing (at least, for me). After I sent my previous comment (the last one before yours), I managed to "install" a node once. However, to do that I had to boot from a GParted Live USB and erase all partitions on all disks. Otherwise the installation would hang on or after grub-gfxpayload. I guess some postinstall script is executed that hangs the installation.

cxhong commented 4 years ago

One interface has three IP addresses? You didn't use VLANs to manage them? It looks like xCAT is confused about the master IP address. It can get into a race condition, which would cause the provisioning problem you have seen: sometimes it works and sometimes it fails. You should use virtual interfaces, like this:

enp2s0f0      10.1.1.1
enp2s0f0:0    10.1.3.101
enp2s0f0:1    10.1.3.201

So the site master should be 10.1.1.1 and dhcpinterface should be enp2s0f0.

MasterGroosha commented 4 years ago

@cxhong Excuse me, could you please explain a bit further what you mean by "set virtual interface"? Where should I set it? Currently I'm using netplan to manage the network on all servers.

cxhong commented 4 years ago

I'm not a network expert; check this link to see if it helps: https://www.tecmint.com/create-multiple-ip-addresses-to-one-single-network-interface/

How did you set it up with netplan? Can you show me the output of ip addr show?

MasterGroosha commented 4 years ago

@cxhong Here's my netplan config for the "gate" server (which is the main node for xCAT):

network:
    ethernets:
        enp2s0f0:
            dhcp4: false
            addresses: [ 10.1.3.101/8, 10.1.3.201/8, 10.1.1.1/8 ]
            nameservers:
                addresses: [ 10.1.1.1 ]
                search: [ {my local network domain} ]
    version: 2

MasterGroosha commented 4 years ago

I'm not sure I've solved the issue; however, changing master, domain, etc. to 10.1.1.1 (which is the nameserver address), without even touching makedhcp or anything like that, did the trick. Again, I'm not sure this is the right way.

cxhong commented 4 years ago

Three IP addresses with the same netmask will confuse xCAT. Internally, the xCAT code uses getipaddress(), gethostname(), and the ifconfig or ip addr commands, and it may not parse the result correctly when multiple values come back. I suggest you at least use a different netmask for each address.
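To make the ambiguity concrete, here is a minimal sketch using Python's standard ipaddress module. It groups the addresses from the netplan config above by the network they fall into and flags any network that holds more than one of them. It does not reproduce xCAT's internal logic; the function name `addresses_by_network` is an illustrative assumption.

```python
import ipaddress
from collections import defaultdict


def addresses_by_network(cidrs):
    """Group interface addresses (in CIDR notation) by the network they belong to."""
    groups = defaultdict(list)
    for cidr in cidrs:
        iface = ipaddress.ip_interface(cidr)
        groups[iface.network].append(iface.ip)
    return dict(groups)


# Addresses configured on enp2s0f0 in the netplan file above.
addrs = ["10.1.3.101/8", "10.1.3.201/8", "10.1.1.1/8"]

groups = addresses_by_network(addrs)

# Any network with more than one local address is a candidate for confusion:
# a tool asking "which IP do I have on this network?" gets several answers.
ambiguous = {net: ips for net, ips in groups.items() if len(ips) > 1}

for net, ips in ambiguous.items():
    print(f"{net}: {len(ips)} addresses -> {', '.join(str(ip) for ip in ips)}")
```

With the /8 mask, all three addresses land in 10.0.0.0/8, the same network that appears in the networks table shown earlier, so nothing in the addressing itself singles out 10.1.1.1 as the master address.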