xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

ubuntu 18.04.4 and 18.04.5 hangs at initrd #6811

Open CJCShadowsan opened 4 years ago

CJCShadowsan commented 4 years ago

Hi,

Installations of both 18.04.4 (which was DGX-OS 4.5) and 18.04.5 (official Ubuntu) hang at the initrd stage:

[root@service01 install]# xcatprobe osdeploy -n gpu05
The install NIC in current server is bond0.101                                                                    [INFO]
All nodes to be deployed are valid                                                                                [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[gpu05] 13:50:30 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:50:30 Send DHCPOFFER on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:33 DHCPREQUEST for 172.17.4.5 (172.17.1.211) from d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:33 Send DHCPACK on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:33 Via TFTP download xcat/xnba.efi
[gpu05] 13:50:34 Via TFTP download xcat/xnba.efi
[gpu05] 13:50:34 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:50:34 Send DHCPOFFER on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:34 DHCPREQUEST for 172.17.4.5 (172.17.1.211) from d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:34 Send DHCPACK on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/xnba/nodes/gpu05.uefi
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/elilo-x64.efi
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/xnba/nodes/gpu05.elilo
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
^CGet INT or TERM signal from STDIN
======================  Summary  =====================
There is 1 node provision failures
gpu05 : stop at stage 'download_initrd'                                                                           [FAIL]

And both go no further. They also reboot, and the process continues ad infinitum so it's clear something is crashing. Is there some sort of memory limitation on initrd?

[root@service01 install]# xcatprobe osdeploy -n gpu05
The install NIC in current server is bond0.101                                                                    [INFO]
All nodes to be deployed are valid                                                                                [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[gpu05] 13:50:30 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:50:30 Send DHCPOFFER on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:33 DHCPREQUEST for 172.17.4.5 (172.17.1.211) from d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:33 Send DHCPACK on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:33 Via TFTP download xcat/xnba.efi
[gpu05] 13:50:34 Via TFTP download xcat/xnba.efi
[gpu05] 13:50:34 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:50:34 Send DHCPOFFER on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:34 DHCPREQUEST for 172.17.4.5 (172.17.1.211) from d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:34 Send DHCPACK on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/xnba/nodes/gpu05.uefi
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/elilo-x64.efi
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/xnba/nodes/gpu05.elilo
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
[gpu05] 13:53:37 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:53:37 Send DHCPOFFER on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:53:40 DHCPREQUEST for 172.17.4.5 (172.17.1.211) from d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:53:40 Send DHCPACK on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:53:40 Via TFTP download xcat/xnba.efi
[gpu05] 13:53:41 Via TFTP download xcat/xnba.efi
[gpu05] 13:53:41 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:53:41 Send DHCPOFFER on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:53:41 DHCPREQUEST for 172.17.4.5 (172.17.1.211) from d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:53:41 Send DHCPACK on 172.17.4.5 back to d8:c4:97:b8:32:cb via bond0.101
[gpu05] 13:53:41 Via HTTP get /tftpboot/xcat/xnba/nodes/gpu05.uefi
[gpu05] 13:53:41 Via HTTP get /tftpboot/xcat/elilo-x64.efi
[gpu05] 13:53:41 Via HTTP get /tftpboot/xcat/xnba/nodes/gpu05.elilo
[gpu05] 13:53:41 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
[gpu05] 13:53:42 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
^CGet INT or TERM signal from STDIN
======================  Summary  =====================
There is 1 node provision failures
gpu05 : stop at stage 'download_initrd'                                                                           [FAIL]

OS image used is http://cdimage.ubuntu.com/releases/18.04.5/release/ubuntu-18.04.5-server-amd64.iso

This is pretty breaking - I'm trying to deploy Ubuntu from a CentOS server and it's needed.

Note that genesis boots fine on this DGX-1 server - so this is definitely an Ubuntu and xCAT issue.
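The reboot loop in the second capture can be measured rather than eyeballed. The following is a minimal sketch, assuming the `xcatprobe osdeploy` output has been saved as text; the helper names `initrd_fetch_times` and `reboot_intervals` are hypothetical and not part of xCAT. It counts how often the node re-fetches `initrd.img` and how many seconds pass between fetches (timestamps that cross midnight are not handled).

```python
import re
from datetime import datetime

# Matches lines such as "[gpu05] 13:50:34 Via HTTP get /tftpboot/.../initrd.img"
LINE_RE = re.compile(
    r"^\[(?P<node>[^\]]+)\]\s+(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<msg>.*)$"
)


def initrd_fetch_times(log_text, node):
    """Return the times at which `node` fetched an initrd over HTTP."""
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("node") != node:
            continue
        msg = m.group("msg")
        if msg.startswith("Via HTTP get") and msg.endswith("initrd.img"):
            times.append(datetime.strptime(m.group("time"), "%H:%M:%S"))
    return times


def reboot_intervals(log_text, node):
    """Seconds between consecutive initrd fetches for `node`.

    A non-empty result means the node network-booted more than once.
    """
    times = initrd_fetch_times(log_text, node)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Lines copied from the capture above (shortened to the relevant ones).
sample = """\
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
[gpu05] 13:50:34 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
[gpu05] 13:53:37 Receive DHCPDISCOVER via bond0.101
[gpu05] 13:53:42 Via HTTP get /tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
"""

print(reboot_intervals(sample, "gpu05"))  # [188]
```

An interval of roughly three minutes between fetches is consistent with the node resetting and going through POST again, not with an installer making progress.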

CJCShadowsan commented 4 years ago

The elilo in question for 18.04.5:

[root@service01 install]# cat /tftpboot/xcat/xnba/nodes/gpu05.elilo
default="xCAT"
delay=0

image=/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
   label="xCAT"
   initrd=/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
   append="nofb utf8 auto url=http://%N:80/install/autoinst/gpu05 xcatd=%N mirror/http/hostname=%N:80 netcfg/choose_interface=d8:c4:97:b8:32:cb console=tty0 console=ttyS0,115200 locale=en_US priority=critical hostname=gpu05 live-installer/net-image=http://%N:80/install/ubuntu18.04.5/x86_64/install/filesystem.squashfs rd.drive.blacklist=nouveau nouveau.modeset=0 ask_detect=false ramdisk_size=100000 nvme-core.multipath=n  BOOTIF=%B"

I also get zero output from any tty to aid in debugging - which I appreciate is definitely not ideal!

CJCShadowsan commented 4 years ago

Also note the same ISOs both install over USB fine...

cxhong commented 4 years ago

Can you see any messages in the xCAT logs (/var/log/xcat/cluster.log or compute.log) for this node? How about the end of the console output? You can also find it in /var/log/consoles/<nodename>.

CJCShadowsan commented 4 years ago

No, nothing inside the cluster.log - or the compute.log.

Output from the console log:

Version 2.17.1249. Copyright (C) 2019 American Megatrends, Inc.
BIOS Date: 02/12/2019 16:14:45 Ver: S2W_3A08
Press <DEL> or <F2> to enter setup.
Press <F11> for BBS POPUP menu.
Press <F12> if you want to boot from the network.

CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Speed: 2200MHz
Total Memory: 512GB (DDR4 2133)

USB Devices total: 2 KBDs, 1 MICE, 0 MASS, 3 HUBs

[2020-08-25T12:45:12Z] >>Checking Media Presence......
[2020-08-25T12:45:12Z] >>Media Present......
[2020-08-25T12:45:12Z] >>Start PXE over IPv4.
[2020-08-25T12:45:16Z]   Station IP address is 172.17.4.5
[2020-08-25T12:45:16Z]
[2020-08-25T12:45:16Z]   Server IP address is 172.17.1.211
[2020-08-25T12:45:16Z]   NBP filename is xcat/xnba.efi
[2020-08-25T12:45:16Z]   NBP filesize is 139200 Bytes

[2020-08-25T12:45:16Z] >>Checking Media Presence......
[2020-08-25T12:45:16Z] >>Media Present......
[2020-08-25T12:45:16Z]  Downloading NBP file...
[2020-08-25T12:45:16Z]
[2020-08-25T12:45:16Z]   Succeed to download NBP file.
[2020-08-25T12:45:16Z] xNBA initialising devices...ok
[2020-08-25T12:45:16Z]
[2020-08-25T12:45:16Z]
[2020-08-25T12:45:16Z] xCAT Network Boot Agent
[2020-08-25T12:45:16Z] iPXE 1.0.3-131028 (d603e) -- Open Source Network Boot Firmware -- http://ipxe.org
[2020-08-25T12:45:16Z] Features: HTTP HTTPS iSCSI DNS TFTP EFI
[2020-08-25T12:45:16Z] net0: d8:c4:97:b8:32:cb using <NULL> on EFI SNP (open)
[2020-08-25T12:45:16Z]   [Link:up, TX:0 TXE:0 RX:0 RXE:0]
[2020-08-25T12:45:16Z] DHCP (net0 d8:c4:97:b8:32:cb)... ok
[2020-08-25T12:45:16Z] net0: 172.17.4.5/255.255.0.0 gw 172.17.1.211
[2020-08-25T12:45:16Z] Next server: 172.17.1.211
[2020-08-25T12:45:16Z] Filename: http://172.17.1.211:80/tftpboot/xcat/xnba/nodes/gpu05.uefi
[2020-08-25T12:45:16Z] http://172.17.1.211:80/tftpboot/xcat/xnba/nodes/gpu05.uefi... ok
[2020-08-25T12:45:16Z] http://172.17.1.211:80/tftpboot/xcat/elilo-x64.efi... ok
[2020-08-25T12:45:17Z] ELILO v3.14 for EFI/x86_64
[2020-08-25T12:45:17Z] Loading kernel /tftpboot/xcat/osimage/dgx-4.5-install-compute/vmlinuz...  done
[2020-08-25T12:45:23Z] Loading file /tftpboot/xcat/osimage/dgx-4.5-install-compute/initrd.img...done
[2020-08-25T12:47:28Z]
Version 2.17.1249. Copyright (C) 2019 American Megatrends, Inc.
BIOS Date: 02/12/2019 16:14:45 Ver: S2W_3A08
Press <DEL> or <F2> to enter setup.
Press <F11> for BBS POPUP menu.
Press <F12> if you want to boot from the network.

That is pretty much what I see during boot, and it is the same for a stock Ubuntu 18.04.5 ISO as well.

cxhong commented 4 years ago

It seems to finish loading the kernel and initrd; the xCAT MN should receive another DHCP request after that. Maybe check whether another server got the request:

xcatprobe detect_dhcpd -i <provision interface> -m d8:c4:97:b8:32:cb

Also, in the current xCAT development builds, xCAT skips elilo (https://github.com/xcat2/xcat-core/pull/6772). You can manually change the .uefi file after nodeset:

root@c910f04x35v05:/tftpboot/xcat/xnba/nodes# cat c910f04x35v07.uefi
#!gpxe
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
imgload kernel
imgargs kernel nofb utf8 auto url=http://c910f04x35v05:80/install/autoinst/c910f04x35v07 xcatd=c910f04x35v05 mirror/http/hostname=c910f04x35v05:80 netcfg/choose_interface=42:e7:0a:04:23:07 console=tty0 console=ttyS0,115200n8r locale=en_US priority=critical hostname=c910f04x35v07 live-installer/net-image=http://c910f04x35v05:80/install/ubuntu18.04.5/x86_64/install/filesystem.squashfs BOOTIF=01-${netX/mac:hexhyp} initrd=initrd
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
imgexec kernel
root@c910f04x35v05:/tftpboot/xcat/xnba/nodes# cat c910f04x35v07
#!gpxe
#install ubuntu18.04.5-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/vmlinuz
imgload kernel
imgargs kernel nofb utf8 auto url=http://c910f04x35v05:80/install/autoinst/c910f04x35v07 xcatd=c910f04x35v05 mirror/http/hostname=c910f04x35v05:80 netcfg/choose_interface=42:e7:0a:04:23:07 console=tty0 console=ttyS0,115200n8r locale=en_US priority=critical hostname=c910f04x35v07 live-installer/net-image=http://c910f04x35v05:80/install/ubuntu18.04.5/x86_64/install/filesystem.squashfs BOOTIF=01-${netX/machyp}
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute/initrd.img
imgexec kernel

CJCShadowsan commented 4 years ago

Just so I can clarify - the only difference between these two files is the initrd=initrd bit at the end of line 4, correct?

And that's meant to be there in the .uefi file?
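A question like this can be checked mechanically by comparing the two `imgargs` lines token by token. This is a minimal sketch; `arg_diff` is a hypothetical helper, and the argument strings are shortened from the two files quoted in this thread (the arguments common to both are abbreviated, the differing ones are copied as-is).

```python
def arg_diff(args_a, args_b):
    """Return (only_in_a, only_in_b) for two whitespace-separated argument strings."""
    a, b = args_a.split(), args_b.split()
    only_a = [tok for tok in a if tok not in b]
    only_b = [tok for tok in b if tok not in a]
    return only_a, only_b


# Arguments shared by both files (abbreviated for the example).
common = "nofb utf8 auto console=tty0 console=ttyS0,115200n8r locale=en_US priority=critical"

# Tail of the imgargs line in the .uefi file and in the file without the suffix.
uefi_args = common + " BOOTIF=01-${netX/mac:hexhyp} initrd=initrd"
plain_args = common + " BOOTIF=01-${netX/machyp}"

only_uefi, only_plain = arg_diff(uefi_args, plain_args)
print(only_uefi)   # ['BOOTIF=01-${netX/mac:hexhyp}', 'initrd=initrd']
print(only_plain)  # ['BOOTIF=01-${netX/machyp}']
```

On the lines as quoted, the comparison reports the trailing `initrd=initrd` and also a different spelling of the `BOOTIF` variable; the second file additionally carries an `#install ...` comment line.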

CJCShadowsan commented 4 years ago

Ok - so with the second item below, I get:

[   13.493678] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6
[   13.507497] Please append a correct "root=" boot option; here are the available partitions:
[   13.522299] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[   13.537120] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.3.0-28-generic #30~18.04.1-Ubuntu
[   13.551959] Hardware name: NVIDIA DGX-1 with V100-32-MaxQ/DGX-1 with V100-32-MaxQ, BIOS S2W_3A08 02/14/2019
[   13.567768] u usb 1-7: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[   13.592994]  dump_stack+0x6d/0x95
[   13.607291] usb 1-7: Product: USB2.0 Hub
[   13.617782]  panic+0xfe/0x2d4
[   13.617786]  mount_block_root+0x1f4/0x2db
[   13.629234] hub 1-7:1.0: USB hub found
[   13.638585]  ? set_debug_rodata+0x17/0x17
[   13.638587]  mount_root+0x38/0x3a
[   13.638588]  prepare_namespace+0x139/0x18e
[   13.638589]  kernel_init_freeable+0x245/0x26d
[   13.649772] hub 1-7:1.0: 4 ports detected
[   13.659878]  ? rest_init+0xb0/0xb0
[   13.659879]  kernel_init+0xe/0x110
[   13.672901] usb 1-6.4: new low-speed USB device number 5 using xhci_hcd
[   13.680156]  ret_from_fork+0x35/0x40
[   13.690644] Kernel Offset: 0xc800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   13.800659] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---

So at least I'm getting some output now... But it seems to indicate I've not got a root fs to mount, and then that's it.

CJCShadowsan commented 4 years ago

Confirmed - initrd=initrd needs to be initrd=initrd.img - and then things spring into life.
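For anyone who needs to apply the same change to several node files, here is a minimal sketch of the edit as a text transformation. `fix_initrd_arg` is a hypothetical helper, not an xCAT tool, and the example script is shortened from the .uefi file quoted earlier. Since the file is edited by hand after nodeset, it is probably regenerated whenever nodeset runs again, so the change would need to be re-applied; that is an assumption, not something confirmed in this thread.

```python
import re

# Match "initrd=initrd" only as a whole token, so a line that already says
# "initrd=initrd.img" is left alone.
_BARE_INITRD = re.compile(r"(?<!\S)initrd=initrd(?!\S)")


def fix_initrd_arg(script_text):
    """Rewrite initrd=initrd to initrd=initrd.img on imgargs lines only."""
    out = []
    for line in script_text.splitlines(keepends=True):
        if line.lstrip().startswith("imgargs "):
            line = _BARE_INITRD.sub("initrd=initrd.img", line)
        out.append(line)
    return "".join(out)


base = "http://${next-server}:80/tftpboot/xcat/osimage/ubuntu18.04.5-x86_64-install-compute"
before = (
    "#!gpxe\n"
    "imgfetch -n kernel " + base + "/vmlinuz\n"
    "imgload kernel\n"
    "imgargs kernel nofb utf8 auto BOOTIF=01-${netX/mac:hexhyp} initrd=initrd\n"
    "imgfetch -n initrd " + base + "/initrd.img\n"
    "imgexec kernel\n"
)

after = fix_initrd_arg(before)
print(after.splitlines()[3])
# imgargs kernel nofb utf8 auto BOOTIF=01-${netX/mac:hexhyp} initrd=initrd.img
```

The function only touches `imgargs` lines, so the `imgfetch -n initrd` line is unchanged, and running it twice gives the same result as running it once.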

Is this actually going to be default behaviour for #6772 going forwards? What's the ETA?

cxhong commented 4 years ago

That's good news.
I will go over this, and any changes we make will be in the development build. Our next release, 2.16.1, will be at the end of October.

cxhong commented 4 years ago

@CJCShadowsan, just want to confirm again: you only needed to change initrd=initrd to initrd=initrd.img in the .uefi file, right?

CJCShadowsan commented 4 years ago

Yes, that's what I did and then it worked.

CJCShadowsan commented 3 years ago

@cxhong - did this make it into 2.16.1?

besawn commented 3 years ago

@CJCShadowsan What xCAT version (lsxcatd -v) were you using when you had this problem originally? I have been trying to reproduce on the latest xCAT development builds and I want to compare the code changes between the release you were using and what is in current master to try to determine if this is still a problem or not.

besawn commented 3 years ago

I was not able to reproduce the problem originally described in this issue on my test hardware, but two fixes that might be relevant were included with xCAT 2.16.2:

1. xNBA was rebuilt to match upstream iPXE 1.20.1.
2. The missing file elilo-x86.efi has been added to elilo-xcat-3.14-6_all.deb.

Can you please provide feedback about whether xCAT 2.16.2 resolves this problem in your cluster?