wenyi0421 / turing-pi

Turing-pi BMC firmware
https://turingpi.com/

Node (slot) 3 won't boot Talos OS on cm4 #13

Open bhuism opened 1 year ago

bhuism commented 1 year ago

serial log:

RPI Compute Module 4 (0xd03141)
Core:  209 devices, 16 uclasses, devicetree: board
MMC:   mmcnr@7e300000: 1, mmc@7e340000: 0
Loading Environment from FAT... Unable to read "uboot.env" from mmc0:1... 
In:    serial
Out:   vidconsole
Err:   vidconsole
Net:   eth0: ethernet@7d580000
PCIe BRCM: link up, 5.0 Gbps x1 (SSC)
"Error" handler, esr 0xbf000002
elr: 00000000000af544 lr : 00000000000af500 (reloc)
elr: 000000003df81544 lr : 000000003df81500
x0 : 000000000000dead x1 : 0000000000100000
x2 : 0000000000008000 x3 : 00000000fd508000
x4 : 0000000000000000 x5 : 0000000000000001
x6 : 000000003df82aac x7 : 000000003db40890
x8 : 0000000000008a6c x9 : 0000000000000008
x10: 000000003db4023c x11: 0000000000000002
x12: 0000000000000140 x13: 000000003db40228
x14: 000000003db40890 x15: 0000000000000000
x16: 000000003df82b84 x17: d4244e8100000000
x18: 000000003db4dd70 x19: 0000000000000001
x20: 000000003db40300 x21: 000000003db5b480
x22: 0000000000000000 x23: 0000000000010000
x24: 000000003dfc60a1 x25: 000000003db5b0b0
x26: 000000000000ffff x27: 0000000000000000
x28: 0000000000000000 x29: 000000003db40260

Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf) 
Resetting CPU ...

wenyi0421 commented 1 year ago

Does the problematic pi work on other slots 1 2 4 or on the official carrier board?

bhuism commented 1 year ago

@wenyi0421 Yes, perfectly. The other 2 daughterboards with CM4 show the same symptoms: all work in all slots except slot 3, and none of them work in slot 3. But since this looks like a u-boot issue, maybe I should post an issue with them. I've written a howto here: https://github.com/bhuism/talos-tpi2 if you want to reproduce.

wenyi0421 commented 1 year ago

Please test with the official Raspberry Pi image, and consider trying a replacement for the CM4 Adapter V1.0; this may be a hardware problem.

bhuism commented 1 year ago

I've replaced the adapter+cm4 already, I have 4 adapters and 3 cm4 modules, none of them work (with Talos using u-boot) in slot 3, all of them work in all other slots.

I'll try Raspberry Pi OS.

krarey commented 1 year ago

I can confirm this same issue, with the same error state (syndrome register value 0xbf000002), on any of my CM4 modules (8GB, Wifi, eMMC) connected to the node3 slot when booting with U-Boot. In my case it is the version packaged in the Fedora IoT 37 image.
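
For readers comparing dumps, the syndrome value can be split into its fields by hand. Below is a minimal sketch, assuming the standard ESR_ELx layout from the Arm architecture manual (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The function name `decode_esr` and the short exception-class table are illustrative only and are not part of U-Boot or the kernel.

```python
# Assumed ESR_ELx field layout (Arm Architecture Reference Manual):
#   bits [31:26] EC  exception class
#   bit  [25]    IL  instruction length
#   bits [24:0]  ISS instruction-specific syndrome
# Only a handful of exception classes are listed here for illustration.
EXCEPTION_CLASSES = {
    0x15: "SVC instruction (AArch64)",
    0x20: "Instruction Abort from a lower EL",
    0x21: "Instruction Abort, same EL",
    0x24: "Data Abort from a lower EL",
    0x25: "Data Abort, same EL",
    0x2F: "SError interrupt",
}


def decode_esr(esr: int) -> dict:
    """Split a 32-bit ESR value into EC, IL and ISS fields."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fields = {
        "ec": ec,
        "ec_name": EXCEPTION_CLASSES.get(ec, "not in this table"),
        "il": il,
        "iss": iss,
    }
    if ec == 0x2F:
        # For SError, ISS bit 24 (IDS) set means the remaining
        # syndrome bits are implementation defined.
        fields["ids"] = (iss >> 24) & 0x1
    return fields


if __name__ == "__main__":
    for key, value in decode_esr(0xBF000002).items():
        print(key, hex(value) if isinstance(value, int) else value)
```

For 0xbf000002 this yields exception class 0x2f (SError interrupt) with the IDS bit set, so the rest of the syndrome is implementation defined and the value alone does not say which access failed.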

bhuism commented 1 year ago

Raspberry Pi OS Lite boots just fine in the node3 slot. I've swapped the CM4s and daughterboards around for now to get u-boot to work.

wenyi0421 commented 1 year ago

Maybe because node3 is connected to SATA, U-Boot tries to boot from SATA but fails. This may require modifying the U-Boot boot order in the flashed OS.

bhuism commented 1 year ago

I'm afraid so

Daedaluz commented 1 year ago

I gave this an attempt by building a custom u-boot image with a modified boot order and patching the pre-built talos image with it. While it managed to get into grub, it failed silently and reset anyway. I didn't find any obvious issues with the grub configuration, but I might look closer into it another day.

Maybe grub also probes SATA and fails?

bhuism commented 1 year ago

Interesting at least, @Daedaluz. Thanks for looking into it.

jlec commented 1 year ago

Let me know if there is something to test. Happy to tinker around as well

Daedaluz commented 1 year ago

I poked around a little bit again, and it looks like grub actually loads the kernel / initramfs properly, but the kernel itself makes it reset. I can manually load the kernel and the initramfs without issues; the reset happens after the boot command.

I kind of expected something from the kernel, like an oops or a panic, but the only thing I get after the boot command is:

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

Ideas?

bhuism commented 1 year ago

Thanks @Daedaluz, this is too hardcore kernel work for me. By the way, did you know that slot/node 2 boots fine with a SATA mini PCIe card inserted, with the same chipset as on the tpiv2 board, namely the ASMedia ASM1061? Maybe this helps.

jlec commented 1 year ago

> Thanks @Daedaluz, this is too hardcore kernel work for me. By the way, did you know that slot/node 2 boots fine with a SATA mini PCIe card inserted, with the same chipset as on the tpiv2 board, namely the ASMedia ASM1061? Maybe this helps.

Different for me, doesn't boot here.

jlec commented 1 year ago

> I poked around a little bit again, and it looks like grub actually loads the kernel / initramfs properly, but the kernel itself makes it reset. I can manually load the kernel and the initramfs without issues; the reset happens after the boot command.
>
> I kind of expected something from the kernel, like an oops or a panic, but the only thing I get after the boot command is:
>
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services...
>
> Ideas?

Not many. Is there something similar on a non-U-Boot system we can compare to? Maybe also report it upstream to Talos or U-Boot, or both.

bhuism commented 1 year ago

Obviously, when the RK1 comes out, we definitely need talos to boot on it.

Daedaluz commented 1 year ago

obviously!

bhuism commented 1 year ago

Working on this a little bit, following: https://github.com/u-boot/u-boot/blob/master/doc/develop/crash_dumps.rst

I get:

$ echo 'Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf)' |   CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64 scripts/decodecode
Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf)
All code
========
   0:   350001f3    cbnz    w19, 0x3c
   4:   f94017e0    ldr x0, [sp, #40]
   8:   39400000    ldrb    w0, [x0]
   c:   92401c00    and x0, x0, #0xff
  10:*  d5033fbf    dmb sy      <-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:   d5033fbf    dmb sy

SheGe commented 1 year ago

I hit the same issue while working on an Alpine-based image booted by U-Boot. In my opinion the problem is related to U-Boot and the SATA controller. Node3 is not working because it has the native SATA controller connected. Both Node1 and Node2 work, but when an mPCIe SATA controller is connected, those nodes behave the same as Node3.

bhuism commented 1 year ago

@SheGe My experience with talos was not the same: I booted talos fine with a SATA controller in the mPCIe slot of node2 (the same chip as on the tpiv2 board on node 3). Go figure.

bhuism commented 1 year ago

This issue is also reported at Sidero here: https://github.com/siderolabs/talos/issues/7358

bhuism commented 1 year ago

I made a custom rpi image with a logging/trace-enabled u-boot, see the attached log:

ubootlog.txt

bhuism commented 1 year ago

new log and map:

u-boot.txt ubootlogv2.log

bhuism commented 1 year ago

I've got talos booted on node3 with a workaround, a custom u-boot.bin (and thus talos image) was needed, it's a hack though.

maxromanovsky commented 1 year ago

@bhuism I'm facing this same issue with eMMC CM4s, but also in slots 1 & 2 (as I have mini PCIe SATA cards installed there).

How did you get UART working there? Did it work out of the box? Or did you tinker with BOOT_UART in EEPROM, config.txt or something similar in the image flashed to eMMC?

In my case (RPI debug probe @ 115200) output is empty.

bhuism commented 1 year ago

@maxromanovsky UARTs work out of the box. Come to the Discord chat and search for serial debug; you can easily get a serial console to any node from the BMC command line.

maxromanovsky commented 1 year ago

@bhuism thanks! What command do you use? I tried the following one on BMC, and the output is always empty:

# tpi --uart=get -n 1
{
    "response": [{
            "uart": ""
        }]
}
#

bhuism commented 1 year ago

@maxromanovsky The serial ports of all 4 CM4s are connected to serial devices on the BMC. I've written something up here: https://github.com/bhuism/talos-tpi2#hardwired-bmc-serial-port-connections-to-nodes. You can use microcom or picocom on the BMC.

CFSworks commented 1 year ago

> I've got talos booted on node3 with a workaround, a custom u-boot.bin (and thus talos image) was needed, it's a hack though.

Since you're already set up to build custom u-boot.bins, could you revert your workaround, confirm the problem still occurs, and then try this patch? 0001-pci-pcie-brcmstb-do-not-rely-on-CLKREQ-signal.patch

If it works for you, you may provide (at your option) an Acked-by:/Reported-by:/Tested-by: that I will use to credit you on the patch when I submit it upstream, if you'd like.

bhuism commented 1 year ago

@CFSworks will do asap

bhuism commented 1 year ago

@CFSworks it does get past u-boot now and into grub, but gets stuck in booting the kernel:

Booting `A - Talos v1.4.7-dirty'

EFI stub: Booting Linux Kernel...
RPI Compute Module 4 (0xc03141)
PCIe BRCM: link up, 5.0 Gbps x1 (SSC)
PCI: Failed autoconfig bar 10
PCI: Failed autoconfig bar 14
PCI: Failed autoconfig bar 18
PCI: Failed autoconfig bar 1c
PCI: Failed autoconfig bar 20

After this it boot loops.

This log is not clean, by the way: I use picocom from the BMC (I ssh into the BMC) and the lines that come back often look garbled.

I tried the u-boot development branch and the exact u-boot version (2023.1) that talos 1.4.7 is using, including their patches; both give the same result. I was using the development branch for my own patch.

(This talos image with your patch boots fine on a normal rpi4b, by the way.)

CFSworks commented 1 year ago

The pasted output is all (presumably) normal output from U-Boot.

Could you log into the BMC and have this running: microcom -s 115200 /dev/ttyS4 | tee node3.log ...and try a boot? This should capture all of the characters into node3.log, so that later terminal shenanigans don't overwrite earlier output.
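
Once a raw capture like this exists, terminal control sequences can be stripped offline to make it easier to read. The following is a minimal sketch, assuming the capture is a plain byte stream such as the node3.log produced by the command above. `clean_serial_log` is a hypothetical helper name, and the regular expression only covers the common ANSI escape forms.

```python
import re

# CSI sequences (ESC [ params intermediates final) and two-byte ESC sequences.
ANSI_ESCAPE = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]|\x1b[@-Z\\-_]")


def clean_serial_log(raw: bytes) -> str:
    """Remove ANSI escapes and stray control bytes from a serial capture."""
    data = ANSI_ESCAPE.sub(b"", raw)
    # Normalise line endings; a lone CR becomes a newline so that text
    # which would have been overwritten on a terminal stays visible.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Drop remaining control bytes except newline and tab.
    data = bytes(b for b in data if b >= 0x20 or b in (0x0A, 0x09))
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = b"\x1b[2JEFI stub: Booting Linux Kernel...\r\nPCI: Failed autoconfig bar 10\r\n"
    print(clean_serial_log(sample), end="")
```

To use it on a real capture, read the file in binary mode, for example `clean_serial_log(open("node3.log", "rb").read())`.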

bhuism commented 1 year ago

@CFSworks here u go

node3.log

CFSworks commented 1 year ago

I just reproduced this boot loop on my own hardware. I'll spend some time today seeing if this new problem is a shortcoming in my U-Boot patch or a problem in a different component of Talos.

CFSworks commented 1 year ago

Editing the GRUB boot entry with e and adding the following to the kernel cmdline: earlycon=pl011,0xfe201000,115200 ...allows kernel boot log output. The kernel is failing to boot, with:

[    2.226887] pci_bus 0000:00: root bus resource [bus 00-ff]
[    2.232470] pci_bus 0000:00: root bus resource [mem 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
[    2.243009] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    2.249199] pci 0000:00:00.0: PME# supported from D0 D3hot
[    2.258616] pci_bus 0000:01: supply vpcie3v3 not found, using dummy regulator
[    2.266062] pci_bus 0000:01: supply vpcie3v3aux not found, using dummy regulator
[    2.273635] pci_bus 0000:01: supply vpcie12v not found, using dummy regulator
[    2.334384] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
[    2.389016] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
[    2.389033] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.1.44-talos #1
[    2.389046] Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2023.07-00970-gd74fa80c0a 07/01/2023
[    2.389053] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    2.389064] pc : pci_generic_config_read+0x64/0xf0
[    2.389093] lr : pci_generic_config_read+0x4c/0xf0
[    2.389107] sp : ffff80000802b7d0
[    2.389111] x29: ffff80000802b7d0 x28: ffff3828409cf800 x27: 0000000000000001
[    2.389128] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[    2.389141] x23: 0000000000000000 x22: 0000000000000000 x21: ffff80000802b854
[    2.389153] x20: 0000000000000004 x19: ffff3828409cf800 x18: 0000000000000000
[    2.389165] x17: 6f74616c75676572 x16: 0000000000000107 x15: 072f075407470720
[    2.389178] x14: 0730072e07350720 x13: 072f075407470720 x12: 0730072e07350720
[    2.389191] x11: 0720072007200720 x10: 0720072007200720 x9 : ffffc99190cc7ebc
[    2.389203] x8 : ffff80000802b578 x7 : 000000000002ffe8 x6 : 00000000000affa8
[    2.389215] x5 : ffffc99190cc7e70 x4 : ffff80000802b854 x3 : 000000000000000b
[    2.389227] x2 : ffff800008bc9000 x1 : 00000000deaddead x0 : ffff800008bc8000
[    2.389242] Kernel panic - not syncing: Asynchronous SError Interrupt
[    2.389247] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.1.44-talos #1
[    2.389256] Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2023.07-00970-gd74fa80c0a 07/01/2023
[    2.389262] Call trace:
[    2.389267]  dump_backtrace.part.0+0xec/0x100
[    2.389282]  show_stack+0x30/0x40
[    2.389291]  dump_stack_lvl+0x64/0x80
[    2.389310]  dump_stack+0x18/0x34
[    2.389321]  panic+0x180/0x35c
[    2.389334]  nmi_panic+0xbc/0xc0
[    2.389345]  arm64_serror_panic+0x78/0x84
[    2.389355]  do_serror+0x30/0x7c
[    2.389365]  el1h_64_error_handler+0x3c/0x70
[    2.389379]  el1h_64_error+0x78/0x7c
[    2.389387]  pci_generic_config_read+0x64/0xf0
[    2.389400]  pci_bus_read_config_dword+0xa0/0x160
[    2.389414]  pci_bus_generic_read_dev_vendor_id+0x40/0x180
[    2.389431]  pci_scan_single_device+0xb4/0x120
[    2.389447]  pci_scan_slot+0x6c/0x200
[    2.389461]  pci_scan_child_bus_extend+0x48/0x240
[    2.389478]  pci_scan_bridge_extend+0x158/0x580
[    2.389494]  pci_scan_child_bus_extend+0xd0/0x240
[    2.389509]  pci_scan_root_bus_bridge+0x6c/0xe0
[    2.389525]  pci_host_probe+0x24/0xd0
[    2.389533]  brcm_pcie_probe+0x258/0x630
[    2.389545]  platform_probe+0x70/0xcc
[    2.389563]  really_probe+0xc8/0x2e4
[    2.389576]  __driver_probe_device+0x80/0x11c
[    2.389589]  driver_probe_device+0x4c/0x120
[    2.389601]  __driver_attach+0xa4/0x170
[    2.389614]  bus_for_each_dev+0x84/0xdc
[    2.389624]  driver_attach+0x34/0x44
[    2.389636]  bus_add_driver+0x15c/0x210
[    2.389648]  driver_register+0x7c/0x13c
[    2.389661]  __platform_driver_register+0x38/0x4c
[    2.389677]  brcm_pcie_driver_init+0x30/0x64
[    2.389689]  do_one_initcall+0x60/0x270
[    2.389700]  kernel_init_freeable+0x478/0x554
[    2.389709]  kernel_init+0x30/0x140
[    2.389723]  ret_from_fork+0x10/0x20
[    2.389738] SMP: stopping secondary CPUs
[    2.389748] Kernel Offset: 0x499188380000 from 0xffff800008000000
[    2.389754] PHYS_OFFSET: 0xffffc7d8c0000000
[    2.389758] CPU features: 0x40000,2013c080,0000421b
[    2.389765] Memory Limit: none

...which is this panic tracked upstream in the Linux kernel bug database.

This might be exacerbated by the timing of U-Boot using the PCIe RC for a while and then shutting it down later in the boot when EFI boot services are exited, but is not itself the fault of U-Boot, so I'm going to send that patch upstream now.
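
For anyone reading a trace like the one above: the frames listed before `el1h_64_error` belong to the panic path itself, and the first frame after it is where the CPU was executing when the SError was taken. A minimal sketch of extracting that frame, assuming the log format shown in this thread; `interrupted_function` is a hypothetical helper, and because SErrors are asynchronous the reported function is only a hint about the cause.

```python
import re

# Matches a call-trace line such as:
#   [    2.389387]  pci_generic_config_read+0x64/0xf0
FRAME = re.compile(r"\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")


def interrupted_function(log_lines):
    """Return the first frame after the el1h_64_error* entry frames, or None."""
    in_trace = False
    seen_entry = False
    for line in log_lines:
        if "Call trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME.search(line)
        if match is None:
            break  # end of the call trace
        name = match.group(1)
        if name.startswith("el1h_64_error"):
            seen_entry = True
        elif seen_entry:
            return name
    return None


if __name__ == "__main__":
    sample = [
        "[    2.389262] Call trace:",
        "[    2.389267]  dump_backtrace.part.0+0xec/0x100",
        "[    2.389365]  el1h_64_error_handler+0x3c/0x70",
        "[    2.389379]  el1h_64_error+0x78/0x7c",
        "[    2.389387]  pci_generic_config_read+0x64/0xf0",
        "[    2.389400]  pci_bus_read_config_dword+0xa0/0x160",
    ]
    print(interrupted_function(sample))
```

On the trace in this thread the result is `pci_generic_config_read`, which matches the `pc` value printed in the register dump.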