Network connection unreliable on nucleo_f767zi

jkrautmacher commented 3 months ago

Describe the bug

While working on a fairly minimal Zephyr firmware for the nucleo_f767zi board, I noticed that the network connection is unreliable. As shown below, this is reproducible with the net/telnet sample.

The bug can be noticed by the following indicators:

ICMP and any other IP-based communication (IPv4 and IPv6) do not work
only the orange LED on the RJ45 connector of the board lights up, not the green one
the stm_eth thread has significantly lower stack usage (see shell output of kernel stacks)
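
The LED indicator above suggests that the link state seen by the host could serve as a second check next to ping. The following is a minimal sketch, assuming a Linux host: it reads the carrier attribute that the Linux kernel exposes under /sys/class/net. The function name link_is_up and the interface name used in the usage note are hypothetical, and whether the LED state on the board corresponds to the host's carrier flag has not been verified here.

```python
from pathlib import Path


def link_is_up(interface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the Linux host reports carrier on the given interface.

    `interface` is the host-side Ethernet interface the board is plugged into
    (the name is setup-specific and an assumption of this sketch).
    """
    carrier = Path(sysfs_root) / interface / "carrier"
    try:
        # The kernel reports "1" when a link is detected and "0" otherwise.
        return carrier.read_text().strip() == "1"
    except OSError:
        # Interface does not exist, or it is administratively down
        # (reading carrier then fails), so treat it as "no link".
        return False
```

For example, calling link_is_up("enp3s0") after each reset (with enp3s0 replaced by the actual interface name) would show whether failed pings coincide with a missing link on the host side.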

The bug appears after roughly 50 to 60 % of the boot processes (see the measured success rates below). So far, network communication has never been observed to break later if it was operational directly after booting the board.

To Reproduce

Steps to reproduce the behavior:

connect the nucleo_f767zi board with an Ethernet cable to a Linux PC
connect the PC also via USB to the onboard ST-Link
configure the static IPv4 address 192.0.2.2/24 on the PC
install the st-info tool (see e.g. source repo)
build and flash samples/net/telnet as described in the Getting Started Guide
execute the Python script below without any arguments (sorry, GitHub did not allow me to attach a Python file)

#!/usr/bin/python3

import argparse
import subprocess
import time

HELP = """This script can be used to debug an issue where the Zephyr-based
firmware of a target microcontroller is not able to communicate via the network
after roughly every second boot process. The Ethernet-capable microcontroller
has to be connected directly to the Linux-based host PC via an Ethernet cable
and an ST-Link. Reboots are triggered via st-info from the open source stlink
tools (https://github.com/stlink-org/stlink). Connectivity is checked with ping."""

def main():
    args = parse_args()

    ok = 0
    not_ok = 0

    for i in range(args.iterations):
        print(f"Iteration #{i+1}")
        subprocess.run(["st-info", "--connect-under-reset"], check=True)
        time.sleep(args.delay)
        try:
            subprocess.run(
                ["ping", "-c", "1", args.address], check=True, stdout=subprocess.DEVNULL
            )
            print("ok")
            ok += 1
        except subprocess.CalledProcessError:  # ping exited non-zero: no reply
            print("not ok")
            not_ok += 1

    print(
        f"Success rate is {round(100 * ok / (ok + not_ok))} % ({ok} ok and {not_ok} failed)"
    )

def parse_args():
    parser = argparse.ArgumentParser(description=HELP)

    parser.add_argument(
        "-i",
        "--iterations",
        default=100,
        help="how often the test should be executed",
        type=int,
    )

    parser.add_argument(
        "-d",
        "--delay",
        default=10,
        help="how long to wait after reset for the ICMP request",
        type=int,
    )

    parser.add_argument(
        "-a",
        "--address",
        default="192.0.2.1",
        help="address of the microcontroller / target of ICMP request",
        type=str,
    )

    return parser.parse_args()

if __name__ == "__main__":
    main()

Expected behavior

The script is expected to report a 100 % success rate.

Impact

This bug is a showstopper. A firmware with such an unreliable network connection is useless.

Logs and console output

The script summarizes a test with 100 iterations on my setup with:

Success rate is 39 % (39 ok and 61 failed) for zephyr v3.6.0
Success rate is 44 % (44 ok and 56 failed) for zephyr v3.7.0
Success rate is 49 % (49 ok and 51 failed) for zephyr v3.7.0 with external power supply (12 V via VIN)
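
To judge whether the three success rates above really differ from each other, a confidence interval per run helps. The following is a minimal sketch that computes a 95 % Wilson score interval for each run of 100 iterations; the function name wilson_interval and the choice of the Wilson method are assumptions of this sketch, not part of the original test script.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a confidence level of roughly 95 %.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Successful pings out of 100 iterations, taken from the results listed above.
runs = {
    "v3.6.0": 39,
    "v3.7.0": 44,
    "v3.7.0, external supply": 49,
}

for label, ok in runs.items():
    low, high = wilson_interval(ok, 100)
    print(f"{label}: {ok} % observed, interval {100 * low:.1f} % to {100 * high:.1f} %")
```

With only 100 iterations per run the intervals overlap, so these numbers alone do not show that the Zephyr version or the power supply changes the failure rate.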

Environment (please complete the following information):

OS: Arch Linux
Toolchain: Zephyr SDK 0.16.5
Zephyr version: v3.7.0 and v3.6.0

jkrautmacher commented 3 months ago

Updated bug report after it turned out that Zephyr v3.6.0 is affected too.

marwaiehm-st commented 2 months ago

The bug is reproducible on nucleo_f767zi but not on stm32f769i_disco, so I tried to compare the two:

[ ] I verified the Device Tree configuration of the Ethernet MAC; it is the same for both nucleo_f767zi and stm32f769i_disco:

        mac: ethernet@40028000 {
            compatible = "st,stm32-ethernet";
            reg = <0x40028000 0x8000>;
            interrupts = <61 0>;
            clock-names = "stmmaceth", "mac-clk-tx",
                      "mac-clk-rx", "mac-clk-ptp";
            clocks = <&rcc STM32_CLOCK_BUS_AHB1 0x02000000>,
                 <&rcc STM32_CLOCK_BUS_AHB1 0x04000000>,
                 <&rcc STM32_CLOCK_BUS_AHB1 0x08000000>,
                 <&rcc STM32_CLOCK_BUS_AHB1 0x10000000>;
            status = "disabled";
        };

[ ] I verified the clock settings for the Ethernet MAC, and they are correctly configured.

The difference:

The STM32F769I-DISCO board includes additional components for PoE, such as the PM8800A PoE controller, transformers, and various passive components.
The NUCLEO-F767ZI board lacks these components, which might affect the stability and performance of the Ethernet connection.

Impact on Ethernet:

The PoE circuit on the STM32F769I-DISCO board provides a stable power supply to the Ethernet PHY, which can improve the reliability of the Ethernet connection.
The NUCLEO-F767ZI board relies on a different power supply configuration, which might be less stable, leading to the observed success rate of about 40 %.

FRASTM commented 2 months ago

@jkrautmacher can you please confirm that?

jkrautmacher commented 2 months ago

I would be glad to help but am unsure what to confirm. As far as I understand, the current theory is that the power supply of the PHY on nucleo_f767zi might not be stable enough during init, so that initialization fails.

If this is correct, my next debugging step would be to connect an oscilloscope to the supply voltage of the PHY and observe it during init. Together with a debug GPIO that the kernel code toggles right before and after Ethernet init, this should confirm or refute the theory. The next step would then be to either fix the hardware design or add a workaround to the Zephyr kernel (longer delays or similar).

I am lacking two things to verify the theory:

an oscilloscope at home
detailed schematics of nucleo_f767zi

Is there more public information about the board than UM1974 Rev 10 that I may have overlooked? I could probably improve the oscilloscope situation, but it would likely be much faster if you could do the measurement at ST, if that is an option.

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

jkrautmacher commented 5 days ago

Since it was easy for me to test, I checked how the board / firmware behaves with an external power supply instead of powering it via USB. I moved the jumper on JP3 from U5V to VIN-5V and provided 12 VDC to VIN and GND on CN8.

Result is: Success rate is 39 % (39 ok and 61 failed)

So this did not fix it. Updated the initial bug report accordingly.

zephyrproject-rtos / zephyr

Network connection unreliable on nucleo_f767zi #77794