zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Network connection unreliable on nucleo_f767zi #77794

Open jkrautmacher opened 3 months ago

jkrautmacher commented 3 months ago

Describe the bug

While working on a minimal Zephyr firmware for the nucleo_f767zi board, I noticed that the network connection is unreliable. As shown below, this is reproducible with the net/telnet sample.

The bug can be noticed by the following indicators:

The bug appears in roughly 40 % of boot processes. So far, network communication has never been observed to break later if it was operational directly after booting the board.
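The claim that a working connection stays working could be checked per boot by probing repeatedly over a time window instead of sending a single ping after a fixed delay. The following is only a sketch under assumptions: the names ping_once, probe_window and classify are hypothetical and not part of the reproduction script, and the -W timeout flag assumes the iputils ping found on most Linux systems.

```python
import subprocess
import time


def ping_once(address):
    """Return True if a single ICMP echo request to `address` is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def probe_window(ping, attempts=30, interval=1.0, sleep=time.sleep):
    """Call `ping()` `attempts` times, pausing `interval` seconds after each call."""
    results = []
    for _ in range(attempts):
        results.append(bool(ping()))
        sleep(interval)
    return results


def classify(results):
    """Summarize one boot as 'never up', 'stable', 'late' or 'dropped'."""
    if not any(results):
        return "never up"
    first_up = results.index(True)
    if not all(results[first_up:]):
        # The link answered at least once and then stopped answering.
        return "dropped"
    return "stable" if first_up == 0 else "late"
```

After each reset, something like classify(probe_window(lambda: ping_once("192.0.2.1"))) would then separate boots that never come up from boots that come up late or drop out again.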

To Reproduce

Steps to reproduce the behavior:

  1. connect the nucleo_f767zi board with an Ethernet cable to a Linux PC
  2. connect the PC also via USB to the onboard ST-Link
  3. configure the static IPv4 address 192.0.2.2/24 on the PC
  4. install the st-info tool (see e.g. source repo)
  5. build and flash samples/net/telnet as described in the Getting Started Guide
  6. execute the Python script below without any arguments (sorry, GitHub did not allow me to attach a Python file)
#!/usr/bin/python3

import argparse
import subprocess
import time

HELP = """This script can be used to debug an issue where the Zephyr-based
firmware of a target microcontroller is not able to communicate via the network
after roughly every second boot process. The Ethernet-capable microcontroller
has to be connected directly to the Linux-based host PC via an Ethernet cable
and an ST-Link. Reboots are triggered via the st-info tool from the open source
stlink project (https://github.com/stlink-org/stlink). Connectivity is checked
with ping."""

def main():
    args = parse_args()

    ok = 0
    not_ok = 0

    for i in range(args.iterations):
        print(f"Iteration #{i+1}")
        subprocess.run(["st-info", "--connect-under-reset"], check=True)
        time.sleep(args.delay)
        try:
            subprocess.run(
                ["ping", "-c", "1", args.address], check=True, stdout=subprocess.DEVNULL
            )
            print("ok")
            ok += 1
        except subprocess.CalledProcessError:
            # ping exited non-zero, i.e. no reply; Ctrl-C is no longer swallowed
            print("not ok")
            not_ok += 1

    print(
        f"Success rate is {round(100 * ok / (ok + not_ok))} % ({ok} ok and {not_ok} failed)"
    )

def parse_args():
    parser = argparse.ArgumentParser(description=HELP)

    parser.add_argument(
        "-i",
        "--iterations",
        default=100,
        help="how often the test should be executed",
        type=int,
    )

    parser.add_argument(
        "-d",
        "--delay",
        default=10,
        help="how long to wait after reset for the ICMP request",
        type=int,
    )

    parser.add_argument(
        "-a",
        "--address",
        default="192.0.2.1",
        help="address of the microcontroller / target of ICMP request",
        type=str,
    )

    return parser.parse_args()

if __name__ == "__main__":
    main()

Expected behavior

The script should report a 100 % success rate.

Impact

This bug is a showstopper. A firmware with such an unreliable network connection is useless.

Logs and console output

The script summarizes a test with 100 iterations on my setup with:

Environment (please complete the following information):

jkrautmacher commented 3 months ago

Updated bug report after it turned out that Zephyr v3.6.0 is affected too.

marwaiehm-st commented 2 months ago

The bug is reproduced on nucleo_f767zi but not on stm32f769i_disco, so I tried to compare the two:

        mac: ethernet@40028000 {
            compatible = "st,stm32-ethernet";
            reg = <0x40028000 0x8000>;
            interrupts = <61 0>;
            clock-names = "stmmaceth", "mac-clk-tx",
                      "mac-clk-rx", "mac-clk-ptp";
            clocks = <&rcc STM32_CLOCK_BUS_AHB1 0x02000000>,
                 <&rcc STM32_CLOCK_BUS_AHB1 0x04000000>,
                 <&rcc STM32_CLOCK_BUS_AHB1 0x08000000>,
                 <&rcc STM32_CLOCK_BUS_AHB1 0x10000000>;
            status = "disabled";
        };

The difference:

Impact on Ethernet:

FRASTM commented 2 months ago

@jkrautmacher can you please confirm that?

jkrautmacher commented 2 months ago

I would be glad to help, but I am unsure what to confirm. As far as I understand, the current theory is that the power supply of the PHY on nucleo_f767zi might not be stable enough during init, so that initialization fails.

If this is correct, my next debugging step would be to connect an oscilloscope to the supply voltage of the PHY and observe it during init. That, together with a debug GPIO that the kernel code toggles right before and after Ethernet init, should confirm or refute this theory. The next step would be to either fix the hardware design or add a workaround to the Zephyr kernel (longer delays or similar).

I am lacking two things to verify the theory:

Is there maybe more public information about the board than in UM1974 Rev 10 that I overlooked? I could maybe improve the oscilloscope situation, but it would likely be much faster if you could do that at ST, if this is an option.

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

jkrautmacher commented 5 days ago

Since it was easy for me to test, I checked how the board / firmware behaves with an external power supply instead of USB power. I moved the jumper on JP3 from U5V to VIN-5V and provided 12 VDC to VIN and GND on CN8.

Result is: Success rate is 39 % (39 ok and 61 failed)

So this did not fix it. Updated the initial bug report accordingly.
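To judge whether a later run (for example after a hardware or firmware change) really differs from a result like 39 ok out of 100, it may help to put a confidence interval on the success rate. The following is a minimal sketch using the Wilson score interval; wilson_interval is a hypothetical helper, not part of the reproduction script, and z=1.96 assumes a 95 % confidence level.

```python
import math


def wilson_interval(ok, total, z=1.96):
    """Return the (low, high) Wilson score interval for ok/total successes."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = ok / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


low, high = wilson_interval(39, 100)
print(f"39/100 ok: 95 % interval {100 * low:.0f} % to {100 * high:.0f} %")
```

With 100 iterations the interval spans roughly 30 % to 49 %, so two runs whose success rates differ by only a few percentage points should not be read as a real change.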