zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

STM32 Ethernet stops receiving under heavy load #79066

Open biglben opened 1 month ago

biglben commented 1 month ago

Describe the bug
On the STM32H7, heavy incoming traffic combined with a busy application can put the Ethernet peripheral into a state where it no longer receives data but still transmits. The Ethernet receive DMA channel gets stuck in the suspended state (visible in the RPS0 field of the ETH_DMADSR register).

I found a fix in the stm32h7xx_hal_driver repository that addresses this issue by setting the tail pointer correctly (commit ceda3ce). With this fix applied, I tested various burst patterns, and Ethernet remained stable.

I am raising this issue to point out that the STM32 HAL needs to be updated, and to make sure that all affected STM32 series (likely H7 and H5) receive the same fix. Sharing this information may save others considerable debugging time (it took me about 2 days).
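
The suspended state mentioned above can be checked on the host by decoding a raw ETH_DMADSR value, for example one read out with a debugger. The following is a minimal sketch, not taken from the HAL: the RPS0 bit position (bits 11:8) and the state encodings are assumptions drawn from the STM32H7 reference manual and should be verified for your part, and `decode_rps0` and `rx_dma_suspended` are hypothetical helper names.

```python
# Assumed layout of ETH_DMADSR (verify against the reference manual):
# RPS0, the receive process state of DMA channel 0, occupies bits 11:8.
RPS0_SHIFT = 8
RPS0_MASK = 0xF

# Assumed encodings for the states of interest; other values are shown raw.
RPS0_SUSPENDED = 0x4
RPS0_STATES = {
    0x0: "stopped",
    0x3: "running (waiting for packet)",
    RPS0_SUSPENDED: "suspended (RX descriptor unavailable)",
}


def rps0_field(dmadsr: int) -> int:
    """Extract the RPS0 field from a raw ETH_DMADSR value."""
    return (dmadsr >> RPS0_SHIFT) & RPS0_MASK


def decode_rps0(dmadsr: int) -> str:
    """Return a readable name for the RPS0 field of a raw ETH_DMADSR value."""
    field = rps0_field(dmadsr)
    return RPS0_STATES.get(field, f"other (0x{field:X})")


def rx_dma_suspended(dmadsr: int) -> bool:
    """True when RPS0 reports the suspended state described in this issue."""
    return rps0_field(dmadsr) == RPS0_SUSPENDED
```

If the receive channel still reports the suspended state well after the incoming traffic has stopped, that would match the stuck condition described here.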

To Reproduce
I reproduced the bug by applying the following patch, which simulates the application performing other tasks or being blocked:

diff --git a/samples/net/sockets/echo_server/src/udp.c b/samples/net/sockets/echo_server/src/udp.c
index 6847ebd3eb6..222db5ea8d8 100644
--- a/samples/net/sockets/echo_server/src/udp.c
+++ b/samples/net/sockets/echo_server/src/udp.c
@@ -119,6 +119,7 @@ static int process_udp(struct data *data)
                received = recvfrom(data->udp.sock, data->udp.recv_buffer,
                                    sizeof(data->udp.recv_buffer), 0,
                                    &client_addr, &client_addr_len);
+               k_sleep(K_MSEC(110));

                if (received < 0) {
                        /* Socket error */

I built the sample using:

west build -p -b nucleo_h743zi/stm32h743xx zephyr/samples/net/sockets/echo_server

Once the target was ready to receive data, I ran this script:

import socket
import time
import random
import argparse

def send_udp_flood(target_ip, target_port, packet_size, bursts, delay):
    # Create a UDP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Generate random payload
    payload = bytes(random.getrandbits(8) for _ in range(packet_size))

    try:
        while True:
            for _ in range(bursts):
                # Send packet
                sock.sendto(payload, (target_ip, target_port))

            # Delay between bursts
            time.sleep(delay)
    except KeyboardInterrupt:
        print("Flooding stopped.")
    finally:
        sock.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="UDP flood script.")
    parser.add_argument("ip", type=str, help="Target IP address.")
    parser.add_argument("port", type=int, help="Target port number.")
    parser.add_argument("packet_size", type=int, help="Size of each UDP packet in bytes.")
    parser.add_argument("bursts", type=int, help="Number of packets to send in each burst.")
    parser.add_argument("delay", type=float, help="Delay time between bursts in seconds.")

    args = parser.parse_args()

    send_udp_flood(args.ip, args.port, args.packet_size, args.bursts, args.delay)

I invoked it with:

python udp_flood.py 192.0.2.1 4242 143 1000 0.1
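
A rough calculation shows why these parameters overwhelm the patched sample. The sketch below assumes the sender spends negligible time inside each burst (so the offered rate is an upper bound) and that the patched loop handles at most one datagram per 110 ms sleep; the function names are illustrative only.

```python
def offered_rate_pps(bursts: int, delay_s: float) -> float:
    """Upper bound on packets per second sent by the flood script."""
    return bursts / delay_s


def consumed_rate_pps(sleep_ms: float) -> float:
    """Upper bound on packets per second handled by the patched echo loop."""
    return 1000.0 / sleep_ms


# Values from the command line above and from the k_sleep() in the patch.
offered = offered_rate_pps(bursts=1000, delay_s=0.1)
consumed = consumed_rate_pps(sleep_ms=110)

print(f"offered  <= {offered:.0f} packets/s")
print(f"consumed <= {consumed:.1f} packets/s")
```

With roughly 10,000 packets per second offered against about 9 per second consumed, the RX buffers are exhausted almost immediately, which is the condition under which the receive DMA channel gets stuck.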

Expected behavior
Under heavy incoming traffic, the STM32 Ethernet should simply drop some packets and keep operating without interruption.

Impact
None, as I have forked the STM32 HAL module and applied commit ceda3ce.

Logs and console output

[00:00:03.653,000] <inf> net_echo_server_sample: Network disconnected
[00:00:03.915,000] <inf> net_echo_server_sample: Network connected
[00:00:04.015,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:04.015,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:07.035,000] <err> eth_stm32_hal: Failed to obtain RX buffer

After this point, nothing is received anymore.
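
One way to confirm from the host that reception has stalled (rather than packets merely being dropped) is to stop the flood and send a few probes to the echo server. This is a minimal sketch under stated assumptions: it is IPv4 only, `echo_alive` is a hypothetical helper, and the timeout and attempt count are arbitrary choices.

```python
import socket


def echo_alive(host: str, port: int, timeout_s: float = 1.0, attempts: int = 3) -> bool:
    """Send small UDP probes and report whether any echo comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        for i in range(attempts):
            probe = f"probe-{i}".encode()
            try:
                sock.sendto(probe, (host, port))
                reply, _ = sock.recvfrom(2048)
            except OSError:
                # Covers timeouts and errors such as "port unreachable".
                continue
            # Accept a late echo of an earlier probe as proof of life too.
            if reply.startswith(b"probe-"):
                return True
    return False
```

For the setup in this report it could be called as `echo_alive("192.0.2.1", 4242)` after the flood has been stopped. A working target should answer once its backlog has drained, whereas a target in the stuck state never does.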

Environment:

kevinior commented 1 month ago

We're also seeing this, for example when there is incoming network traffic while mcumgr is erasing internal flash pages (which causes program execution to stop).

After the "Failed to obtain RX buffer" message we can't communicate on the network any more.

Thanks @biglben, you've saved me a lot of debugging time.

kevinior commented 1 month ago

I can confirm that the fix suggested by @biglben works for us too.

marwaiehm-st commented 1 month ago

I have tested the modification described in the commit and can confirm that it works as expected. Thank you @biglben

marwaiehm-st commented 3 weeks ago

Hi @biglben Don't hesitate to open a PR containing the fix.

biglben commented 3 weeks ago

Hi @marwaiehm-st, I can open a PR with the fix for STM32H7, but I am not sure which other series have the same issue (I assume H5 too, but I cannot test it). I am also not sure whether this fix should instead be included in a HAL update. There are more fixes in the stm32h7xx-hal-driver repo that are not included in the Zephyr fork.

erwango commented 2 weeks ago

@biglben Sure, you can open a PR with a commit cherry-picked from the STM32H7 HAL. See https://github.com/zephyrproject-rtos/hal_stm32/pull/226 as an example.