zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

STM32 Ethernet stops receiving under heavy load #79066

Open biglben opened 1 week ago

biglben commented 1 week ago

Describe the bug

On the STM32H7, high incoming traffic combined with a busy application can put the Ethernet peripheral into a state where it no longer receives data but continues to transmit. The Ethernet receive DMA channel becomes stuck in the suspended state (visible in the RPS0 field of the ETH_DMADSR register).

I identified a fix in the stm32h7xx_hal_driver repository that addresses this issue by correctly setting the tail pointer (commit ceda3ce). With this fix applied, I tested various burst patterns, and Ethernet remained stable. I am raising this issue to highlight that the STM32 HAL needs to be updated and to ensure that other STM32 series (likely H7 and H5) receive the same fix. Sharing this information may save others considerable debugging time (it took me about 2 days).
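To check for the stuck state from a register dump, the RPS0 field can be decoded from a raw ETH_DMADSR value, for example one read with a debugger. The following is a minimal sketch, not part of the original report. The field position (bits 11:8) and the state encodings are assumptions recalled from the STM32H7 reference manual and should be verified against the documentation for your part. `decode_rps0` is a hypothetical helper name.

```python
# ASSUMPTION: RPS0 occupies bits 11:8 of ETH_DMADSR (verify in the reference manual).
RPS0_SHIFT = 8
RPS0_MASK = 0xF

# ASSUMPTION: state encodings as recalled from the STM32H7 reference manual.
RPS0_STATES = {
    0x0: "stopped",
    0x1: "running (fetching Rx descriptor)",
    0x3: "running (waiting for Rx packet)",
    0x4: "suspended (Rx descriptor unavailable)",
    0x5: "running (closing Rx descriptor)",
    0x6: "timestamp write",
    0x7: "running (transferring packet data to memory)",
}


def decode_rps0(dmadsr_value):
    """Return a human-readable name for the RPS0 field of a raw ETH_DMADSR value."""
    field = (dmadsr_value >> RPS0_SHIFT) & RPS0_MASK
    return RPS0_STATES.get(field, f"reserved (0x{field:X})")


if __name__ == "__main__":
    # Example raw value with only the assumed "suspended" encoding set.
    print(decode_rps0(0x00000400))
```

If the decoded state stays "suspended" while traffic is still arriving, that matches the symptom described above.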

To Reproduce

I reproduced the bug by applying the following patch, which simulates the application performing other tasks or being blocked:

diff --git a/samples/net/sockets/echo_server/src/udp.c b/samples/net/sockets/echo_server/src/udp.c
index 6847ebd3eb6..222db5ea8d8 100644
--- a/samples/net/sockets/echo_server/src/udp.c
+++ b/samples/net/sockets/echo_server/src/udp.c
@@ -119,6 +119,7 @@ static int process_udp(struct data *data)
                received = recvfrom(data->udp.sock, data->udp.recv_buffer,
                                    sizeof(data->udp.recv_buffer), 0,
                                    &client_addr, &client_addr_len);
+               k_sleep(K_MSEC(110));

                if (received < 0) {
                        /* Socket error */

I built the sample with:

west build -p -b nucleo_h743zi/stm32h743xx zephyr/samples/net/sockets/echo_server

After the target was ready to receive data, I ran this script:

import socket
import time
import random
import argparse

def send_udp_flood(target_ip, target_port, packet_size, bursts, delay):
    # Create a UDP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Generate random payload
    payload = bytes(random.getrandbits(8) for _ in range(packet_size))

    try:
        while True:
            for _ in range(bursts):
                # Send packet
                sock.sendto(payload, (target_ip, target_port))

            # Delay between bursts
            time.sleep(delay)
    except KeyboardInterrupt:
        print("Flooding stopped.")
    finally:
        sock.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="UDP flood script.")
    parser.add_argument("ip", type=str, help="Target IP address.")
    parser.add_argument("port", type=int, help="Target port number.")
    parser.add_argument("packet_size", type=int, help="Size of each UDP packet in bytes.")
    parser.add_argument("bursts", type=int, help="Number of packets to send in each burst.")
    parser.add_argument("delay", type=float, help="Delay time between bursts in seconds.")

    args = parser.parse_args()

    send_udp_flood(args.ip, args.port, args.packet_size, args.bursts, args.delay)

I invoked it with:

python udp_flood.py 192.0.2.1 4242 143 1000 0.1
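To see why these arguments exhaust the receive buffers, it helps to compare the offered packet rate with the rate the patched sample can drain. The sketch below is a rough upper-bound estimate, not a measurement. It ignores the time the sender spends in `sendto` and assumes the patched loop handles one datagram per 110 ms sleep. `load_estimate` is a hypothetical helper name.

```python
def load_estimate(packet_size, bursts, delay, app_sleep):
    """Return (offered packets/s, offered payload Mbit/s, served packets/s).

    All three values are upper bounds: sender-side send time and
    target-side processing time are ignored.
    """
    offered_pps = bursts / delay
    offered_mbps = offered_pps * packet_size * 8 / 1e6  # UDP payload only
    served_pps = 1 / app_sleep  # one recvfrom() per sleep in the patched loop
    return offered_pps, offered_mbps, served_pps


if __name__ == "__main__":
    # Values taken from the command line and the k_sleep(K_MSEC(110)) patch.
    offered, mbps, served = load_estimate(143, 1000, 0.1, 0.110)
    print(f"offered: <= {offered:.0f} packets/s ({mbps:.2f} Mbit/s payload)")
    print(f"served:  <= {served:.1f} packets/s")
```

Under these assumptions the sender offers up to about 10,000 packets/s while the application drains roughly 9 packets/s, so packet loss is expected. A permanent receive stall is not.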

Expected behavior

Under heavy incoming traffic, the STM32 Ethernet driver should simply drop some packets but continue operating without interruption.

Impact

None, as I have forked the STM32 HAL module and applied commit ceda3ce.

Logs and console output

[00:00:03.653,000] <inf> net_echo_server_sample: Network disconnected
[00:00:03.915,000] <inf> net_echo_server_sample: Network connected
[00:00:04.015,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:04.015,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:07.035,000] <err> eth_stm32_hal: Failed to obtain RX buffer

After this point, nothing is received anymore.
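One way to confirm the stall from the host side is to stop the flood and probe the echo server with a single datagram. The following is a minimal sketch, assuming the target runs the echo_server sample and returns UDP payloads unchanged. `echo_alive` is a hypothetical helper name, and the one-second default timeout is an arbitrary choice.

```python
import socket


def echo_alive(target_ip, target_port, timeout=1.0, payload=b"ping"):
    """Return True if the target echoes one UDP datagram within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (target_ip, target_port))
            data, _ = sock.recvfrom(len(payload) + 64)
        except OSError:
            # Covers timeouts as well as unreachable/reset errors.
            return False
        return data == payload
```

For example, `echo_alive("192.0.2.1", 4242)` would be expected to return False once the receive path is stuck, and True again after a reset or with the fix applied.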

Environment:

kevinior commented 3 days ago

We're also seeing this, for example when there's incoming network traffic while mcumgr is erasing internal flash pages (which causes program execution to stop).

After the "Failed to obtain RX buffer" message we can't communicate on the network any more.

Thanks @biglben, you've saved me a lot of debugging time.

kevinior commented 3 days ago

I can confirm that the fix suggested by @biglben works for us too.