zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.98k stars 6.69k forks source link

eth_stm32_hal: bus error after network disconnect and connect #79037

Closed kevinior closed 3 weeks ago

kevinior commented 2 months ago

This only happens with CONFIG_ETH_STM32_HAL_API_V2. I'll describe what I found in my investigation.

If the network cable is unplugged we usually get the HAL_ETH_TransmitIT tx_int_sem take timeout error message. If we got that error and the network is reconnected we get a bus error.

After a lot of debugging I worked out that when we see tx_int_sem take timeout the error-handling code's called to HAL_ETH_TxFreeCallback releases the allocated buffer(s) but the tx descriptor still has a pointer to the struct eth_stm32_tx_context object that was allocated on eth_tx's stack.

When the network is reconnected there will eventually be a call to HAL_ETH_ReleaseTxPacket, which calls HAL_ETH_TxFreeCallback with the (now invalid) pointer to the tx context. When HAL_ETH_TxFreeCallback tries to dereference that pointer we get a bus error.

I tried calling HAL_ETH_ReleaseTxPacket instead of HAL_ETH_TxFreeCallback in the error-handling code but that doesn't work because the tx packet is still owned by the hardware and can't be released.

This problem reproduces reliably on our hardware (STM32H743), where we have eth_stm32_hal patched to work with a 10base-T1S PHY. That PHY is troublesome in a variety of ways, in particular it doesn't seem to communicate link errors. I can't reproduce it on a Nucleo-H753ZI board.

As a workaround I patched eth_stm32_hal to allocate tx contexts from a static array (similar to how the buffers are allocated), so that they remain valid until the HAL is finished with them.

Originally posted by @kevinior in https://github.com/zephyrproject-rtos/zephyr/issues/75670#issuecomment-2376225970

kevinior commented 1 month ago

@marwaiehm-st Would it help if I created a pull request with my workaround? It might make it clearer what the problem we're seeing is.

marwaiehm-st commented 1 month ago

@marwaiehm-st Would it help if I created a pull request with my workaround? It might make it clearer what the problem we're seeing is.

Yes, absolutely! Creating a pull request with your workaround would be good. It will definitely help clarify the problem. Thank you!

kevinior commented 1 month ago

@marwaiehm-st Here's the pull request: https://github.com/zephyrproject-rtos/zephyr/pull/79693

This is what we're running successfully in our version of eth_stm32_hal.c (which had to be patched to work with a different PHY).