Closed kevinior closed 3 weeks ago
@marwaiehm-st Would it help if I created a pull request with my workaround? It might make it clearer what the problem we're seeing is.
@marwaiehm-st Would it help if I created a pull request with my workaround? It might make it clearer what the problem we're seeing is.
Yes, absolutely! Creating a pull request with your workaround would be good. It will definitely help clarify the problem. Thank you!
@marwaiehm-st Here's the pull request: https://github.com/zephyrproject-rtos/zephyr/pull/79693
This is what we're running successfully in our version of eth_stm32_hal.c (which had to be patched to work with a different PHY).
This only happens with
CONFIG_ETH_STM32_HAL_API_V2
. I'll describe what I found in my investigation.If the network cable is unplugged we usually get the
HAL_ETH_TransmitIT tx_int_sem take timeout
error message. If we got that error and the network is reconnected we get a bus error.After a lot of debugging I worked out that when we see
tx_int_sem take timeout
the error-handling code's called toHAL_ETH_TxFreeCallback
releases the allocated buffer(s) but the tx descriptor still has a pointer to thestruct eth_stm32_tx_context
object that was allocated oneth_tx
's stack.When the network is reconnected there will eventually be a call to
HAL_ETH_ReleaseTxPacket
, which callsHAL_ETH_TxFreeCallback
with the (now invalid) pointer to the tx context. WhenHAL_ETH_TxFreeCallback
tries to dereference that pointer we get a bus error.I tried calling
HAL_ETH_ReleaseTxPacket
instead ofHAL_ETH_TxFreeCallback
in the error-handling code but that doesn't work because the tx packet is still owned by the hardware and can't be released.This problem reproduces reliably on our hardware (STM32H743), where we have eth_stm32_hal patched to work with a 10base-T1S PHY. That PHY is troublesome in a variety of ways, in particular it doesn't seem to communicate link errors. I can't reproduce it on a Nucleo-H753ZI board.
As a workaround I patched eth_stm32_hal to allocate tx contexts from a static array (similar to how the buffers are allocated), so that they remain valid until the HAL is finished with them.
Originally posted by @kevinior in https://github.com/zephyrproject-rtos/zephyr/issues/75670#issuecomment-2376225970