zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10k stars 6.16k forks source link

drivers: uart_sam0: uart_sam0_irq_update does not clear IRQs correctly for most sam0 SOCs #59903

Closed atnon closed 2 months ago

atnon commented 1 year ago

Description Driver implementation assumes that only sam0 SOCs with REV_SERCOM == 0x500 has error interrupt flags, whereas all supported sam0 SOCs except the samd21 actually supports them. This also results in interrupt flags not being cleared correctly in uart_sam0_irq_update.

To Reproduce Should be enough running a MODBUS RTU Server on a physical SERCOM instance, on e.g. a samd21_xpro board or similar non-REV_SERCOM == 0x500 towards an actual RS485 bus via transceiver. I might need to make some minimal example, as the modbus rtu server sample does not support the atsamd21_xpro board. Just a regular UART should work as well given what the failure mechanism seems to be.

Expected behavior During normal operation, I should be able to issue an arbitrary amount of read writes to registers and coils defined on the MODBUS Server running in Zephyr.

Impact I found this while trying to get MODBUS RTU up and running, I saw that I would get a response from the MCU once, after which it would exhibit the following symptoms:

  1. Never deassert the drive enable pin to the RS485 transceiver, effectively locking up the bus.
  2. Stop responding on the shell running over an USB CDC-ACM UART. Stop responding to everything really.
  3. Modbus Master software can read once correctly, subsequent register reads result in timeouts since the MCU is not able to respond and has locked up the RS485 bus electrically.
  4. Resetting the MCU allows for repetition of the issue when issuing a new read.

The firmware simply grinds to a complete halt.

I think the irq-callback loop between the modbus subsystem and the UART driver would just go in circles with the IRQ flags never being cleared, starving the other running threads. I did not see the INTFLAG register deassert any set flags after it had been set while debugging at least.

Doing a hackish edit to the driver uart_sam0.c like below solves the issue for now, but it's not exacly ready for a PR. Simply adding the same code as present witin the SERCOM_REV500 ifdef, but in the else clause:

static int uart_sam0_irq_update(const struct device *dev)
{
    /* Clear sticky interrupts */
    const struct uart_sam0_dev_cfg *config = dev->config;
    SercomUsart * const regs = config->regs;

#if defined(SERCOM_REV500)
    /*
     * Cache the TXC flag, and use this cached value to clear the interrupt
     * if we do not used the cached value, there is a chance TXC will set
     * after caching...this will cause TXC to never cached.
     */
    struct uart_sam0_dev_data *const dev_data = dev->data;

    dev_data->txc_cache = regs->INTFLAG.bit.TXC;
    regs->INTFLAG.reg = SERCOM_USART_INTENCLR_ERROR
              | SERCOM_USART_INTENCLR_RXBRK
              | SERCOM_USART_INTENCLR_CTSIC
              | SERCOM_USART_INTENCLR_RXS
              | (dev_data->txc_cache << SERCOM_USART_INTENCLR_TXC_Pos);
#else
    //regs->INTFLAG.reg = SERCOM_USART_INTENCLR_RXS;
    struct uart_sam0_dev_data *const dev_data = dev->data;
    dev_data->txc_cache = regs->INTFLAG.bit.TXC;
    regs->INTFLAG.reg = SERCOM_USART_INTFLAG_ERROR 
                        | SERCOM_USART_INTFLAG_RXBRK
                        | SERCOM_USART_INTENCLR_CTSIC
                        | SERCOM_USART_INTFLAG_RXS 
                        | (dev_data->txc_cache << SERCOM_USART_INTFLAG_TXC_Pos);
#endif
    return 1;
}

Analysis The uart_sam0 driver has a version check for the SERCOM revision, but only for REV_SERCOM == 0x500: https://github.com/zephyrproject-rtos/zephyr/blob/f6eeb9dc84bec63dea231a49e97992875b659614/drivers/serial/uart_sam0.c#L24-L30

The comment states that the error flag is only present where REV_SERCOM == 0x500, this is unfortunately not true and leads to a majority of sam0 SOCs to not handle clearing of interrupt flags correctly. Actually, the part I'm having issues with does noto have to do much with errors but rather clearing of sticky interrupts. Today, as of 2023-07-01, samd20 is the only sam0 SOC in the zephyrproject-rtos/hal_atmel repo not supporting the error flag. See table below, produced by yours truly looking in ASF headers and family datasheets: Part Has error support ASF define REV_SERCOM Sticky flags
samc20 yes SERCOM_U2201 0x311 ERROR,RXBRK,CTSIC,RXS,TXC
samc20n yes SERCOM_U2201 0x320 ERROR,RXBRK,CTSIC,RXS,TXC
samc21 yes SERCOM_U2201 0x31 ERROR,RXBRK,CTSIC,RXS,TXC
samc21n yes SERCOM_U2201 0x320 ERROR,RXBRK,CTSIC,RXS,TXC
samd20 no SERCOM_U2201 0x102 RXS,TXC
samd21 yes SERCOM_U2201 0x200 ERROR,RXBRK,CTSIC,RXS,TXC
samd51 yes SERCOM_U2201 0x500 ERROR,RXBRK,CTSIC,RXS,TXC
same51 yes SERCOM_U2201 0x500 ERROR,RXBRK,CTSIC,RXS,TXC
same53 yes SERCOM_U2201 0x500 ERROR,RXBRK,CTSIC,RXS,TXC
same54 yes SERCOM_U2201 0x500 ERROR,RXBRK,CTSIC,RXS,TXC
saml21 yes SERCOM_U2201 0x220 ERROR,RXBRK,CTSIC,RXS,TXC
samr21 yes SERCOM_U2201 0x200 ERROR,RXBRK,CTSIC,RXS,TXC
samr34 yes SERCOM_U2201 0x220 ERROR,RXBRK,CTSIC,RXS,TXC
samr35 yes SERCOM_U2201 0x220 ERROR,RXBRK,CTSIC,RXS,TXC

The uart_sam0_irq_update seems to be the only place where interrupt flags can be deasserted, and it would imply that for many of the listed SOCs, the flags would not unset after being set. https://github.com/zephyrproject-rtos/zephyr/blob/f6eeb9dc84bec63dea231a49e97992875b659614/drivers/serial/uart_sam0.c#L904-L928 The datasheet for samd21 does however state that:

An interrupt request is generated when the interrupt flag is set and if the corresponding interrupt is enabled. The interrupt request remains active until either the interrupt flag is cleared, the interrupt is disabled, or the USART is reset. For details on clearing interrupt flags, refer to the INTFLAG register description.

So I guess there could be different behaviours depending on the order of enabling and disabling rx/tx interrupts in e.g. uart_sam0_irq_rx/tx_enable/disable. I'm honestly not sure.

Taking a look at history/blame, there's not any hard evidence to find around uart_sam0_irq_update:

34432 - uart_sam0_irq_update is introduced, with a special case for REV_500

38311 - Minor change in assignment, guessing no effect for uart_sam0_irq_update

55387 - Adds caching of the TXC flag for REV_500

Now, I've only checked for what bits need to be cleared explicitly, but not if there's other functional difference between the REV_SERCOM revisions. It might be that checking if REV_SERCOM >= 0x200, but I have no way of knowing whether there are effects elsewhere directly related to error processing mechanisms.

Nonetheless, it would be nice to unbreak this for at least the samd21 and maybe others with the same REV_SERCOM.

Does anyone know what would be a good way to proceed in unbreaking this? I'm not super familiar with the zephyr codebase, so I don't really have a holistic view with regards to what else might be affected by this.

Environment

github-actions[bot] commented 1 year ago

Hi @atnon! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

nandojve commented 1 year ago

Hi @atnon ,

The easy way to fix this is you open a pull request with your changes. I recommend you take the time and look at Contributing to Zephyr in deep Contribution Guidelines.

Based on your analysis my recommendation is rework the Fixup

-#if defined(SERCOM_U2201) && (REV_SERCOM == 0x500) 
-#define SERCOM_REV500 
-#endif 

in favor of

+#if defined(SERCOM_U102)
+#define SAM0_SERCOM_HAS_ERROR_FLAGS = 0
+#else
+#define SAM0_SERCOM_HAS_ERROR_FLAGS = 1
+#endif 

and replace on code the #if defined(SERCOM_REV500). Besides that, any other detail that you discovered.

Once there is a PR it is possible to involve more people to help with tests.

atnon commented 11 months ago

I'll try to take a look at creating a PR for thiswhen time allows, hopefully sometime the coming week. Thanks for the input!

github-actions[bot] commented 9 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

jhedberg commented 9 months ago

@atnon any update? Thanks.

nandojve commented 8 months ago

Hi @atnon , We are closing next release and i was wondering if you will send a PR to fix your findings.

vlm-laba7 commented 7 months ago

I am experiencing the same issues on atsamd21g18a zephyr v3.5.0 - cpu gets stuck in uart_irq_update forever. Not sure why this is labeled as low priority

nandojve commented 7 months ago

Hi @atnon ,

Any update?

github-actions[bot] commented 5 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

github-actions[bot] commented 3 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.