zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.49k stars 6.42k forks source link

uart async api does not provide all received data #49021

Closed Rocky0102 closed 1 year ago

Rocky0102 commented 2 years ago

I use CONFIG_UART_ASYNC_API, but when UART_RX_RDY event triggered, evt->data.rx.len is smaller than the number of bytes it has received. I use dslogic tool to analysis serial signal, and print evt->data.rx.len+2 and can be sure that data is all there, but evt->data.rx.len is just not correct.

To Reproduce related configuations in prj.conf: CONFIG_SERIAL=y CONFIG_CBPRINTF_FP_SUPPORT=y CONFIG_UART_ASYNC_API=y CONFIG_UART_1_ASYNC=y

uart handler: void uart1_handler_callback(const struct device dev, struct uart_event evt, void *user_data) { // int err; ARG_UNUSED(dev);

switch (evt->type) {
case UART_TX_DONE:
    sent = evt->data.tx.len;
    ilog("UART_TX_DONE: %d bytes sent.", sent);
    hex_dump2("UART_TX_DONE: ", evt->data.tx.buf, evt->data.tx.len);
    k_sem_give(&tx_done);
    break;
case UART_TX_ABORTED:
    sent = evt->data.tx.len;
    ilog("UART_TX_ABORTED: %d bytes sent.", sent);
    k_sem_give(&tx_aborted);
    break;
case UART_RX_RDY:
    received = evt->data.rx.len;
    ilog("UART_RX_RDY: %d bytes recved at 0x%x.", received, (unsigned int)(evt->data.rx.buf+evt->data.rx.offset));
    hex_dump2("UART_RX_RDY: ", evt->data.rx.buf+evt->data.rx.offset, received);
    k_sem_give(&rx_rdy);
    break;
case UART_RX_DISABLED:
    ilog("UART_RX_DISABLED");
    // err = k_sem_take(&rx_buf_released, K_NO_WAIT);
    // failed_in_isr |= (err < 0);
    k_sem_give(&rx_disabled);
    break;
default:
    break;
}

}

Logs and console output Some log: [17.438] ========== sending packet, id = 12 ========== [17.447] UART_TX_DONE: 34 bytes sent. [17.450] UART_TX_DONE: 34 0c 00 a5 86 1b 00 17 00 04 00 09 07 0b 00 02 4............... [17.458] UART_TX_DONE: 0c 00 29 2a 0d 00 02 0e 00 24 2a 0f 00 02 10 00 ..).....$..... [17.465] UART_TX_DONE: 27 2a '*

[18.853] ========== receiving packet, id = 12 ========== [18.856] UART_RX_RDY: 33 bytes recved at 0x20001b2c. [18.860] UART_RX_RDY: 34 0c 00 a5 86 1b 00 17 00 04 00 09 07 0b 00 02 4............... [18.868] UART_RX_RDY: 0c 00 29 2a 0d 00 02 0e 00 24 2a 0f 00 02 10 00 ..).....$..... [18.875] UART_RX_RDY: 27 ' 0x2a is there acturally.

[17.730] ========== sending packet, id = 13 ========== [17.738] UART_TX_DONE: 20 bytes sent. [17.741] UART_TX_DONE: 34 0d 00 43 71 0d 00 09 00 04 00 09 07 11 00 02 4..Cq........... [17.749] UART_TX_DONE: 12 00 28 2a ..(*

[19.145] ========== receiving packet, id = 13 ========== [19.148] UART_RX_RDY: 20 bytes recved at 0x20001b4d. [19.152] UART_RX_RDY: 2a 34 0d 00 43 71 0d 00 09 00 04 00 09 07 11 00 *4..Cq.......... [19.160] UART_RX_RDY: 02 12 00 28 ...( it should have received 21 bytes, but it tell me 20, so another 0x2a at the end of packet is delayed.

[18.030] ========== sending packet, id = 14 ========== [18.037] UART_TX_DONE: 16 bytes sent. [18.041] UART_TX_DONE: 34 0e 00 19 bb 09 00 05 00 04 00 01 08 12 00 0a 4...............

[19.442] ========== receiving packet, id = 14 ========== [19.445] UART_RX_RDY: 16 bytes recved at 0x20001b61. [19.450] UART_RX_RDY: 2a 34 0e 00 19 bb 09 00 05 00 04 00 01 08 12 00 *4.............. 0x0a is delayed too.

Impact When protocal packet has fixed sizes, I can use a litten trick to work around this, but in other cases it's blocking my way. It happens quite often, not always though, which means it should be a bug.

Environment (please complete the following information):

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

larson-whisper commented 1 year ago

I have been able to isolate the cause of this issue to the following:

When running in ASYNC mode and without HW_RX_COUNTING_ENABLED there is a case verified with external pin toggling on a logic analyzer and event tracing added to the UARTE driver where the endrx_isr will update the data->async->rx_total_user_byte_cnt field beyond the data->async->rx_cnt.cnt counter value kept by the NRF_UARTE_EVENT_RXDRDY checks. I am not 100% sure how this is happening. Perhaps there is a missing NRF_UARTE_EVENT_RXDRDY in the traces, or the ISR is being held too long such that multiple NRF_UARTE_EVENT_RXDRDY events collapse into one causing at least an off by one error in the data tracking. This off by one error ends up with the rx_timeout routine not being able to do it's job and notify the client that there is in fact one more byte of data in the RX buffer so it could be processed.

In our case this last byte of data was our EOF identifier which would cause one packet of data to be held waiting for the EOF to decode the packet. This would cause a command timeout which would get resolved once new UART data was TXed into the device.

I am able to resolve the issue as we have seen it by adding the following code into the endrx_isr path, which I assume is safe:

` / Original code / data->async->rx_total_user_byte_cnt += rx_len;

/* If we are not HW counting the data, data->async->rx_cnt.cnt should be at least what was reported to the user */
if(!HW_RX_COUNTING_ENABLED(data) && (data->async->rx_cnt.cnt < data->async->rx_total_user_byte_cnt)) {
    data->async->rx_cnt.cnt = data->async->rx_total_user_byte_cnt;
}

`

This addition works under the assumption that ALL RXRDY events for the prior buffer would be delivered and processed before the ENDRX EVENT is generated and processed. I am not 100% sure of this however. The check added above is counting on this fact, and empirically does resolve the observed issue with over 21000 1024 byte packets being sent without error for our use case, where as before the change we would see a stalled RX side every 200-1600 packets.

Thoughts and other insight welcome.

larson-whisper commented 1 year ago

Missing RXRDY Event

I have been able to capture a trace where we are missing the RXRDY event for one byte of data in between 2 ENDRX events. This is the eventual cause of the off by one error when we process the next ENDRX event in the driver. Given this capture, I think it is safe to keep the data->async->rx_cnt.cnt value up to date with the data->async->rx_total_user_byte_cnt as calculated in the ENDRX handler.

The missing RXRDY event should be between the 0x31 and 0x32 bytes on the UART nRF In line of the screenshot included.

However if the RXRDY event is dropped before a ENDRX will occur (no more data is coming in), the stall will still occur as there is no corresponding event that will update the count.

The setup of the debug pins on the LA is as follows:

DEBUG_PIN_TIMEOUT - Goes HIGH while running the UARTE rx_timeout function DEBUG_PIN_RXRDY - Goes high while processing the UARTE events (RXRDY, ENDRX, RXSTARTED).

Sorry that we do not have more pins than that available, so I had to double up on the UART events, but with experience it is still possible to read. Given that I have exonerated the rx_timeout behavior, I am happy to recapture a new trace and use the DEBUG_PIN_TIMEOUT pin for another use (dedicated RXRDY perhaps?)

larson-whisper commented 1 year ago

I changed the UARTE driver to count the number of RXRDY events in between each ENDRX event and the data corresponds to what is expected (except I do not see a case where there are more than 64 (the size of our RX buffers) RXRDY events per ENDRX as the code comments suggest may happen):

When the failure occurs I always see a transfer with 63 bytes instead of 64 bytes soon before the failure. This can always be correlated with the logic analyzer capture which shows we missed a RXRDY event.

Got ENDRX event after 64 RXRDY events. Got ENDRX event after 64 RXRDY events. Got ENDRX event after 64 RXRDY events. Got ENDRX event after 63 RXRDY events. Got ENDRX event after 64 RXRDY events. Got ENDRX event after 64 RXRDY events. Got ENDRX event after 64 RXRDY events.

larson-whisper commented 1 year ago

I have enabled a timer to enable the HW_RX_COUNTING_ENABLED check, and now I cannot see the RXRDY events (expected), and empirically the stall when receiving UART traffic does not happen.

@Rocky0102 - You may have the same success for your reported issue. I enabled the HW RX COUNT with the following additions to my project.conf file:

` CONFIG_UART_1_NRF_HW_ASYNC=y

CONFIG_UART_1_NRF_HW_ASYNC_TIMER=2 `

I believe this issue is still valid for the case where HW based RX counting is not enabled.

There are probably a number of holes with the SW counting mechanism:

  1. Lost RXRDY events are not recoverable since the AMOUNT register in the UART controller is not updated outside of ENDRX events from my experimentation. Perhaps this could be solved by introducing a flush in the rx_timeout handler to force the ENDRX to occur. Not 100% sure here.
  2. Disabling IRQs for greater than the duration of a single byte on the UART line for the given baud rate could result in multiple RXRDY events in a single notification, and if this is the case the SW handling could not possibly account for this.
  3. SW counting uses uint32_t counts, and for a long running system rollovers here do not appear to be handled well, and will likely cause issues as there are assumptions that do not appear at first glance to handle rollovers well (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/serial/uart_nrfx_uarte.c#L1066-L1071)

At this point we are proceeding with the PPI/Timer HW byte counting implementation and are counting on that working. Leaving these thoughts and notes here as breadcrumbs for others to follow when operating without HW RX byte counting.

warasilapm commented 1 year ago

@larson-whisper Thanks so much for digging into this. I ran into this immediately on an nrf52 device with a large amount of processing going on in their interrupt stacks and uart0 operating at 1 Mbaud. Without the hardware timers, I was only getting about half of the correct count. Once the hardware timer was enabled, everything worked perfectly. At 64 MHz and 1 Mbaud, my use case only gives a 576 clock cycles to handle the IRQ, so it's no surprise it was missing them.

@nordic-krch Clearly, this is an understood issue, as the KConfigs provided with the nrfx UART/UARTE drivers indicate that the counting may be incorrect with heavy interrupt use or for CONFIG_UART_X_NRF_HW_ASYNC. My use case was never going to work using the SW counting, but I still agree with @larson-whisper that the SW counting mechanism could probably use some improvements. Perhaps there's some way to imply or default to hardware counting for higher baud rates and improve the documentation on this issue?