zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.48k stars 6.41k forks source link

modem_cellular: Receive buffer overrun in ISR causes network stall #66171

Closed emillindq closed 8 months ago

emillindq commented 9 months ago

Describe the bug Using the recently implemented modem_cellular/PPP to connect to web services (over TCP & MQTT), using a U-blox SARA-R5 modem (baudrate 921600 with flowcontrol).

When stress-testing the system (MQTT publish, 300-ish bytes payload, as often as possible, not concurrent), eventually the following warning comes. NOTE: The actual message is Receive buffer overrun however I added ISR to clarify where it happens (it's in two places). It comes from modem_backend_uart_isr.c. image

At this point, net_l2_ppp continues to re-transmit its messages (send L2) but it never receives any. I can see on the actual UART communication that the modem replies. For example, among the PPP jibberish, I can see a DNS request, and I see the modem replies with the DNS response but it never reaches the PPP L2 stack. From this point on, no packets inbound are seen.

In addition, from this point the CEREG poll times out all the time, and on the MQTT stack I've seen "Transport read error" from mqtt_rx.c sometimes. image

Using STM32H743BIT6 (own PCB) with U-blox SARA-R5. Probably not a regression, always seen this problem.

To Reproduce Basically cause the uart receive buffer to overrun using TCP: stress-test the system, at a high baudrate.

Expected behavior UART overruns and dropped frames shouldn't cause receive to lock up. Impact Cellular modem not usable.

Environment (please complete the following information):

MaureenHelm commented 9 months ago

@bjarki-trackunit can you assess the priority please?

bjarki-andreasen commented 9 months ago

I'm already looking into it :)

bjarki-andreasen commented 9 months ago

Could you apply this PR #66239 and see if it allows the CMUX to recover?

emillindq commented 9 months ago

Could you apply this PR #66239 and see if it allows the CMUX to recover?

Thanks for this and sorry for delayed response, busy day yesterday...

I tried this out a bit yesterday and this morning, and it's not a fundamental difference. At one point the PR failed directly on initialization (DNS lookup/TLS handshake) and then second test after 40min. I've been running the same test without the PR and it has now been running for 26min.

This is the log from when the PR failed (forgive the "emils" log module, I had to disable PPP logging) image and then directly the periodic CEREG poll work times out, no more MQTT traffic image and then MQTT fails 20s later as well image

bjarki-andreasen commented 9 months ago

What hardware are you using? STM32, NRF53, ... and when it stops working, could you check the following:

  1. What value is the receive_state member of the modem_cmux instance?
  2. Is static void modem_cmux_receive_handler(struct k_work *item) in modem_cmux.c being called?
  3. Is static void modem_backend_uart_isr_irq_handler_receive_ready(struct modem_backend_uart *backend) in modem_backend_uart_isr.c being called?

This would be very helpful info to isolate the bug :)

emillindq commented 9 months ago

Using STM32h743BIT6 with U-blox SARA-R5.

Just some last-minute results before weekend. My test is to print requested info once every second, as well as in the ISR where the error happens (when it happens)

This happened right in the beginning when a HTTP client tried to resolve IP image no more logged activity of the modem things, before this periodic info image again image a few more times, this is the last one image until image then this image and this image

And while writing this, still printing image

emillindq commented 9 months ago

Code for periodic printer (I have a counter increase when functions are called) image and for the ISR image

bjarki-andreasen commented 9 months ago

Thanks for this! I will keep working towards finding the bug (or bugs, who knows) :)

emil-lindq commented 8 months ago

Hi @bjarki-trackunit, I found some time to get back to this.

I think I found the bug, the receiving end is in in MODEM_CMUX_RECEIVE_STATE_RESYNC_1 when we try to send an AT command. It turns out the bug is quite easy to provoke, just stall the system workq for a while, in my case 4 seconds, while it's communicating with the internet. It basically triggers when the modem sends data to us while we are stalled (I think), and get a buffer overrun.

The consequence is that it's allowed to send commands even if the receiving cmux is out of sync. What happens is the following in this state:

  1. We send a command over DLCI, for example AT+CESQ in this picture
  2. We receive the answer from the modem
  3. Just after we receive the first few bytes of the reply, we send 0x9F, 0x9F, 0x9F to resync
  4. Modem ignores this but we still receive a few 0x9F as part of the current message, causing us probably to go to different receive states but still always end up in RESYNC_1 image image

How do you propose we solve this? I'm a bit surprised it's even allowed to send any AT commands if the receiver it out of sync; it should resync before attempting to send any more commands right?

bjarki-andreasen commented 8 months ago

This should be fixed with https://github.com/zephyrproject-rtos/zephyr/pull/66239 which was merged today, except for the continuous sending, I will create a PR to block sending while out of sync :)

emil-lindq commented 8 months ago

This should be fixed with #66239 which was merged today, except for the continuous sending, I will create a PR to block sending while out of sync :)

FYI I pulled main as part of my other PR, and just happened to run these tests with that one merged

bjarki-andreasen commented 8 months ago

Try this patch https://github.com/zephyrproject-rtos/zephyr/pull/67216 I was able to reproduce the issue by blocking the work queue as well

emillindq commented 8 months ago

Try this patch #67216 I was able to reproduce the issue by blocking the work queue as well

It seems to be doing what you want it to do, however my modem doesn't seem to do what BG95 does. It doesn't reply to 9f 9f 9f however it seems to just drop it and continue like usual. Is the 9f 9f 9f part of "advanced mode"? I understand SARA-R5 might not implement those modes, regardless is there a more generally accepted method? I'm thinking other modems might behave similarly. Thanks!

https://www.3gpp.org/ftp/tsg_t/tsg_t/tsgt_04/docs/pdfs/TP-99119.pdf image

bjarki-andreasen commented 8 months ago

Ah, that explains it, I found the resync in this document section 5.1, and believed it to be in the standard, that's why the resync mechanism is only working on my BG95... I will look into updating the resync mechanism to just drop frames until it happens to be in sync again, see if that works.

jeffwelder-ellenbytech commented 8 months ago

I believe I'm experiencing something similar with my Telit Modem.

If another task runs long enough to get <wrn> modem_backend_uart: Receive buffer overrun, then I get hung up with <wrn> modem_chat: telit_me910g1_periodic_chat_script: timed out until the modem is rebooted.

Working through fixing my task, but commenting to follow the discussion.

bjarki-andreasen commented 8 months ago

Update:

I have reworked the resync mechanism to be passive, simply waiting for at least two consecutive flags (a stop followed by a start). Please test if this is working for your modems using PR #67382

This is working for my Quectel BG95.

A small note, I discovered where sending multiple frames when out of sync is sourced from: CMUX section 5.2.5 but I have no intention of relying on this since it appears its not universally used :)

emillindq commented 8 months ago

@bjarki-trackunit works great, good work! Recovers multiple times after each other using a 4 second delay on the workq

bjarki-andreasen commented 8 months ago

ah! finally!

bjarki-andreasen commented 8 months ago

@jeffwelder-ellenbytech could you confirm it is working for your modem as well?

jeffwelder-ellenbytech commented 8 months ago

It worked for me forcing an RX error in zephyr/subsys/modem/backends /modem_backend_uart_isr.c as suggested.

Attached is a snippet of the log:

[00:01:14.343,872] <wrn> modem_backend_uart: Receive buffer overrun
[00:01:14.355,773] <wrn> modem_backend_uart: Receive buffer overrun
[00:01:14.358,367] <wrn> modem_backend_uart: Receive buffer overrun
[00:01:14.358,581] <wrn> modem_cmux: Frame FCS error
[00:01:15.798,553] <wrn> modem_cmux: Dropped frame
[00:01:15.800,842] <wrn> modem_backend_uart: Receive buffer overrun
[00:01:18.334,197] <wrn> modem_chat: telit_me910g1_periodic_chat_script: timed out
[00:01:20.340,515] <wrn> modem_backend_uart: Receive buffer overrun
usb:~$ net ping 1.1.1.1
PING 1.1.1.1
[00:01:22.429,656] <wrn> modem_backend_uart: Receive buffer overrun
usb:~$ net28 bytes from 1.1.1.1 to 10.144.79.84: icmp_seq=1 ttl=48 time=319 ms
[00:01:23.627,197] <wrn> modem_backend_uart: Receive buffer overrun
usb:~$ net28 bytes from 1.1.1.1 to 10.144.79.84: icmp_seq=2 ttl=48 time=362 ms
28 bytes from 1.1.1.1 to 10.144.79.84: icmp_seq=3 ttl=48 time=161 ms

I was able to ping, and send CoAP packets past the error.

Thanks!