Closed: evenl closed this issue 2 years ago
I was finally able to attach to the device when it got stuck (deadlocked), after compiling the project with optimization for debug. This is the backtrace:
(gdb) bt
#0 0x00009460 in set_absolute_alarm (chan=chan@entry=0, abs_val=abs_val@entry=4564014) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:217
#1 0x000046b8 in compare_set_nolocks (chan=chan@entry=0, target_time=<optimized out>, handler=handler@entry=0x45b9 <sys_clock_timeout_handler>, user_data=user_data@entry=0x0 <z_impl_z_errno>) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:280
#2 0x000094ea in compare_set (chan=chan@entry=0, target_time=<optimized out>, handler=handler@entry=0x45b9 <sys_clock_timeout_handler>, user_data=user_data@entry=0x0 <z_impl_z_errno>) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:304
#3 0x000048a8 in sys_clock_set_timeout (ticks=<optimized out>, idle=idle@entry=false) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:566
#4 0x00007346 in z_add_timeout (to=to@entry=0x200002b8 <log_process_thread_timer>, fn=fn@entry=0x7461 <z_timer_expiration_handler>, timeout=...) at /home/ted/repo/nomono/zephyr/kernel/timeout.c:135
#5 0x00007570 in z_impl_k_timer_start (timer=timer@entry=0x200002b8 <log_process_thread_timer>, duration=..., period=...) at /home/ted/repo/nomono/zephyr/kernel/timer.c:130
#6 0x0000162a in k_timer_start (period=..., duration=..., timer=0x200002b8 <log_process_thread_timer>) at zephyr/include/generated/syscalls/kernel.h:441
#7 z_log_msg_post_finalize () at /home/ted/repo/nomono/zephyr/subsys/logging/log_core.c:236
#8 0x00001774 in msg_finalize (src_level=..., msg=0x20007ac8 <log_msg_pool_buf+992>) at /home/ted/repo/nomono/zephyr/subsys/logging/log_core.c:267
#9 log_1 (str=str@entry=0xad48 "Received message of %u bytes.", arg0=10, src_level=...) at /home/ted/repo/nomono/zephyr/subsys/logging/log_core.c:296
#10 0x000004f0 in endpoint_cb (ept=<optimized out>, data=<optimized out>, len=<optimized out>, src=<optimized out>, priv=0x0 <z_impl_z_errno>) at ../src/iccom_master.c:41
#11 0x00009fee in rpmsg_virtio_rx_callback (vq=<optimized out>) at /home/ted/repo/nomono/modules/lib/open-amp/open-amp/lib/rpmsg/rpmsg_virtio.c:503
#12 0x00009c20 in virtqueue_notification (vq=<optimized out>) at /home/ted/repo/nomono/modules/lib/open-amp/open-amp/lib/virtio/virtqueue.c:705
#13 0x00001eda in ipm_callback_process (work=<optimized out>) at /home/ted/repo/nomono/zephyr/subsys/ipc/rpmsg_service/rpmsg_backend.c:152
#14 0x00006860 in work_queue_main (workq_ptr=0x20000390 <ipm_work_q>, p2=<optimized out>, p3=<optimized out>) at /home/ted/repo/nomono/zephyr/kernel/work.c:649
#15 0x00007ca2 in z_thread_entry (entry=0x67d9 <work_queue_main>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/ted/repo/nomono/zephyr/lib/os/thread_entry.c:36
#16 0x00007ca2 in z_thread_entry (entry=0x67d9 <work_queue_main>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/ted/repo/nomono/zephyr/lib/os/thread_entry.c:36
And info registers:
(gdb) info registers
r0 0xc4a427 12887079
r1 0x45a42e 4564014
r2 0xc4a427 12887079
r3 0x50015000 1342263296
r4 0x45a42e 4564014
r5 0x45a42e 4564014
r6 0x0 0
r7 0xc4a427 12887079
r8 0x10005 65541
r9 0x7fff 32767
r10 0x0 0
r11 0x0 0
r12 0xfffffffe 4294967294
sp 0x20001bd8 0x20001bd8 <ipm_stack_area+1744>
lr 0x945f 37983
pc 0x9460 0x9460 <set_absolute_alarm+14>
xpsr 0x21000000 553648128
msp 0x20004648 536888904
psp 0x20001bd8 536878040
primask 0x0 0
basepri 0x20 32
faultmask 0x0 0
control 0x2 2
fpscr 0x0 0
msp_s 0x20004648 536888904
msplim_s 0x20003e48 536886856
psp_s 0x20001bd8 536878040
psplim_s 0x20001508 536876296
msp_ns 0x0 0
msplim_ns 0x0 0
psp_ns 0xfffffffc 4294967292
psplim_ns 0x0 0
control_s 0x2 2
faultmask_s 0x0 0
basepri_s 0x20 32
primask_s 0x0 0
control_ns 0x0 0
faultmask_ns 0x0 0
basepri_ns 0x0 0
primask_ns 0x0 0
As usual, the first thing to check is stack overflow. Can you please double all stacks (CONFIG_**_STACK_SIZE) and see if you can still reproduce?
I have tried already, but let's repeat to be sure.
CONFIG_MAIN_STACK_SIZE=16365
CONFIG_IDLE_STACK_SIZE=1024
CONFIG_ISR_STACK_SIZE=4096
CONFIG_RPMSG_SERVICE_WORK_QUEUE_STACK_SIZE=4096
These are the stacks I have doubled. With the new stack sizes I still see lockups and crashes.
I am using this application to recreate the issue: https://drive.google.com/file/d/13K27EFmfROnrtFlz90qgHRxxeQ1VxWaZ/view?usp=sharing
I am using NCS 1.9.1, but I saw the same issue on NCS 1.7.0.
I have not tested this application on an nrf5340dk yet, so I am not sure if it will work there. The issue also seems to occur with very different frequency on different devices; we were just lucky and found a "golden sample" that hits this issue at a high rate.
I tried to reproduce the issue with the above application on an nRF5340 DK, using NCS 1.9.1, but I have not succeeded so far. The application seems to run just fine (I waited for about 30 minutes and did not see it crash), even when I shortened the delay for the work item submitted on the network core (I tried with e.g. 200 or 10 ms). Should I perform some additional action or change some configuration options to trigger the crash?
Sorry for the slow response, I have been on Easter vacation. Thanks for looking at my case.
Yes, it is very hard to reproduce this issue in a controlled environment. As mentioned before, it seems to be hardware dependent: we have currently found two devices here at the office that show this behavior quickly, but all devices show it over time. How often the issue appears also seems to depend heavily on the timing in the code, but it will always appear at some point.
If I change the timing as you have done, I can make the issue seem to disappear, but it just happens less frequently. That is what makes this issue so hard: if I enable stack canaries or take other measures to debug it, the issue may become less frequent.
In some cases I have had "luck" by enabling things that create interrupts; for instance, enabling async UART made our firmware crash all the time (I was trying to use the async UART backend in the logger), so you could try something like that. But it is very easy to move the frequency of the issue in both directions by changing the code.
I have not yet received the nrf5340dk kits I ordered, so I cannot do any testing on an nrf5340dk kit on my end. I am not sure if it would help if I made a recording of how the device fails, or if we could do a remote session so I could show you how the issue behaves?
After some more testing, it seems like the bug may be located in, or related to, the logging system. In my setup I am using the UART backend.
I have tried to run the device that usually fails within 15-20 seconds without logging (CONFIG_LOG=n), and it ran for 12 hours before I stopped it manually. Enabling logging again made it fail within 15-20 seconds as expected.
I am also running a test now with two devices, one running my test application and one running our production firmware, both with logging set to minimal mode. They have now been running for 5.5 hours, which is far beyond what any of these devices has managed before.
After some more investigation, it seems that the issue we are seeing is the result of the EasyDMA in one of the UARTs and the CPU accessing the same memory. This originates from our own code, where we are trying to use one of the UARTs as a slave on a multi-device single-wire UART bus. I'll close this issue now, as it is not related to any Zephyr code.
I'm also facing the same situation: the application core crashes randomly after running for a while, while the network core keeps working. Can you suggest some more ways to fix this? @evenl
I have an application that has a few threads, uses the work queue regularly, and uses rpmsg for communication between the network core and the application core. This is software in development, and new features are added all the time. Lately we have seen random crashes or complete deadlocks on the application core. The network core does not seem to be affected by this issue.
I have a delayable work item running on the network core that sends a message to the application core every second. If I disable this work item, the issue becomes less frequent.
After some investigation I ended up in this function -> https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/logging/log_output.c#L627. I saw that I was always able to run down to line 644, but never past line 645. Removing line 645 helped a lot, but it did not remove the issue; it just became less frequent. So I thought that we were somehow spending too much time in there. In some cases I saw (measuring with k_uptime_get()) that we spent 60 ms in this function. At that time we were running NCS 1.7.0; I saw that NCS 1.9.1 added async UART support to the UART logger backend, and I wanted to try that to maybe lower the number of CPU cycles spent in this function.
Upgrading the project was pretty straightforward, but when I enabled the async UART backend the problem got a lot worse. To me it seems to be related to the number of interrupts being fired. As before, if I disable the message that the network core sends every second, the issue becomes less frequent.
If I enable CONFIG_ASSERT=y, I sometimes get this message when the application crashes.
Sometimes it freezes and I must use nrfjprog --recover to get access to the device again. Sometimes it just reboots without any error message. Sometimes I get a double bus fault.
It seems pretty random.
Unfortunately I don't currently have an application that reproduces the issue on an nrf5340dk. I have also managed to capture a few coredumps on NCS 1.7.0 (I haven't tried on NCS 1.9.0 yet), but I am not sure if these make much sense without the ELF file.
To me it looks somewhat similar to https://github.com/zephyrproject-rtos/zephyr/issues/44349 and https://github.com/zephyrproject-rtos/zephyr/issues/30074.
On NCS 1.7.0 it almost always failed in the ipm_work_q thread; on NCS 1.9.1 it usually lists either the sysworkq or flash_id, which is one of my application threads. I use the system work queue to communicate with an LED driver device over I2C regularly, and the flash_id thread runs in a loop: it checks a message queue for data (always empty in this case), checks for data on the I2S module, and then sleeps for 1 ms.