Closed: evenl closed this issue 2 years ago
I was finally able to attach to the device when it got stuck (deadlocked), after compiling the project with optimization for debug. This is the backtrace:
(gdb) bt
#0 0x00009460 in set_absolute_alarm (chan=chan@entry=0, abs_val=abs_val@entry=4564014) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:217
#1 0x000046b8 in compare_set_nolocks (chan=chan@entry=0, target_time=<optimized out>, handler=handler@entry=0x45b9 <sys_clock_timeout_handler>, user_data=user_data@entry=0x0 <z_impl_z_errno>) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:280
#2 0x000094ea in compare_set (chan=chan@entry=0, target_time=<optimized out>, handler=handler@entry=0x45b9 <sys_clock_timeout_handler>, user_data=user_data@entry=0x0 <z_impl_z_errno>) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:304
#3 0x000048a8 in sys_clock_set_timeout (ticks=<optimized out>, idle=idle@entry=false) at /home/ted/repo/nomono/zephyr/drivers/timer/nrf_rtc_timer.c:566
#4 0x00007346 in z_add_timeout (to=to@entry=0x200002b8 <log_process_thread_timer>, fn=fn@entry=0x7461 <z_timer_expiration_handler>, timeout=...) at /home/ted/repo/nomono/zephyr/kernel/timeout.c:135
#5 0x00007570 in z_impl_k_timer_start (timer=timer@entry=0x200002b8 <log_process_thread_timer>, duration=..., period=...) at /home/ted/repo/nomono/zephyr/kernel/timer.c:130
#6 0x0000162a in k_timer_start (period=..., duration=..., timer=0x200002b8 <log_process_thread_timer>) at zephyr/include/generated/syscalls/kernel.h:441
#7 z_log_msg_post_finalize () at /home/ted/repo/nomono/zephyr/subsys/logging/log_core.c:236
#8 0x00001774 in msg_finalize (src_level=..., msg=0x20007ac8 <log_msg_pool_buf+992>) at /home/ted/repo/nomono/zephyr/subsys/logging/log_core.c:267
#9 log_1 (str=str@entry=0xad48 "Received message of %u bytes.", arg0=10, src_level=...) at /home/ted/repo/nomono/zephyr/subsys/logging/log_core.c:296
#10 0x000004f0 in endpoint_cb (ept=<optimized out>, data=<optimized out>, len=<optimized out>, src=<optimized out>, priv=0x0 <z_impl_z_errno>) at ../src/iccom_master.c:41
#11 0x00009fee in rpmsg_virtio_rx_callback (vq=<optimized out>) at /home/ted/repo/nomono/modules/lib/open-amp/open-amp/lib/rpmsg/rpmsg_virtio.c:503
#12 0x00009c20 in virtqueue_notification (vq=<optimized out>) at /home/ted/repo/nomono/modules/lib/open-amp/open-amp/lib/virtio/virtqueue.c:705
#13 0x00001eda in ipm_callback_process (work=<optimized out>) at /home/ted/repo/nomono/zephyr/subsys/ipc/rpmsg_service/rpmsg_backend.c:152
#14 0x00006860 in work_queue_main (workq_ptr=0x20000390 <ipm_work_q>, p2=<optimized out>, p3=<optimized out>) at /home/ted/repo/nomono/zephyr/kernel/work.c:649
#15 0x00007ca2 in z_thread_entry (entry=0x67d9 <work_queue_main>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/ted/repo/nomono/zephyr/lib/os/thread_entry.c:36
#16 0x00007ca2 in z_thread_entry (entry=0x67d9 <work_queue_main>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/ted/repo/nomono/zephyr/lib/os/thread_entry.c:36
And info registers:
(gdb) info registers
r0 0xc4a427 12887079
r1 0x45a42e 4564014
r2 0xc4a427 12887079
r3 0x50015000 1342263296
r4 0x45a42e 4564014
r5 0x45a42e 4564014
r6 0x0 0
r7 0xc4a427 12887079
r8 0x10005 65541
r9 0x7fff 32767
r10 0x0 0
r11 0x0 0
r12 0xfffffffe 4294967294
sp 0x20001bd8 0x20001bd8 <ipm_stack_area+1744>
lr 0x945f 37983
pc 0x9460 0x9460 <set_absolute_alarm+14>
xpsr 0x21000000 553648128
msp 0x20004648 536888904
psp 0x20001bd8 536878040
primask 0x0 0
basepri 0x20 32
faultmask 0x0 0
control 0x2 2
fpscr 0x0 0
msp_s 0x20004648 536888904
msplim_s 0x20003e48 536886856
psp_s 0x20001bd8 536878040
psplim_s 0x20001508 536876296
msp_ns 0x0 0
msplim_ns 0x0 0
psp_ns 0xfffffffc 4294967292
psplim_ns 0x0 0
control_s 0x2 2
faultmask_s 0x0 0
basepri_s 0x20 32
primask_s 0x0 0
control_ns 0x0 0
faultmask_ns 0x0 0
basepri_ns 0x0 0
primask_ns 0x0 0
As usual, the first thing to check is stack overflow. Can you please double all stacks (CONFIG_**_STACK_SIZE) and see if you can still reproduce?
I have tried already, but let's repeat to be sure.
CONFIG_MAIN_STACK_SIZE=16365
CONFIG_IDLE_STACK_SIZE=1024
CONFIG_ISR_STACK_SIZE=4096
CONFIG_RPMSG_SERVICE_WORK_QUEUE_STACK_SIZE=4096
These are the stacks I have doubled. With the new stack sizes I still see lockups and crashes.
I am using this application to recreate the issue: https://drive.google.com/file/d/13K27EFmfROnrtFlz90qgHRxxeQ1VxWaZ/view?usp=sharing
I am using NCS 1.9.1, but I saw the same issue on NCS 1.7.0.
I have not tested this application on an nrf5340dk yet, so I am not sure if it will work there. The issue also seems to occur with very different frequency on different devices; we were just lucky and found a "golden sample" that hits this issue at a high rate.
I tried to reproduce the issue with the above application on an nRF5340 DK, using NCS 1.9.1, but I have not succeeded so far. The application seems to run just fine (I waited for about 30 minutes and did not see it crash), even when I shortened the delay for the work item submitted on the network core (I tried with e.g. 200 or 10 ms). Should I perform some additional action or change some configuration options to trigger the crash?
Sorry for the slow response, I have been on Easter vacation. Thanks for looking at my case.
Yes, it is very hard to reproduce this issue in a controlled environment. As mentioned before, it seems to be hardware dependent: we have currently found two devices here at the office that show this behavior quickly, but all devices show it over time. How often the issue appears also seems to depend heavily on the timing in the code, but it will always appear at some point.
If I change the timing as you have done, I can make the issue seem to disappear, but it just happens less frequently. That is what makes this issue so hard: if I enable stack canaries or take other measures to debug it, the issue may become less frequent.
In some cases I have had "luck" by enabling things that create interrupts; for instance, enabling async UART made our firmware crash all the time (I was trying to use the async UART backend in the logger), so you could try something like that. But it is very easy to move the frequency of the issue in both directions by changing the code.
I have not yet received the nrf5340dk kits I ordered, so I cannot do any testing on an nrf5340dk kit on my end. I am not sure if it would help if I made a recording of how the device fails, or if we could do a remote session so I could show you how the issue behaves?
After some more testing, it seems like the bug may be located in, or related to, the logging system. In my setup I am using the UART backend.
I have tried to run the device that usually fails within 15-20 seconds without logging (CONFIG_LOG=n), and it ran for 12 hours before I stopped it manually. Enabling logging again made it fail within 15-20 seconds as expected.
I am also running a test now with two devices, one running my test application and one running our production firmware, both with logging set to minimal mode. They have now been running for 5.5 hours, which is far beyond what any of these devices has managed before.
After some more investigation, it seems that the issue we are seeing is the result of the EasyDMA in one of the UARTs and the CPU accessing the same memory. This originates from our own code, where we are trying to use one of the UARTs as a slave on a multi-device single-wire UART bus. I'll close this issue now, as it is not related to any Zephyr code.
I'm also facing the same situation: the application core crashes randomly after running for a while, while the network core keeps working. Can you suggest some more ways to fix this? @evenl
I have an application that has a few threads, uses the work queue regularly, and uses rpmsg for communication between the network core and the application core. This is software in development, and new features are added all the time. Lately we have seen random crashes or complete deadlocks on the application core. The network core does not seem to be affected by this issue.
I have a delayable work item running on the network core that sends a message to the application core every second. If I disable this work item, the issue becomes less frequent.
After some investigation I ended up in this function -> https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/logging/log_output.c#L627. I saw that I was always able to run down to line 644, but never past line 645. Removing line 645 helped a lot, but it did not remove the issue; it just became less frequent. So I thought that we were somehow spending too much time in there. In some cases I saw (measuring with k_uptime_get()) that we spent 60 ms in this function. At that time we were running NCS 1.7.0; I saw that NCS 1.9.1 added async UART support to the UART logger backend, and I wanted to try that to maybe lower the number of CPU cycles spent in this function.
Upgrading the project was pretty straightforward, but when I enabled the async UART backend the problem got a lot worse. To me it seems to be related to the number of interrupts being fired. As before, if I disable the message that the network core sends every second, the issue becomes less frequent.
If I enable CONFIG_ASSERT=y, I sometimes get this message when the application crashes.
Sometimes it freezes and I must use nrfjprog --recover to get access to the device again. Sometimes it just reboots without any error message. Sometimes I get a double bus fault.
It seems pretty random.
Unfortunately I don't currently have an application that reproduces the issue on an nrf5340dk. I have also managed to capture a few coredumps on NCS 1.7.0 (I haven't tried on NCS 1.9.0 yet), but I am not sure if these make much sense without the ELF file.
To me it looks somewhat similar to https://github.com/zephyrproject-rtos/zephyr/issues/44349 and https://github.com/zephyrproject-rtos/zephyr/issues/30074.
On NCS 1.7.0 it almost always failed in the ipm_work_q thread; on NCS 1.9.1 it usually lists either the sysworkq or flash_id, which is one of my application threads. I use the system work queue to communicate with an LED driver device over I2C regularly, and the flash_id thread runs in a loop: it checks a message queue for data (always empty in this case), checks for data on the I2S module, and then sleeps for 1 ms.