Open lnhao opened 2 months ago
@lnhao without a reproduction case it may be quite difficult for anybody to guess what the problem is. Are you able to provide one?
Based on the limited information so far, there is not much that I can add yet. That being said, it looks like the exception (which, according to the 'reason' in the debug dump, is a "precise data bus error") occurred at line 53 of swap.c in arch_swap(). This corresponds to:
return _current->arch.swap_return_value;
Not being an ARM expert, I am uncertain what causes this error. However, given the code location, I am suspicious of the _current pointer. Perhaps this pointer became corrupted in some way?
One other thing of note as far as execution flow goes: slightly earlier in arch_swap(), we generated a PendSV interrupt (necessary for scheduling). If I remember things correctly, PendSV is the lowest priority interrupt, which makes it ideal for scheduling. And in arch_swap(), it does not get serviced until the interrupts are unlocked via irq_unlock(0).
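The flow described above can be sketched as a small host-side model. This is a rough sketch under stated assumptions, not the actual swap.c: the struct layouts, `current_thread`, `pendsv_pending`, `irq_unlock_stub()` and `arch_swap_sketch()` are all hypothetical stand-ins, and the idea that the real function stores the lock key and a default return value before pending PendSV is an assumption based on the discussion rather than something confirmed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel's thread structures. */
struct thread_arch {
	unsigned int basepri;
	int swap_return_value;
};

struct k_thread {
	struct thread_arch arch;
};

static struct k_thread main_thread;

/* Stands in for the kernel's _current pointer. */
static struct k_thread *current_thread = &main_thread;

/* Stands in for the PendSV pending bit. */
static int pendsv_pending;

/* Stub: on hardware, unlocking interrupts is the point at which a pending
 * PendSV exception would be taken and another thread switched in.
 */
static void irq_unlock_stub(unsigned int key)
{
	(void)key;
	if (pendsv_pending) {
		pendsv_pending = 0;
	}
}

int arch_swap_sketch(unsigned int key)
{
	/* Only catches NULL; a corrupted non-NULL pointer would pass this check. */
	assert(current_thread != NULL);

	/* Assumed: save the lock key and a default return value. */
	current_thread->arch.basepri = key;
	current_thread->arch.swap_return_value = -EAGAIN;

	/* Request the context switch, then unlock so it can be serviced. */
	pendsv_pending = 1;
	irq_unlock_stub(0);

	/* The statement corresponding to the reported fault location: it
	 * dereferences the current-thread pointer after the switch back.
	 */
	return current_thread->arch.swap_return_value;
}
```

In this model the final dereference is the only place where a bad current-thread pointer would show up after the switch, which is consistent with the suspicion about `_current` above.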
Given the location, I do wonder if this is reproducible on other Cortex-M platforms, as this code is pretty foundational to them.
@aescolar @peter-mitsis thank you all. This problem is difficult to reproduce. On some devices it has not been reproduced yet; on others it can be reproduced after running for 24 hours. I cannot provide a sample at the moment; it will take some more time.
@lnhao - Figured I would check in and see if there is any additional information that can be added to this issue.
Hi @peter-mitsis @nashif @aescolar, another issue was found while testing the same program: the I2C interrupt was triggered continuously after the device had run for 20 hours, preventing other threads from running properly. This problem is easy to reproduce. I then tried disabling the I2C interrupt with CONFIG_I2C_STM32_INTERRUPT=n, and after running for more than 10 days the device still works fine. The Hard Fault has also disappeared. I think Zephyr's STM32 I2C interrupt handling malfunctions after prolonged operation.
Re-assigning to @erwango as this is looking like it is going in the direction of the STM32 I2C.
Maybe related #70077
Describe the bug This bug occurs irregularly, and I cannot predict when it will happen. Some devices have been running for more than 4 days and are still functioning normally, but others hit it after running for more than a day. During debugging, I found that the fault was entered from a call to k_msgq_get(). There is very little other useful information. It does not look like a thread stack overflow.
To Reproduce Steps to reproduce the behavior:
Expected behavior I don't think the program should enter the Hard Fault handler.
Impact The customer's device has also encountered this bug.
Logs and console output
Environment (please complete the following information):