Closed SebastianBoe closed 5 years ago
@galak : Since this is a regression, and since disabling optimizations is a very common use-case in embedded debugging, I wouldn't prioritize this with 'low' priority.
@SebastianBoe all commit 251d991 does is that it doesn't anymore add some extra to Bluetooth stack sizes when logging is enabled (due to the new deferred logging subsystem the stack footprint should be much smaller), i.e. the most probable cause of the issue is a stack overflow. Could you check which stack (possibly multiple) is in question? I wonder if we should just leave this to app developers to set appropriate stack sizes in their prj.conf, or if there should be some conditions built into Kconfig and CONFIG_NO_OPTIMIZATION that influence the default stack sizes. I'm a bit cautious about the latter since there could be many things that influence the stack consumption, i.e. the conditionals may end up becoming quite complex if we try to support all possible scenarios.
I will check which stack is in question, though I do not know the way.
@SebastianBoe the most user friendly way to check stack consumption (that I've found) is using the "kernel stacks" shell command together with the following Kconfig options:
CONFIG_THREAD_STACK_INFO=y
CONFIG_THREAD_MONITOR=y
CONFIG_KERNEL_SHELL=y
CONFIG_INIT_STACKS=y
CONFIG_THREAD_NAME=y
Though the pre-requisite is that you first have sufficiently big stacks. I don't think you need to increase each one of them though - looking at your logs it seems likely that the culprit is the Bluetooth RX stack, i.e. CONFIG_BT_RX_STACK_SIZE
needs increasing.
Thank you, will give it a go.
@SebastianBoe btw, instead of using peripheral_hr it might be quicker to start with tests/bluetooth/shell
. Note that unlike the non-shell samples, you need to issue "bt init" to initialize Bluetooth
@SebastianBoe assuming you'll be able to narrow this down, and it's indeed a stack overflow, do you have any preferred ways to fix it? I.e. should we have the needed conditions in Kconfig or #ifdefs in the code, or should this simply be left up to app developers?
Dynamically sized stacks? ;)
I don't think there is a silver bullet, so I think we need to do a combination of;
improve stack protection (are we HW watching the memory area that follows the currently active stack?).
improve error reporting (Are we giving thread addresses, whilst thread names are known?)
improve Kconfig guesswork (Are our guesses reasonable, given the known configuration?)
But I don't know, I've been lucky enough to not have to debug too many blown stacks.
@jhedberg : For future reference.
Measuring the stack sizes wasn't very helpful, because the stack had already overflowed by the time I had time to check the sizes.
But this option gave me the thread that was corrupting the other threads:
CONFIG_STACK_SENTINEL
it produced this log output:
***** Stack Check Fail! *****
Current thread ID = 0x2000046c
Faulting instruction address = 0x347be
When looking up the thread ID in the zephyr.map file you get:
.bss.tx_thread_data
and when looking up the instruction in the zephyr.lst file you get:
void radio_rx_enable(void)
{
347ba: b580 push {r7, lr}
347bc: af00 add r7, sp, #0
nrf_radio_task_trigger(NRF_RADIO_TASK_RXEN);
347be: 2004 movs r0, #4
347c0: f7ff fdce bl 34360 <nrf_radio_task_trigger>
}
which indicates that is a stack related to BT TX. But it's not clear to me how a user should find the Kconfig option BT_HCI_TX_STACK_SIZE based on this information.
I found it by cleaning up the offending commits history, and bisecting it until it was clear that changing BT_HCI_TX_STACK_SIZE caused the corruption.
Perhaps if the kernel knew both the thread name, and the Kconfig option that is controlling the thread's size, we would have something useful.
@SebastianBoe that's a good analysis, thanks, and I agree that it'd be nice if the kernel would be able to give more useful info on this kind of stuff, especially since it should know most of the info (except the Kconfig option, for now). I didn't know of CONFIG_STACK_SENTINEL previously - a good find! I wonder if all these tricks and options are documented anywhere? At least the options I mentioned I had to find by trial-and-error (which could also mean that I'm bad at searching for documentation :)
Btw, I suspect there might be other stacks in your configuration that are prone to overflow with CONFIG_NO_OPTIMIZATION, it's just that the TX thread is the first one to fail.
wonder if all these tricks and options are documented anywhere?
And by that I mean something more than the Kconfig help text. There could be a dedicated section in the debugging documentation about how to best analyze stack overflows and stack usage.
Calculating stack sizes is a difficult problem.
I'll fix it for this particular case, just as GCOV did[0], and leave it to the next guy to find a general solution.
[0] https://github.com/zephyrproject-rtos/zephyr/commit/e908ea9aa51ba7f88dff7dd610076b4dc352d3cf
Describe the bug After commit 251d99132d9aaabffbcf4b29fa8b273697f1f2d9 enabling CONFIG_NO_OPTIMIZATIONS will cause a usage fault to occur in samples/bluetooth/peripheral_hr.
To Reproduce Steps to reproduce the behavior:
Expected behavior Don't see usage fault.
Screenshots or console output
Environment (please complete the following information):
Additional context 251d99132d9aaabffbcf4b29fa8b273697f1f2d9 authored by @jhedberg broke something.