zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.55k stars 6.46k forks source link

CONFIG_NO_OPTIMIZATION triggers a usage fault #12820

Closed SebastianBoe closed 5 years ago

SebastianBoe commented 5 years ago

Describe the bug After commit 251d99132d9aaabffbcf4b29fa8b273697f1f2d9 enabling CONFIG_NO_OPTIMIZATIONS will cause a usage fault to occur in samples/bluetooth/peripheral_hr.

To Reproduce Steps to reproduce the behavior:

  1. git checkout 251d99132d9aaabffbcf4b29fa8b273697f1f2d9 (or master, but this is the first bad commit)
  2. cd $ZEPHYR_BASE/samples/bluetooth/peripheral_hr
  3. echo CONFIG_NO_OPTIMIZATIONS=y >> prj.conf
  4. cmake -DBOARD=nrf52840_pca10056 -Boe -H. -GNinja
  5. ninja -C oe flash
  6. See Usage fault

Expected behavior Don't see usage fault.

Screenshots or console output

***** Booting Zephyr OS zephyr-v1.13.0-1194-g251d99132d *****
[00:00:00.008,819] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,850] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,850] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 99
[00:00:00.009,674] <wrn> bt_hci_core: bt_read_static_addr: No static addresses stored in controller
***** USAGE FAULT *****
  Illegal load of EXC_RETURN into PC
***** Hardware exception *****
Current thread ID = 0x20000388
Faulting instruction address = 0x1
Fatal fault in thread 0x20000388! Aborting.
[00:00:00.064,178] <inf> bt_hci_core: Identity: c4:2d:45:90:6f:54 (random)
[00:00:00.071,533] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.081,268] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
Bluetooth initialized
Advertising successfully started

Environment (please complete the following information):

Additional context 251d99132d9aaabffbcf4b29fa8b273697f1f2d9 authored by @jhedberg broke something.

SebastianBoe commented 5 years ago

@galak : Since this is a regression, and since disabling optimizations is a very common use-case in embedded debugging, I wouldn't prioritize this with 'low' priority.

jhedberg commented 5 years ago

@SebastianBoe all commit 251d991 does is that it doesn't anymore add some extra to Bluetooth stack sizes when logging is enabled (due to the new deferred logging subsystem the stack footprint should be much smaller), i.e. the most probable cause of the issue is a stack overflow. Could you check which stack (possibly multiple) is in question? I wonder if we should just leave this to app developers to set appropriate stack sizes in their prj.conf, or if there should be some conditions built into Kconfig and CONFIG_NO_OPTIMIZATION that influence the default stack sizes. I'm a bit cautious about the latter since there could be many things that influence the stack consumption, i.e. the conditionals may end up becoming quite complex if we try to support all possible scenarios.

SebastianBoe commented 5 years ago

I will check which stack is in question, though I do not know the way.

jhedberg commented 5 years ago

@SebastianBoe the most user friendly way to check stack consumption (that I've found) is using the "kernel stacks" shell command together with the following Kconfig options:

CONFIG_THREAD_STACK_INFO=y
CONFIG_THREAD_MONITOR=y
CONFIG_KERNEL_SHELL=y
CONFIG_INIT_STACKS=y
CONFIG_THREAD_NAME=y

Though the pre-requisite is that you first have sufficiently big stacks. I don't think you need to increase each one of them though - looking at your logs it seems likely that the culprit is the Bluetooth RX stack, i.e. CONFIG_BT_RX_STACK_SIZE needs increasing.

SebastianBoe commented 5 years ago

Thank you, will give it a go.

jhedberg commented 5 years ago

@SebastianBoe btw, instead of using peripheral_hr it might be quicker to start with tests/bluetooth/shell. Note that unlike the non-shell samples, you need to issue "bt init" to initialize Bluetooth

jhedberg commented 5 years ago

@SebastianBoe assuming you'll be able to narrow this down, and it's indeed a stack overflow, do you have any preferred ways to fix it? I.e. should we have the needed conditions in Kconfig or #ifdefs in the code, or should this simply be left up to app developers?

SebastianBoe commented 5 years ago

Dynamically sized stacks? ;)

I don't think there is a silver bullet, so I think we need to do a combination of;

improve stack protection (are we HW watching the memory area that follows the currently active stack?).

improve error reporting (Are we giving thread addresses, whilst thread names are known?)

improve Kconfig guesswork (Are our guesses reasonable, given the known configuration?)

But I don't know, I've been lucky enough to not have to debug too many blown stacks.

SebastianBoe commented 5 years ago

@jhedberg : For future reference.

Measuring the stack sizes wasn't very helpful, because the stack had already overflowed by the time I had time to check the sizes.

But this option gave me the thread that was corrupting the other threads:

CONFIG_STACK_SENTINEL

it produced this log output:

***** Stack Check Fail! *****
Current thread ID = 0x2000046c
Faulting instruction address = 0x347be

When looking up the thread ID in the zephyr.map file you get:

.bss.tx_thread_data

and when looking up the instruction in the zephyr.lst file you get:

void radio_rx_enable(void)
{
   347ba:   b580        push    {r7, lr}
   347bc:   af00        add r7, sp, #0
    nrf_radio_task_trigger(NRF_RADIO_TASK_RXEN);
   347be:   2004        movs    r0, #4
   347c0:   f7ff fdce   bl  34360 <nrf_radio_task_trigger>
}

which indicates that is a stack related to BT TX. But it's not clear to me how a user should find the Kconfig option BT_HCI_TX_STACK_SIZE based on this information.

I found it by cleaning up the offending commits history, and bisecting it until it was clear that changing BT_HCI_TX_STACK_SIZE caused the corruption.

Perhaps if the kernel knew both the thread name, and the Kconfig option that is controlling the thread's size, we would have something useful.

jhedberg commented 5 years ago

@SebastianBoe that's a good analysis, thanks, and I agree that it'd be nice if the kernel would be able to give more useful info on this kind of stuff, especially since it should know most of the info (except the Kconfig option, for now). I didn't know of CONFIG_STACK_SENTINEL previously - a good find! I wonder if all these tricks and options are documented anywhere? At least the options I mentioned I had to find by trial-and-error (which could also mean that I'm bad at searching for documentation :)

Btw, I suspect there might be other stacks in your configuration that are prone to overflow with CONFIG_NO_OPTIMIZATION, it's just that the TX thread is the first one to fail.

jhedberg commented 5 years ago

wonder if all these tricks and options are documented anywhere?

And by that I mean something more than the Kconfig help text. There could be a dedicated section in the debugging documentation about how to best analyze stack overflows and stack usage.

SebastianBoe commented 5 years ago

Calculating stack sizes is a difficult problem.

I'll fix it for this particular case, just as GCOV did[0], and leave it to the next guy to find a general solution.

[0] https://github.com/zephyrproject-rtos/zephyr/commit/e908ea9aa51ba7f88dff7dd610076b4dc352d3cf