zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Breakpoints in Bluetooth IPSP cause kernel oops #11120

Closed sectyler closed 5 years ago

sectyler commented 5 years ago

I am trying to use and debug TLS/DTLS sockets over Bluetooth using the nrf52840_pca10056 development board but am unable to hit breakpoints without crashing the kernel. I have reproduced the problem using the IPSP example provided with Zephyr but have been unable to correct it. The problem occurs when debugging in CLI as well as using Eclipse. Any insight on correcting or working around the issue would be greatly appreciated!

Steps to recreate

Development OS: Ubuntu 16.04
Board: nrf52840_pca10056
Commit: 2d7825976a988d110d91a3bc11423784f9d1fdcc
Zephyr SDK Version: 0.9.5

  1. I built and ran the IPSP example project with no source code modifications using the commands:

    $ cmake -G"Unix Makefiles" -DBOARD=nrf52840_pca10056 ~/zephyr/samples/bluetooth/ipsp
    $ make -j4
    $ make debug
    (gdb) c
  2. I created the Bluetooth 6LoWPAN interface according to the IPSP Sample README:

    $ sudo su
    # modprobe bluetooth_6lowpan
    # echo 1 > /sys/kernel/debug/bluetooth/6lowpan_enable
    # echo "connect <bdaddr> 2" > /sys/kernel/debug/bluetooth/6lowpan_control
    # ip address add 2001:db8::2/64 dev bt0
  3. At this point I can reliably use the echo server using the command:

    $ nc 2001:db8::1 4242
  4. The problem arises when pausing the debugger by pressing CTRL-C in GDB or hitting a breakpoint once the Bluetooth interface is up (e.g. set a breakpoint at main.c:312). When one of these happens:

    • bluetoothctl shows that the device is no longer paired or connected
    • the bt0 interface is no longer present
    • upon resuming execution a "Kernel OOPS!" message appears in the console with faulting instruction address 0x131aa (~/zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c:8230)
  5. At this point I am forced to restart the board and recreate the bt0 interface to use the application again.

pfalcon commented 5 years ago

What about other samples?

sectyler commented 5 years ago

I have not worked much with the other Bluetooth samples, but trying it with the scan_adv sample resulted in similar behavior.

[00:00:18.176,116] <err> bt_ctlr_llsw_ctrl.event_scan_prepare: assert: '!_radio.ticker_id_prepare' failed
***** Kernel OOPS! *****
Current thread ID = 0x20000ef0
Faulting instruction address = 0x7d1a
Fatal fault in ISR! Spinning...

This faulting address corresponds to ~/zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c:6437

pfalcon commented 5 years ago

> I have not worked much with the other Bluetooth samples, but trying it with the scan_adv sample resulted in similar behavior.

That's not quite what I meant. Can you confirm that there is any sample that behaves the way you expect in this regard? We need clear information on that to rule out user error or a systematic tool error. Please try hello_world or a similarly simple sample.

Anyway, I can imagine what is happening: when you hit a breakpoint, the hardware continues to run, and it runs for "too long"; when the software is resumed, it finds the hardware in a state inconsistent with its expectations and crashes. I'm not sure whether it is possible or makes sense to do anything about that, but let the BT people decide.

@carlescufi: Can you see if this goes to nrf52 people or to BT people and assign it accordingly?

carlescufi commented 5 years ago

@sectyler As @pfalcon mentions, I don't think this is a bug; let me explain. Bluetooth requires hard real-time timing intervals to be respected. If they are not, guards in the code trigger a kernel oops to warn the developer that something running in the background delayed ISR execution beyond what the Bluetooth specification allows. This is important because in some cases the application might be disabling interrupts for too long, and this needs to be communicated to the user somehow (here, via a kernel oops). In general there is nothing we can do about this: when you trigger a breakpoint you are essentially stopping code execution while letting time run free, and that is simply not compatible with hard real-time.
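To illustrate the kind of guard described above, here is a minimal sketch in C. It is not the Zephyr controller's code: `deadline_missed`, the tick values, and the slack are hypothetical names and numbers chosen for illustration, and a free-running 32-bit tick counter at 32768 Hz is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical deadline guard (illustration only, not Zephyr's code).
 *
 * `scheduled` is the tick at which a radio event was due, `now` is the
 * tick at which the ISR actually ran, and `slack` is the lateness the
 * protocol can tolerate. All three are readings of the same free-running
 * 32-bit counter, which keeps counting while the CPU is halted.
 */
static bool deadline_missed(uint32_t scheduled, uint32_t now, uint32_t slack)
{
    /* Unsigned subtraction stays correct across counter wraparound. */
    uint32_t late_by = now - scheduled;

    /* A difference in the upper half of the range means `now` is
     * before `scheduled`, so the event is early, not late. */
    if (late_by >= 0x80000000u) {
        return false;
    }

    return late_by > slack;
}
```

With a slack of 50 ticks, an event due at tick 1000 and serviced at tick 1010 passes the check. If the CPU sits at a breakpoint for two seconds (65536 ticks at the assumed rate), the same check fails on resume, and a real controller would turn that failure into an assert and a kernel oops.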

There might be something we can do, but only in certain cases. Note that there is also an upcoming refactor to the Link Layer that will allow a bit more flexibility when it comes to meeting hard real-time deadlines, but even then I would expect a breakpoint to cause an oops.

CC @cvinayak

sectyler commented 5 years ago

@carlescufi Thank you, your response is very informative and helpful. If I understand correctly, this means when trying to debug an application using Bluetooth I cannot effectively use a debugger like GDB and am limited to logs and print statements.

Are there other tools/techniques Bluetooth developers use to debug more effectively? I came across the problem trying to debug a TLS or DTLS echo server over Bluetooth.

Is there a more appropriate forum for this conversation (through Zephyr, Nordic, or elsewhere) since it is not a bug in Zephyr?

carlescufi commented 5 years ago

> Are there other tools/techniques Bluetooth developers use to debug more effectively? I came across the problem trying to debug a TLS or DTLS echo server over Bluetooth.

You can use GDB, but you are then limited to a single breakpoint; you cannot continue afterwards. There are no other real methods beyond LOG() (which works fine now that it is deferred) or tracing via pins. There is also RTOS tracing functionality that you can use with Segger SystemView.
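The reason deferred logging is compatible with tight timing can be sketched as follows: the time-critical path only copies a few words into a ring buffer, while formatting and output happen later in a low-priority context. This is an illustrative sketch, not Zephyr's logging implementation. `trace_buf`, `trace_record`, and `trace_drain` are hypothetical names, and the code assumes a single producer and a single consumer with no concurrent access (a real ISR-safe version would need atomics or interrupt locking).

```c
#include <stdbool.h>
#include <stdint.h>

#define TRACE_CAPACITY 8u /* must be a power of two for the index mask */

struct trace_event {
    uint32_t timestamp; /* tick at which the event was recorded */
    uint16_t id;        /* application-defined event identifier */
    uint16_t arg;       /* small payload */
};

struct trace_buf {
    struct trace_event events[TRACE_CAPACITY];
    uint32_t head; /* total number of events written */
    uint32_t tail; /* total number of events drained */
};

/* Hot path: constant time, no formatting, no blocking.
 * Returns false (and drops the event) when the buffer is full. */
static bool trace_record(struct trace_buf *tb, uint32_t ts,
                         uint16_t id, uint16_t arg)
{
    if (tb->head - tb->tail >= TRACE_CAPACITY) {
        return false;
    }

    struct trace_event *e = &tb->events[tb->head & (TRACE_CAPACITY - 1u)];
    e->timestamp = ts;
    e->id = id;
    e->arg = arg;
    tb->head++;
    return true;
}

/* Low-priority context: pops the oldest event so it can be formatted
 * and printed. Returns false when nothing is pending. */
static bool trace_drain(struct trace_buf *tb, struct trace_event *out)
{
    if (tb->tail == tb->head) {
        return false;
    }

    *out = tb->events[tb->tail & (TRACE_CAPACITY - 1u)];
    tb->tail++;
    return true;
}
```

A zero-initialized `struct trace_buf` is ready to use. The design choice is to drop events when the buffer is full rather than block, since blocking in the time-critical path would reintroduce the delay the scheme is meant to avoid.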

> Is there a more appropriate forum for this conversation (through Zephyr, Nordic, or elsewhere) since it is not a bug in Zephyr?

No, this is the right place. I will mark this issue as a question.

carlescufi commented 5 years ago

CC @mike-scott who has experience debugging IPSP

carlescufi commented 5 years ago

Closing, since this is expected behavior.