BT controller occasionally generates imprecise data abort

nashif commented 7 years ago

Reported by Michael Scott:

Using Nitrogen and BT gateway hardware running 4.11.0 kernel, I am running a BT 6lowpan Zephyr application which sends MQTT traffic every 5 seconds.

Eventually see: BUS FAULT Executing thread ID (thread): 0x2000eff8 Faulting instruction address: 0x147e0 Imprecise data bus error Fatal fault in ISR! Spinning...

When I check addr2line I see that this is: zephyr/subsys/bluetooth/controller/util/memq.c:30 "*tail = link;"

If I hack together MPU support for nRF52 with the attached patch, then I actually see an invalid memory read a bit earlier in "memq_enqueue": MPU FAULT Executing thread ID (thread): 0x2000f058 Faulting instruction address: 0x147da Data Access Violation Address: 0xe12fc55b Fatal fault in ISR! Spinning...

Using addr2line shows: zephyr/subsys/bluetooth/controller/util/memq.c:24: "*((void *)tail) = link;"

When I trace where memq_enqueue is being called from I show: zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c: "packet_rx_enqueue" and the parameters being passed in 2 successive calls:

packet_rx_enqueue: mem:0x20003b30, link:0x23b19c83, &link_rx_tail:0x200036d8, link_rx_tail:0x20003b60 packet_rx_enqueue: mem:0x20003920, link:0x20003b80, &link_rx_tail:0x200036d8, link_rx_tail:0x23b19c83

In the first call we have an invalid link address being added onto the link_rx_tail (doesn't cause an abort). And then the next call we add a valid link address onto the end of the invalid link_rx_tail (causes the abort).

(Imported from Jira ZEP-2122)

nashif commented 7 years ago

by Michael Scott:

0010-HACK-arm-soc-nordic-nRF52-Add-MPU-support-for-96b_ni.patch (size: 5481)

nashif commented 7 years ago

by Michael Scott:

Hi Vich,

I've also added a debug_check function at the top of isr_conn_rx(). It checks for packet_rx[0..max_count].hdr.onion.link values outside of SRAM and if it finds 1 will print the index # etc. Seemingly randomly, I'll get an index with an invalid link address after some unknown process happens and I'll start spewing debug output.

I'm going to force an abort there when I catch it and try and do a gdb backtrace to see where it's coming from. But so far the point of link address corruption has been quite elusive.

nashif commented 7 years ago

by Vinayak Kariappa Chettimada:

Are you on encrypted connection? Could you please use 251 bytes in DLE feature in the Controller kconfig. This will space rx buffers apart enough to rule our buffer overruns. I have a theory but need your results with 251 byte

nashif commented 7 years ago

by Carles Cufi:

Are you building the hci_uart sample?

nashif commented 7 years ago

by Michael Scott:

I am building a custom Zephyr app which uses TCP, MQTT and 6lowpan: https://github.com/linaro/zephyr-fota-hawkbit

We use this with a master branch based zephyr here: https://github.com/linaro/zephyr (This is really a dev branch to hold our pre-upstream patches)

nashif commented 7 years ago

by Michael Scott:

Not an encrypted connection. I'll run some tests with CONFIG_BLUETOOTH_CONTROLLER_DATA_LENGTH_MAX=251 and let you know what I see.

nashif commented 7 years ago

by Mark Linkmeyer:

Vinayak Kariappa Chettimada , can you set the priority please? Thx.

nashif commented 7 years ago

by Michael Scott:

So I still get faults with the DLE set to 251. I'm wondering what stress tests are currently used for BT 6lowpan?

Right now it's a bit hard to setup our FOTA app for testing due to the various requirements of having a MQTT server, etc.

nashif commented 7 years ago

by Carles Cufi:

Michael Scott Thanks for all the info so far. It would be really great if you could describe the full setup used to reproduce this issue. That way we can try and reproduce it locally with an instrumented build and try to pinpoint the cause.

nashif commented 7 years ago

by Carles Cufi:

Michael Scott Also, since this is a memory overrun of sorts, it would be interesting to see if increasing the stack sizes helps.

I believe the current master settings are probably around:

CONFIG_MAIN_STACK_SIZE=1024
CONFIG_IDLE_STACK_SIZE=320
CONFIG_ISR_STACK_SIZE=2048
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=1024
CONFIG_OFFLOAD_WORKQUEUE_STACK_SIZE=1024
CONFIG_BLUETOOTH_RX_STACK_SIZE=1024
CONFIG_BLUETOOTH_HCI_TX_STACK_SIZE=640
CONFIG_BLUETOOTH_CONTROLLER_RX_PRIO_STACK_SIZE=320

It would be nice if you could try doubling those numbers and trying to reproduce.

nashif commented 7 years ago

by Michael Scott:

Part of the problem (which may actually be the cause) is that Nitrogen has 64K SRAM and our current app uses almost all of it. I can't enable much more in the CONFIGs. Let me see if I can turn on stack debugging to see if any of the stacks are close to limit.

nashif commented 7 years ago

by Michael Scott:

Update to this issue:

I worked up a function which iterates through all of the radio memory lists as well as validates every link pointer looking for invalid memory references (signs of corruption). And then I littered the isr functions with calls to it to try and see if I could pin point areas in the radio code where the corruption was happening. (Anytime the function hits corruption, it generates an abort). Then, I added a cache of the last valid values and dump comparisons of memory to see where it was getting trashed.

I found that the result was really random and could happen just about anywhere in the radio memory.

After some research I located the "static u8_t MALIGN(4) _radio[LL_MEM_TOTAL];" declaration in subsys/bluetooth/controller/ll_sw/ll.c which establishes the memory passed into the radioinit() function. Above that declaration are several ticker* variable declarations. On a lark I moved the radio_memory var up above the others, just to see if it would have any affect in how often I was seeing crashes (and eliminate them as potential overrun sources, etc): https://hastebin.com/onaqevoxaf.cs

Oddly this seems to have fixed the issue I was seeing. Currently, I have 1x PCA10056 and 3x 96b_Nitrogens running 5 second MQTT pulses via BT 6lowpan to a GW talking to IBM Bluemix for upwards of 30 minutes straight. Until now, I would have crashed in under 3-4 minutes.

I haven't had time to look at what uses the ticker_* variables or how it could be causing the corruption I was seeing yet.

NOTE: I could be totally jumping the shark on this one. I've been staring at the same set of code for quite a while. So, don't take this too seriously till I debug further.

nashif commented 7 years ago

by Michael Scott:

More testing and it does look like I completely jumped the shark.

I was testing a new Zephyr rebase where we merged in updates since 4/28 and it looks like this issue was fixed by one of the recent commits. Not sure which.

nashif commented 7 years ago

by Michael Scott:

Correction: I had changes in my local files from late last night. Those are what's fixing my issue. I'll submit them upstream for review.

nashif commented 7 years ago

by Michael Scott:

Pull request is here: https://github.com/zephyrproject-rtos/zephyr/pull/134

nashif commented 7 years ago

by Vinayak Kariappa Chettimada:

Michael Scott , I have review the PR, but all that the changes do is to have unused bytes in each buffer when default PDU size on a connection is 27 bytes, which in turn leads to _radio.packet_rx_data_count becoming a constant. Did the issue disappear with your changes when using 27 bytes PDU sizes or 251 byte PDU sizes?

(side note: since same buffer is used for adv reports, max. of adv report and data pdu size is used as the minimal default allocation size, this design will change when LE Adv Extensions is implemented; plan to have separate pool for advertisement rx).

nashif commented 7 years ago

by Michael Scott:

I wouldn't say that's all they do. I will drop the 2 patches which lock in the rx packet count as these were changes I had made to get DLE working prior to realizing I was getting a 0 in the length request.

Also, I think checking the length request values for 0 isn't quite good enough. We should probably make sure that the max values are greater or equal to the default minimum (27).

nashif commented 7 years ago

by Vinayak Kariappa Chettimada:

I was referring to the patch related to buffer size calculations.

And yes, DLE related ones I agree, check for greater than default is required. Curious, which peer is sending 0 as tx octets?

nashif commented 7 years ago

by Michael Scott:

Vinayak Kariappa Chettimada We're seeing this coming from a Linux 4.11 gateway device. It does raise the question of a possible BlueZ bug. (Tho, I think the checks here are still good to have.)

nashif commented 7 years ago

by Michael Scott:

Fix has been merged.

nashif commented 7 years ago

by Vinayak Kariappa Chettimada:

Fixed in 5699af2af06f9bca9c1a7a44518b95b97794216b

zephyriot / zephyr-issues

BT controller occasionally generates imprecise data abort #1963