zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.86k stars 6.62k forks source link

Bluetooth connection timeout #14235

Closed neha0119 closed 5 years ago

neha0119 commented 5 years ago

Problem After connecting to board over bluetooth, the subscription fails and hence disconnect happens.

Set up - We are using custom board which has two chip configuration for BLE stack. The controller being Laired connectivity BL651 chip which has nRF52810 MCU, it runs zephyr with HCI ( hci_uart) controller stack.

zephyr version is 1.13.99 On central side we use Bluegiga BLE module with Python scripts on Windows host.

To Reproduce Build the host side peripheral example zephyr image and hci_uart example for the controller side. cd hci_uart mkdir build cmake -GNinja -DBOARD=nrf52_pca10040 .. ; ninja

For peripheral cd peripheral mkdir build cmake -GNinja -DBOARD=myboard .. ; ninja

Expected behavior The central should be able to read all the characteristics of the board. However, it turns out that it fails while fetching and receives a EVT_DISCONNECT.

Central console output At Central (python on windows machine) error - DEBUG:bgapi.py:Expecting a response of one of [<ResponsePacketType.attclient_find_information: 41>] within 1.000000s DEBUG:bgapi.py:Received a ResponsePacketType.attclient_find_information packet: Success DEBUG:bgapi.py:Expecting a response of one of [<EventPacketType.attclient_procedure_completed: 17>] within 1.000000s DEBUG:bgapi.py:Received a EventPacketType.connection_disconnected packet: Success

jhedberg commented 5 years ago

Are you able to reproduce this using a different central, e.g. a mobile phone, or another instance of Zephyr?

jhedberg commented 5 years ago

Also, what is the exact version you're using? "1.13.99" doesn't help much, since that can be anything between 1.13 and the 1.14-rc1.

neha0119 commented 5 years ago

Yes. We have tried with BLE scanner app on Phone and with python scripts on the Windows machine. Unable to connect with both.

We had taken SHA ID "32cbc42ce620783ae6f14d9affe68b1a5cbe21a1" To check the exact tag I ran git describe --tags 32cbc42ce620783ae6f14d9affe68b1a5cbe21a1 warning: tag 'zephyr-v1.13.0' is really 'v1.13.0' here zephyr-v1.13.0-3320-g32cbc42ce6

neha0119 commented 5 years ago

Ok .. will add little more to the above description. BLE scanner app does discover our board. So when I click on the connect button ... After some time the app says "time out" However on the board, we see "Connected and then Disconnected (reason 8)" prints.

Similarly with the python script. We do get the see the connect successful and then script sends fetch characteristics - where it fails and disconnect happens.

neha0119 commented 5 years ago

Hi,

Our further analysis - At the central side, When the python script starts fetching the Characteristics it sends the Request as attclient_find_information to which the response is successful. This command is followed by Bluetooth attclient_procedure_completed to which the response is [128, 3, 3, 4, 0, 8, 2] which means disconnect. Hence, it means that controller issued a response to disconnect instead of the version_ind_packet

Going through the version_ind_send code in blue tooth, it sends an empty packet and I think waits for the next connection interval to send the actual data. The central has predefined expectation of data and hence is not ack-ing? to which the peripheral sends the disconnect event.

Is this the right direction we are in ??

Really looking forward to a reply on it. Thank you so much !

neha0119 commented 5 years ago

Another update - Checked with other centrals also and tried to sniff the packet. What we understood is that moment after connect happens, in a while disconnect event with error 0x0208 is sent.

Also observed one thing, addr_type set in our board is Random. ( this thing we noticed while we were comparing with other logs).

neha0119 commented 5 years ago

HI, Any inputs on this please???

Vudentz commented 5 years ago

@neha0119 any chance to get the HCI traffic on either end? You can also enable GATT/ATT debug options if you think this is related to GATT/ATT layers, but it seems to be related to the ACL disconnecting for some reason, you could perhaps enable conn logs if you can't find a way to record the HCI traffic in Windows.

neha0119 commented 5 years ago

Zephyr_host.log

Thanks for replying ... PFA the host logs We could capture very minimum HCI traffic with the BLUEgiga BLE graphical Tool Interface. Below is the BLUEGIGA log when I click connect.


2019.03.11 20:35:22.214 ble_cmd_gap_connect_direct address: e9 d4 b0 7a b9 58 addr_type: 1 (0x01) conn_interval_min: 40 (0x0028) conn_interval_max: 56 (0x0038) timeout: 1000 (0x03e8) latency: 0 (0x0000) 2019.03.11 20:35:22.221 TX: 000f0603e9d4b07ab9580128003800e8030000

2019.03.11 20:35:22.226 ble_rsp_gap_connect_direct result:0x0000 ['No Error'] connection_handle: 0 (0x00) 2019.03.11 20:35:22.227 RX: 00030603000000

2019.03.11 20:35:22.268 ble_evt_connection_status connection: 0 (0x00) flags: 5 (0x05) address:58:b9:7a:b0:d4:e9 address_type: 1 (0x01) conn_interval: 42 (0x002a) timeout: 1000 (0x03e8) latency: 0 (0x0000) bonding: 255 (0xff) 2019.03.11 20:35:22.275 RX: 801003000005e9d4b07ab958012a00e8030000ff

2019.03.11 20:35:32.341 ble_evt_connection_disconnected connection: 0 (0x00) reason:0x0208 ['Link supervision timeout has expired.'] 2019.03.11 20:35:32.341 RX: 80030304000802


neha0119 commented 5 years ago

@Vudentz One more observation While the board boots, I get this message. Does this mean ACL flow control from controller to host not supported? bt_hci_core: Controller to host flow control not supported

bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) bt_hci_core: HW Variant: nRF52x (0x0002) bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 99 bt_hci_core: Identity: c3:91:9c:4f:d1:bb (random) bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1 bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
jhedberg commented 5 years ago

@neha0119 it means you've enabled CONFIG_BT_HCI_ACL_FLOW_CONTROL=y in your app for no good reason (it's neither needed nor supported with a combined host + controller build)

jhedberg commented 5 years ago

@neha0119 actually, for a split build it may be useful, however the Zephyr controller doesn't support it. Either way, I think you can just disable that for your app.

neha0119 commented 5 years ago

DISABLED!

Vudentz commented 5 years ago

2019.03.11 20:35:32.341 ble_evt_connection_disconnected connection: 0 (0x00) reason:0x0208 ['Link supervision timeout has expired.'] 2019.03.11 20:35:32.341 RX: 80030304000802

So you are getting a link supervision timeout, which is normally some sort of problem with hardware/antenna setup. Though contrary to the log you pasted I can see some GATT traffic, or perhaps the HCI logs doesn't show ACL traffic? Anyway I don't see any subscription just a discovery going on (something worth noting that the discover seems to be using Find Information request instead of starting discovering services):

[00:00:43.519,237] bt_att.bt_att_recv: Received ATT code 0x04 len 5 [00:00:43.519,248] bt_att.att_find_info_req: start_handle 0x0001 end_handle 0xffff [00:00:43.519,285] bt_att.find_info_cb: handle 0x0001 [00:00:43.519,294] bt_att.find_info_cb: handle 0x0002 [00:00:43.519,313] bt_att.find_info_cb: handle 0x0003 [00:00:43.519,322] bt_att.find_info_cb: handle 0x0004 [00:00:43.519,333] bt_att.find_info_cb: handle 0x0005 [00:00:43.519,352] bt_conn.bt_conn_send_cb: conn handle 0 buf len 26 cb 0x0000b225

And then it disconnects after having received just one request, Find Information, and transmitting a response (which probably never arrived).

neha0119 commented 5 years ago

ConnLogs.txt My apologies, I enabled the other debug config. FA the DEBUG_CONN enabled prints.

To your point- And then it disconnects after having received just one request, Find Information, and transmitting a response (which probably never arrived). The prints at the Windows machine does state that "Received a ResponsePacketType.attclient_find_information packet: Success" Then immediately the python script @windows says "Expecting a response of one of [<EventPacketType.attclient_procedure_completed: 17>" to this, no response is seen. Instead an event - ble_evt_connection_disconnected connection: 0 (0x00) is observed.

So this is in partial sync with your observation. This also means its not a hardware issue, as the first response has reached central. But the question now becomes What happens with the 2nd request? Board received it but controller is in some state? Hence it doesn't send to host. Instead send disconnect event to the Central...

Or Host has asked to close the connection (for some reason) and controller sends that to the Windows machine to close it?

why connection_disconnection event is being generated by the board and not the expected response/event?

neha0119 commented 5 years ago

The working case Script logs on windows

Expecting a response of one of [<ResponsePacketType.attclient_find_information: 41>] within 1.000000s Received a ResponsePacketType.attclient_find_information packet: Success Expecting a response of one of [<EventPacketType.attclient_procedure_completed: 17>] within 10.000000s Received a EventPacketType.connection_version_ind packet: Success Received a EventPacketType.attclient_find_information_found packet: Success Received a EventPacketType.attclient_find_information_found packet: Success Received a EventPacketType.attclient_find_information_found packet: Success

neha0119 commented 5 years ago

Now I have reduced my setup with just my board and the BLuegiga BLE Graphical User Interface Tool (as Central). My board is discover able by Bluegiga Interface tool and now I click the CONNECT button. ( Logs below are the ones shown by the Bluegiga tool right after the CONNECT button is clicked)

2019.03.12 17:08:08.403 ble_cmd_gap_connect_direct address: bb d1 4f 9c 91 c3 addr_type: 1 (0x01) conn_interval_min: 60 (0x003c) conn_interval_max: 76 (0x004c) timeout: 100 (0x0064) latency: 0 (0x0000) 2019.03.12 17:08:08.410 TX: 000f0603bbd14f9c91c3013c004c0064000000

2019.03.12 17:08:08.414 ble_rsp_gap_connect_direct result:0x0000 ['No Error'] connection_handle: 0 (0x00) 2019.03.12 17:08:08.416 RX: 00030603000000

2019.03.12 17:08:08.451 ble_evt_connection_status connection: 0 (0x00) flags: 5 (0x05) address:c3:91:9c:4f:d1:bb address_type: 1 (0x01) conn_interval: 60 (0x003c) timeout: 100 (0x0064) latency: 0 (0x0000) bonding: 255 (0xff) 2019.03.12 17:08:08.457 RX: 801003000005bbd14f9c91c3013c0064000000ff

2019.03.12 17:08:09.574 ble_evt_connection_disconnected connection: 0 (0x00) reason:0x0208 ['Link supervision timeout has expired.'] 2019.03.12 17:08:09.574 RX: 80030304000802

------------ After one response and an event, board sends disconnect event.

Whereas the working case (which is BLE scanner app as advertiser and Bluegiga Tool(As Central)) sends ble_evt_connection_version_ind event right after ble_evt_connection_status connection event.

Vudentz commented 5 years ago

Or Host has asked to close the connection (for some reason) and controller sends that to the Windows machine to close it?

The host never set that error thought, so I guess this is not the host doing at least.

why connection_disconnection event is being generated by the board and not the expected response/event?

Well perhaps something is not right with the link so it timed out for some reason, you could perhaps try with another board or qemu on Linux to check if that is not a problem with the board or the controller.

cvinayak commented 5 years ago

@neha0119 could you please check if the BL651 module has a 32KHz external crystal? If you are using nRF52_pca10040 build on BL651, then it is expected to use external crystal for maintaining the BLE connection.

If BL651 does have a crystal, then you need to verify the crystal accuracy, and use the correct value in the Zephyr build, else the connections will drift leading to eventual disconnection.

neha0119 commented 5 years ago

Hello @cvinayak No we don't have external 32Khz external crystal. As per the BL651 module, the external crystal is optional. Hence we had not mounted it on out board. We do have enabled the two configs in the zephyr. CLOCK_CONTROL_NRF_K32SRC_RC and CLOCK_CONTROL_NRF_K32SRC_500PPM But still there are disconnections all the time.

You mentioned that "it is expected to use external crystal with nRF52_pca11040 build" .. but we have not mounted it, is there any workaround?

cvinayak commented 5 years ago

@neha0119 as there is support for BL652 board, could you check the difference in Kconfig with that board? also, have you tired using the latest upstream/master hci_uart or v1.14.0-rc1? if you are on slack please ping me there for quicker discussions.

neha0119 commented 5 years ago

Hi @cvinayak We tried those two patches which you had shared. But with that also we are getting disconnect.

cvinayak commented 5 years ago

@neha0119 with https://github.com/zephyrproject-rtos/zephyr/pull/11552 applied, you had the below or the default of 0?

CONFIG_BT_CTLR_MASTER_SCA_EXTRA=500

If our timezones match, we should have a screen sharing session wherein we can debug live.

carlescufi commented 5 years ago

@neha0119 have you managed to test the suggested changes from @cvinayak?

carlescufi commented 5 years ago

Thanks @neha0119. Closing this issue then.