ble: gatt: db_hash_work runs for too long and makes serial communication fail

closatt commented 2 years ago

Environment Zephyr version: 2.5 Target: NRF52840

Bug description Just after registering a service, UART communication is blocked for more than 10ms, causing a communication failure. In the following image, the moment when the bus is interrupted is circled in red. As a result, the chip answers before the tx communication finishes.

Some investigation showed that the db_hash_work (gatt.c) execution is taking several ms, and blocks all serial communications during this time as sysworkq is very high priority:

On the SystemView capture below, we see that the sysworkq is interrupting our LPWAN thread (which is making UART communications) for 10ms.
By printing the work pointers, I saw that db_hash_work was executed just at the time of the bug. Setting a k_busy_wait(5000) in db_hash_process confirmed this, as the interruption time was now 15ms.

The execution time of this work depends of the number of services (in this case 13 services).

The only simple temporary workaround we found (which is not thread safe) is to add a k_sleep of several ms after each call to bt_gatt_service_register.

To Reproduce

Register a lot of services with bt_gatt_service_register
A bit less than 10ms after registering the last service, make a long UART communication

This communication failure is not easy to reproduce as it requires a particular timing. But even without reproducing the communication failure, the fact that db_hash_work lasts a long time can be easily seen with Segger SystemView (or even with timestamped logs placed at the beginning and at the end of db_hash_process function). For this, it is enough to register a lot of services (for example 15 with each 5 characteristics) with bt_gatt_service_register and capture the right moment, 10ms after the last service register. Maybe the execution time is completely different on another target than the NRF52840.

Expected behavior If this db_hash_process can be interrupted, it should maybe be executed in a low priority thread. If it cannot be interrupted, I think there should be at least a callback indicating the caller of bt_gatt_service_register that db_hash_process has finished and serial communications are safe. There are certainly better solutions that I didn't think of.

carlescufi commented 2 years ago

@thoh-ot assigned this to you following-up on the work you did on GATT hashing recently, but feel free to reassign. By the way, have you seen this on the Oticon side?

@closatt can you please try to reproduce with the latest Zephyr? And, if possible, provide a basic sample that can reproduce the problem.

carlescufi commented 2 years ago

@thoh-ot will you be able to take a look at this? Otherwise we'll try to find someone else to assign this to.

thoh-ot commented 2 years ago

@thoh-ot will you be able to take a look at this? Otherwise we'll try to find someone else to assign this to.

So I have been thinking a little bit about this - to me this seems like an architectural issue with using the system work queue for "long running" work while also servicing other constraints in the system.

As the system work queue executes the work in a cooperative thread with fairly high priority I think k_yield() in the db_hash_process() is the only short term solution to this problem.

I won't have the capacity to make a long term viable solution.

cvinayak commented 2 years ago

@jori-nordic could take a look into using a user work queue or pre-emptible threads to defer long duration operations/functions. Ex. BT ECC thread (hci_ecc.c).

jori-nordic commented 2 years ago

@closatt could you try reproducing the bug with #46131 checked out ? You can just do git fetch origin pull/46131/head:pr-46131 && git checkout pr-46131 to get the latest version.

closatt commented 2 years ago

@jori-nordic thanks for the patch and sorry for the delay. I have rebased and tested your commit on Zephyr 2.7 since I have no software running on the master. I had troubles with SystemView always crashing at init so I couldn't use it to test your patch. So I checked with logs, with lots of services registered, if a long UART communication from a low-priority thread was interrupted by the db_hash.work process. Without your patch, the UART communication was interrupted by the db_hash.work process. But with your patch, the UART communication finished without interruption before the db_hash.work process was launched. So it does solve the issue I mentionned.

jori-nordic commented 2 years ago

@closatt good to hear. Something else needed attention the past week, but I'll try to merge that patch this week.

zephyrproject-rtos / zephyr

ble: gatt: db_hash_work runs for too long and makes serial communication fail #43811