Open Thalley opened 1 year ago
@rugeGerritsen This is the host issued I mentioned at the UPF when receiving many scan reports.
@Thalley, thanks for creating this issue. Unfortunately, I think there is no way we can know what extended advertising reports we can discard, and that is also true for reports with data status set to BT_HCI_LE_ADV_EVT_TYPE_DATA_STATUS_COMPLETE
. The problem is that we don't know if such a report is both first and last or only last.
Example: We receive advertising reports from one advertiser. It generates three sets of reports, A, B, and C. Each of the sets contain three partial reports (A1, A2, A3, B1, ...).
Let's now say that we receive them all in order. In that case A3, B3 and C3 will have its data status set to BT_HCI_LE_ADV_EVT_TYPE_DATA_STATUS_COMPLETE
.
Discarding A3 will result in invalid reassembly as it will concatenate A and B.
I believe that this is an issue in the spec. It restricts the architecture of the system to have a host that polls for events rather than a controller pushing events. This is especially an issue for systems where the controller and host are not running on the same core or device.
If we discard any An we must discard all previously reassembled reports from A (A1, ... , A(n-1)) and keep discarding until we get a report from A with data_status==complete
. However, to do this we sort of have to do the reassembly anyway. The current implementation should not reassemble reports from different advertisers, but we will run into this warning, and future reports will be incomplete.
This is basically a performance issue, and thus not marked as a bug.
I think we can upgrade this to a bug because it sounds quite serious. The trigger is external (other devices advertising a lot) and it results in a unresponsive device.
This is basically a performance issue, and thus not marked as a bug.
I think we can upgrade this to a bug because it sounds quite serious. The trigger is external (other devices advertising a lot) and it results in a unresponsive device.
That is a good point. Basically it allows other devices to DoS the Zephyr host.
@Thalley I had a quick pass through the code and don't really understand how the queued events would block the host.
bt_recv
has a check for BT_HCI_EVT_FLAG_RECV_PRIO
and for command complete/status, that flag will be set, and thus the event will be processed in-place rather than placed in the queue (with rx_queue_put(buf)
).
Or is that an issue of the (app core) CPU just being overloaded?
What I think is that the behavior you saw was caused by the RPMsg driver effectively halting the IPC (for 10 full seconds) when it didn't get a buffer for the (non-discardable) ext adv report. See this particular line. That means that the command complete/status was being blocked in the controller rather than the host.
I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded
pool).
It's not perfect, but I think rubin is right, this is a spec failure, so that's pretty much the only thing we can do in the host without adding yet another buffer pool.
I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).
This can lead to the start of adv reports missing, making the adv data fields unparsable.
I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).
This can lead to the start of adv reports missing, making the adv data fields unparsable.
How so, in my mind if you only ever have one that re-assembled, then it should be fine? The other partial adv reports will get discarded.
I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).
This can lead to the start of adv reports missing, making the adv data fields unparsable.
How so, in my mind if you only ever have one that re-assembled, then it should be fine? The other partial adv reports will get discarded.
Do we still discard if we are not currently reassembling anything? That was the case I was thinking of. Interleaving of incomplete adv reports is already not supported (and not needed for sdc, not sure about zll).
We won't discard if we are not reassembling anything no. I will also try first w/ the discrimination done on the first report of the event.
Later, I'll iterate on the reports in the event, discarding the ones I don't care about, effectively re-writing it. This will necessitate a scratch buffer to do so.
edit: yeah most of this is based on the fact that we already don't support interleaved reports.
What I think is that the behavior you saw was caused by the RPMsg driver effectively halting the IPC (for 10 full seconds) when it didn't get a buffer for the (non-discardable) ext adv report. See this particular line. That means that the command complete/status was being blocked in the controller rather than the host.
I've considered the RPMsg part to be the culprit here as well. I am unsure whether this issue would/could appear on e.g. the nRF52840, as I've only tested it with the nRF5340 which has this RPMsg layer between the controller and the host.
If the issue is indeed from the controller side (which in my particular case, is not Zephyr), then we may be powerless to fix this issue.
@Thalley so it's not really controller, but rather the message is stuck on the network core because we are stalling the IPC RX endpoint (in the app core). The stall happens because we are doing a blocking wait for event buffers, in a loop.
So I think we can definitely work around it (in the app core/host hci driver), which am trying right now. The w/a might be quite involved though, see the discussion I had this morning with herman above.
@Thalley so I have a workaround prototype but I'm a bit unable to reproduce the original issue of the "host" locking up.
Would you have a specific configuration or sample where you originally saw that? If you're using a custom network core image, that would also be helpful.
For reference, I'm using 5 52840 DKs each programmed with the multiple_broadcaster
sample, with 500B adv data and 10 SIDs, so I'd think that it'd be enough noise to trigger the behavior, but it doesn't, I'm still able to stop advertising in the nRF5340's shell. edit: without the workaround, I mean I can stop the scanner without needing the w/a.
@Thalley so I have a workaround prototype but I'm a bit unable to reproduce the original issue of the "host" locking up.
Would you have a specific configuration or sample where you originally saw that? If you're using a custom network core image, that would also be helpful.
I am actually at a UPF right now, where we have even more noise (more advertisers) than last time where this issue was seen, but I have not seen this issue a single time the entire week.
I am not quite sure what changed, but I cannot reproduce anymore either.
Then I'll just note down the WIP branch with the in-progress workaround (https://github.com/jori-nordic/zephyr/commits/ext-adv-debug) and close this issue as unreproducible.
Seen again at the UPF by @Thalley on the nRF53 platform (uses the IPC driver).
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
@jori-nordic @alwa-nordic @jhedberg I guess this will be fixed with the upcoming changes?
I guess this will be fixed with the upcoming changes?
I have rough sketch for a fix. The roadmap is:
ext_scan_buf
.bt_hci_le_adv_ext_report
to:
ext_scan_buf
, even if report is complete.sync_evt_pool
K_WAIT_FOREVER.Does this look like it would solve the problem?
Once this is completed, bt_hci_le_adv_ext_report
can be extended to keep statistics of dropped packets. And we could add an new type of isr-safe adv report callback that can be given complete reports even if the reassembly pool is full.
Does this look like it would solve the problem?
Assuming that
Ignore new reports if pool is full.
Does not discard partially received reports, then yes :) i.e. we have if received the first part of a report, we cannot discard any future ones for that one.
Bluetooth WG:
Is your enhancement proposal related to a problem? Please describe. In the case that we are doing scanning in a very busy environment (e.g. hundreds of extended advertising reports per second), then the host will receive so many events that it may effectively block the host from working, if it receives more reports than it can handle. Many reports may cause the events to be queued, and thus calling an HCI command (e.g.
bt_le_scan_stop()
) will timeout waiting for the return event.This is only an issue for extended advertising reports (BT_HCI_EVT_LE_EXT_ADVERTISING_REPORT) as legacy reports (BT_HCI_EVT_LE_ADVERTISING_REPORT) are usually marked as discardable, e.g. by
Due to the reassembly requirements of extended advertising, we cannot just discard all extended reports (we could potentially discard reports with the status
BT_HCI_LE_ADV_EVT_TYPE_DATA_STATUS_COMPLETE
). We could only discard the ones marked as legacy or complete. This is because if we did, the reassembler would run out of sync.This is basically a performance issue, and thus not marked as a bug.
Describe the solution you'd like Ideally we'd find some way of discard extended advertising reports, so that we can just discard them and keep the number of events queued small, thus eliminating the issue with HCI command events.
Describe alternatives you've considered A small work around is discard all advertising reports once
bt_le_scan_set_enable(BT_HCI_LE_SCAN_DISABLE);
has been called. It is safe to assume that the upper layers do not care about any queued reports once this has been called, and that may help to clear the queued reports.Additional context This was found doing UPF and was happening often as extended advertising was used extensively. The issue appeared using the nRF5340.