zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.1k stars 6.2k forks source link

Bluetooth: Host: Extended advertising reports may block the host #50786

Open Thalley opened 1 year ago

Thalley commented 1 year ago

Is your enhancement proposal related to a problem? Please describe. In the case that we are doing scanning in a very busy environment (e.g. hundreds of extended advertising reports per second), then the host will receive so many events that it may effectively block the host from working, if it receives more reports than it can handle. Many reports may cause the events to be queued, and thus calling an HCI command (e.g. bt_le_scan_stop()) will timeout waiting for the return event.

This is only an issue for extended advertising reports (BT_HCI_EVT_LE_EXT_ADVERTISING_REPORT) as legacy reports (BT_HCI_EVT_LE_ADVERTISING_REPORT) are usually marked as discardable, e.g. by

        if (rx.evt.evt == BT_HCI_EVT_LE_META_EVENT &&
            (rx.hdr[sizeof(*hdr)] == BT_HCI_EVT_LE_ADVERTISING_REPORT)) {
            BT_DBG("Marking adv report as discardable");
            rx.discardable = true;
        }

Due to the reassembly requirements of extended advertising, we cannot just discard all extended reports (we could potentially discard reports with the status BT_HCI_LE_ADV_EVT_TYPE_DATA_STATUS_COMPLETE). We could only discard the ones marked as legacy or complete. This is because if we did, the reassembler would run out of sync.

This is basically a performance issue, and thus not marked as a bug.

Describe the solution you'd like Ideally we'd find some way of discard extended advertising reports, so that we can just discard them and keep the number of events queued small, thus eliminating the issue with HCI command events.

Describe alternatives you've considered A small work around is discard all advertising reports once bt_le_scan_set_enable(BT_HCI_LE_SCAN_DISABLE); has been called. It is safe to assume that the upper layers do not care about any queued reports once this has been called, and that may help to clear the queued reports.

Additional context This was found doing UPF and was happening often as extended advertising was used extensively. The issue appeared using the nRF5340.

Thalley commented 1 year ago

@rugeGerritsen This is the host issued I mentioned at the UPF when receiving many scan reports.

rugeGerritsen commented 1 year ago

@Thalley, thanks for creating this issue. Unfortunately, I think there is no way we can know what extended advertising reports we can discard, and that is also true for reports with data status set to BT_HCI_LE_ADV_EVT_TYPE_DATA_STATUS_COMPLETE. The problem is that we don't know if such a report is both first and last or only last.

Example: We receive advertising reports from one advertiser. It generates three sets of reports, A, B, and C. Each of the sets contain three partial reports (A1, A2, A3, B1, ...).

Let's now say that we receive them all in order. In that case A3, B3 and C3 will have its data status set to BT_HCI_LE_ADV_EVT_TYPE_DATA_STATUS_COMPLETE.

Discarding A3 will result in invalid reassembly as it will concatenate A and B.

I believe that this is an issue in the spec. It restricts the architecture of the system to have a host that polls for events rather than a controller pushing events. This is especially an issue for systems where the controller and host are not running on the same core or device.

hermabe commented 1 year ago

If we discard any An we must discard all previously reassembled reports from A (A1, ... , A(n-1)) and keep discarding until we get a report from A with data_status==complete. However, to do this we sort of have to do the reassembly anyway. The current implementation should not reassemble reports from different advertisers, but we will run into this warning, and future reports will be incomplete.

alwa-nordic commented 1 year ago

This is basically a performance issue, and thus not marked as a bug.

I think we can upgrade this to a bug because it sounds quite serious. The trigger is external (other devices advertising a lot) and it results in a unresponsive device.

Thalley commented 1 year ago

This is basically a performance issue, and thus not marked as a bug.

I think we can upgrade this to a bug because it sounds quite serious. The trigger is external (other devices advertising a lot) and it results in a unresponsive device.

That is a good point. Basically it allows other devices to DoS the Zephyr host.

jori-nordic commented 1 year ago

@Thalley I had a quick pass through the code and don't really understand how the queued events would block the host.

bt_recv has a check for BT_HCI_EVT_FLAG_RECV_PRIO and for command complete/status, that flag will be set, and thus the event will be processed in-place rather than placed in the queue (with rx_queue_put(buf)).

Or is that an issue of the (app core) CPU just being overloaded?

What I think is that the behavior you saw was caused by the RPMsg driver effectively halting the IPC (for 10 full seconds) when it didn't get a buffer for the (non-discardable) ext adv report. See this particular line. That means that the command complete/status was being blocked in the controller rather than the host.

jori-nordic commented 1 year ago

I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).

It's not perfect, but I think rubin is right, this is a spec failure, so that's pretty much the only thing we can do in the host without adding yet another buffer pool.

hermabe commented 1 year ago

I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).

This can lead to the start of adv reports missing, making the adv data fields unparsable.

jori-nordic commented 1 year ago

I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).

This can lead to the start of adv reports missing, making the adv data fields unparsable.

How so, in my mind if you only ever have one that re-assembled, then it should be fine? The other partial adv reports will get discarded.

hermabe commented 1 year ago

I will try to implement a simple mitigation: only accept partial ext adv reports if they're from the advertiser being re-assembled. Any other partial ext adv reports will get discarded immediately (i.e. not put on the discarded pool).

This can lead to the start of adv reports missing, making the adv data fields unparsable.

How so, in my mind if you only ever have one that re-assembled, then it should be fine? The other partial adv reports will get discarded.

Do we still discard if we are not currently reassembling anything? That was the case I was thinking of. Interleaving of incomplete adv reports is already not supported (and not needed for sdc, not sure about zll).

jori-nordic commented 1 year ago

We won't discard if we are not reassembling anything no. I will also try first w/ the discrimination done on the first report of the event.

Later, I'll iterate on the reports in the event, discarding the ones I don't care about, effectively re-writing it. This will necessitate a scratch buffer to do so.

edit: yeah most of this is based on the fact that we already don't support interleaved reports.

Thalley commented 1 year ago

What I think is that the behavior you saw was caused by the RPMsg driver effectively halting the IPC (for 10 full seconds) when it didn't get a buffer for the (non-discardable) ext adv report. See this particular line. That means that the command complete/status was being blocked in the controller rather than the host.

I've considered the RPMsg part to be the culprit here as well. I am unsure whether this issue would/could appear on e.g. the nRF52840, as I've only tested it with the nRF5340 which has this RPMsg layer between the controller and the host.

If the issue is indeed from the controller side (which in my particular case, is not Zephyr), then we may be powerless to fix this issue.

jori-nordic commented 1 year ago

@Thalley so it's not really controller, but rather the message is stuck on the network core because we are stalling the IPC RX endpoint (in the app core). The stall happens because we are doing a blocking wait for event buffers, in a loop.

So I think we can definitely work around it (in the app core/host hci driver), which am trying right now. The w/a might be quite involved though, see the discussion I had this morning with herman above.

jori-nordic commented 1 year ago

@Thalley so I have a workaround prototype but I'm a bit unable to reproduce the original issue of the "host" locking up.

Would you have a specific configuration or sample where you originally saw that? If you're using a custom network core image, that would also be helpful.

For reference, I'm using 5 52840 DKs each programmed with the multiple_broadcaster sample, with 500B adv data and 10 SIDs, so I'd think that it'd be enough noise to trigger the behavior, but it doesn't, I'm still able to stop advertising in the nRF5340's shell. edit: without the workaround, I mean I can stop the scanner without needing the w/a.

Thalley commented 1 year ago

@Thalley so I have a workaround prototype but I'm a bit unable to reproduce the original issue of the "host" locking up.

Would you have a specific configuration or sample where you originally saw that? If you're using a custom network core image, that would also be helpful.

I am actually at a UPF right now, where we have even more noise (more advertisers) than last time where this issue was seen, but I have not seen this issue a single time the entire week.

I am not quite sure what changed, but I cannot reproduce anymore either.

jori-nordic commented 1 year ago

Then I'll just note down the WIP branch with the in-progress workaround (https://github.com/jori-nordic/zephyr/commits/ext-adv-debug) and close this issue as unreproducible.

theob-pro commented 3 months ago

Seen again at the UPF by @Thalley on the nRF53 platform (uses the IPC driver).

github-actions[bot] commented 1 month ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

Thalley commented 1 month ago

@jori-nordic @alwa-nordic @jhedberg I guess this will be fixed with the upcoming changes?

alwa-nordic commented 1 month ago

I guess this will be fixed with the upcoming changes?

I have rough sketch for a fix. The roadmap is:

Does this look like it would solve the problem?

Once this is completed, bt_hci_le_adv_ext_report can be extended to keep statistics of dropped packets. And we could add an new type of isr-safe adv report callback that can be given complete reports even if the reassembly pool is full.

Thalley commented 1 month ago

Does this look like it would solve the problem?

Assuming that

Ignore new reports if pool is full.

Does not discard partially received reports, then yes :) i.e. we have if received the first part of a report, we cannot discard any future ones for that one.

jori-nordic commented 1 week ago

Related to https://github.com/zephyrproject-rtos/zephyr/pull/75011

carlescufi commented 1 week ago

Bluetooth WG: