zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.59k stars 6.48k forks source link

Bluetooth: host stack is broken without `CONFIG_ASSERT` #73316

Open jori-nordic opened 4 months ago

jori-nordic commented 4 months ago

The Bluetooth host stack doesn't implement enough error handling. A lot of parameter and state checking happens using __ASSERT() macros.

This is a good thing, as it reduces the mental load of parsing error handling. That "error handling" usually results in a broken state somewhere else, leaks, etc..

So we need some sort of ASSERT() macro that is always enabled.

Two solutions:

NB: in the future, that macro could be redirected to allow graceful shutdown of the Bluetooth thread, if running in userspace. That way an error condition in the stack will no longer bring down the entire kernel.

henryzhang62 commented 4 months ago

Would you please add steps to reproduce it?

alwa-nordic commented 4 months ago

Would you please add steps to reproduce it?

I believe this refers to cases like bt_conn_unref(NULL) (in application code) evaluating to undefined behavior (UB) when asserts are off. The question is whether it's harmful to allow UB as a result of API misuse when asserts are off, in public API, or even internal API.

The argument is that developer time saved by catching these errors is well worth the extra bytes and cycles many times over. I agree. My opinion is that simple asserts should be on-by-default globally in Zephyr, and we maintainers should make sure assert expressions are not too expensive. Disabling asserts should be a this-train-needs-no-brakes move by those who are forced into that particular requirement and are ready to bear the cost of harder debugging.

pdgendt commented 3 months ago

Just some comments to take along:

alwa-nordic commented 3 months ago

Just some comments to take along:

  • Assertions in most languages are no-ops when disabled, and crash/abort if a predicate is false, so we shouldn't depend on it in production code.

  • If a public API has documented assumptions, developers should adhere to them, and add checks before calling the functions, assertions in the callee code are there to help developers point out invalid calls.

  • There is also CHECKIF that can be used as an assertion check or a runtime check (enabled by default with USERSPACE).

I agree with you on all points. I think the issue here is: What should we recommend to our users? Asserts always on? Or asserts off in production? A default config is a pretty strong recommendation, so we should give it some consideration.

I'm of the opinion that asserts enabled in production is a good thing by default. And those who have special needs can make the choice for themselves. So I wish we could turn the tide and enable asserts by default for all of Zephyr.

jori-nordic commented 3 months ago

Some data for @Thalley . I was expecting more asserts than that 😅

The point is, maybe we should have a IF_RETURN(condition, message, retval) macro for the common pattern of checking parameters/state, logging an error and returning some value. Would de-clutter the host a bit I think.

null de-references: https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/gatt.c#L3469-L3486

https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/ecc.c#L63-L66

Broken logic: https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/att.c#L350-L352

https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/conn.c#L1506-L1511

ACL Control flow broken https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/hci_core.c#L574-L579

Bonding that falls on the floor if there's an error https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/smp.c#L3863-L3866

alwa-nordic commented 3 months ago

Bonding that falls on the floor if there's an error

https://github.com/zephyrproject-rtos/zephyr/blob/243783243cbde53b5353d645c2099eaf48cf8083/subsys/bluetooth/host/smp.c#L3863-L3866

This is an actual assertion. I assert that it's impossible for it to fail. You can read 'assert' as 'theorem' here.

Thalley commented 3 months ago

Some data for @Thalley . I was expecting more asserts than that 😅

The point is, maybe we should have a IF_RETURN(condition, message, retval) macro for the common pattern of checking parameters/state, logging an error and returning some value. Would de-clutter the host a bit I think.

Hmm, we could possibly do that. However in https://docs.zephyrproject.org/latest/contribute/guidelines.html#coding-style we state

In general, follow the Linux kernel coding style

Which in return in https://www.kernel.org/doc/html/v4.10/process/coding-style.html#macros-enums-and-rtl states

Things to avoid when using macros:

macros that affect control flow:

define FOO(x) \

    do {                                    \
            if (blah(x) < 0)                \
                    return -EBUGGERED;      \
    } while (0)

And I generally agree with that. Implicit return statements (via macros or other ways) really messes with the readers flow.

So if you want to propose a macro with a return, be prepared for some pushback on that (and not just from me ;) )

jori-nordic commented 3 months ago

DTS macrobatics are kosher though, definitely don't mess with people's heads, writing a driver is E-Z. I would also argue that an ifdef soup like we seem to love qualifies as a "macro that messes with the control flow". Sigh.. sorry for the deranged stream of thought, I'm just frustrated we're stuck with this language :|

In principle, I also agree that macros shouldn't mess with control flow. Doing it correctly will end up with having to scroll through one page of error checking prologue for each API, all with the same if (bad) { log-something; return -EFIGURETHISOUT } copy-pasted all over. When I'll get to the fn logic, I'm sure I will already have forgotten why I was looking at that fn in the first place..

I really don't know what to do.

Thalley commented 3 months ago

Sigh.. sorry for the deranged stream of thought, I'm just frustrated we're stuck with this language :|

It was going so well until you and @alwa-nordic started coming here with all your rationale and logic. It was so much easier when no-one questioned why we do what we do ;)

jori-nordic commented 3 months ago

To add some perspective, we tried to write an H4 HCI packet decoder in rust with @alwa-nordic and it was even more verbose than the C version :sweat_smile: . So not sure the grass is greener on that side either.

alwa-nordic commented 3 months ago

To add some perspective, we tried to write an H4 HCI packet decoder in rust with @alwa-nordic and it was even more verbose than the C version 😅 . So not sure the grass is greener on that side either.

This does not represent my opinion. /disclaimer :slightly_smiling_face:

alwa-nordic commented 3 months ago

Doing it correctly will end up with having to scroll through one page of error checking prologue for each API, all with the same if (bad) { log-something; return -EFIGURETHISOUT } copy-pasted all over. When I'll get to the fn logic, I'm sure I will already have forgotten why I was looking at that fn in the first place..

Maybe this could be solved be adding folding marks around the obvious stuff. :thinking: