zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
12.33k stars 7.53k forks source link

Driver APIs and Thread-Safety #89109

Open ldomaigne opened 1 month ago

ldomaigne commented 1 month ago

Introduction

The following RFC is the result of interactions with Zephyr adopters, some being documented on Discord .

As an application/product developer, I need to know which driver-APIs are thread-safe and which are not. Ideally, I don’t want to dig the driver code to figure this out.

I am aware that I might not have all relevant information, and that this RFC might need further refinement. The goal is to bring this topic back into awareness, and to take it to the architecture WG for discussion.

Problem description

There has been multiple RFC/PR/issues on the subject, many of which are still open. To name of few:

Based on @carlescufi answer:

Unfortunately, we never really got very far in documenting this [thread-safety] to the user, and as you describe we have very high variability in the different drivers. There was an attempt here: https://github.com/zephyrproject-rtos/zephyr/pull/21678/files, but it's not quite the same

For a driver class, we should have a consistent behavior for the API interface.

Proposed change

We need to cope both with the existing code base, and with future development, while keeping the progress manageable. We therefore suggest the following approaches:

Detailed RFC

A. API Design guidelines

B. Document existing driver classes

A. and B. can be completed independently. Task A ensures do the "right thing" with respect to thread-safety for any new development. Task B takes care of bridging any gaps in the current code base.

Proposed change (Detailed)

We follow the definition proposed by @pabigot for thread-safety:

Thread-safety: a function is thread-safe if its behavior is correct when invoked from multiple threads at the same time

Driver may choose the following design approaches:

For a driver class, this is more subtle. We should have a consistent behaviour for the API interface. This should be at least true for the mandatory APIs.

Following API meeting from April 2022, see this comment:

It is not clear from RFC 26073 if this decision is in place or not.

Dependencies

Concerns and Unresolved Questions

Alternatives

See the RFC mentioned on the subject above. This RFC tries to keep the effort manageable in the long run, while ensuring steady progress/ROI.

github-actions[bot] commented 1 month ago

Hi @ldomaigne! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

bjarki-andreasen commented 3 weeks ago

I believe we should go way simpler: Don't do any thread safety within any device driver (ISR context excluded), period. Why?

Drivers should be the dumbest, simplest part of the system, if application chooses to use threads, application gets to deal with it, this allows drivers to be reused for bootloaders, makes them way less buggy, and forces applications and subsystems to actually deal with sync, no hoping that a driver can somewhat survive being bashed with API calls from all angles :D

carlescufi commented 2 weeks ago

@ldomaigne How about next week (June 10th) for this topic?

ldomaigne commented 2 weeks ago

@ldomaigne How about next week (June 10th) for this topic?

@carlescufi : doesn't work for me, but the week after (June 17th) does.

carlescufi commented 1 week ago

@ldomaigne How about next week (June 10th) for this topic?

@carlescufi : doesn't work for me, but the week after (June 17th) does.

Settled then @ldomaigne, the 17th.

ldomaigne commented 4 days ago

Hi @bjarki-andreasen ,

thanks for your comments, very insightful.

The main motivation of this RFC is to get consistency across the drivers, and to communicate it to the user. It's not about making everything thread-safe (perhaps the RFC title chosen is somewhat unfortunate...).

I believe we should go way simpler: Don't do any thread safety within any device driver (ISR context excluded), period. Why?

Yeah, that's the option "externally synchronized driver" mentioned in the proposal.

I had the feeling that certain drivers class would be "naturally" thread safe. And so we could go a mixed-approach (but still, making sure it's consistent across the driver class). It's possibly just plain wrong.

I have one question about your proposal. Are there any implications if user mode threads are used?

bjarki-andreasen commented 4 days ago

Hi @bjarki-andreasen ,

thanks for your comments, very insightful.

The main motivation of this RFC is to get consistency across the drivers, and to communicate it to the user. It's not about making everything thread-safe (perhaps the RFC title chosen is somewhat unfortunate...).

I believe we should go way simpler: Don't do any thread safety within any device driver (ISR context excluded), period. Why?

Yeah, that's the option "externally synchronized driver" mentioned in the proposal.

I had the feeling that certain drivers class would be "naturally" thread safe. And so we could go a mixed-approach (but still, making sure it's consistent across the driver class). It's possibly just plain wrong.

I have one question about your proposal. Are there any implications if user mode threads are used?

Not any the drivers can do anything about :) We would need to rework our syscalls A LOT to prevent concurrent access to hardware from usermode, like adding some "ownership" model where usermode would need to "reserve" some hardware before calling any device driver APIs through syscalls, something like

/* get handle that can be used with "safe" syscalls */
int dev_handle = device_user_reserve(DEVICE_DT_GET(DT_NODELABEL(rtc)));
/* use handle with special "safe" syscall which checks ownership and calls rtc_get_time() internally */
int ret = user_rtc_get_time(dev_handle, struct rtc_time *timeptr);
/* let go of device for some other thread to use device */
device_user_release(dev_handle);

I think this goes ways beyond what device drivers should need to deal with :)

ldomaigne commented 4 days ago

I think this goes ways beyond what device drivers should need to deal with :)

Yeah, I agree. It's OK to shift that responsibility to the application. :)

For products subject to safety requirements (the word "safety" here has to be understood as in "safety working group"), it means we have to do something about it. While certain applications will have specific requirements and implement their own sync mechanisms, I was wondering if some common mechanisms could be provided for the most common use cases. (Just a thought at that point, I don't have the answer yet).

henrikbrixandersen commented 4 days ago

I believe we should go way simpler: Don't do any thread safety within any device driver (ISR context excluded), period.

I wholeheartedly disagree. This will just shift the whole issue up to a higher layer, further complicating those layers. Moving all thread-safety out of the device driver subsystems will make it extremely difficult to write modular software on top of these, since those independent modules will then need a shared locking mechanism.

A software architecture relies on a stable, reliable foundation - one part of the foundation is the device driver subsystems. Removing all non-ISR thread safety from the device driver subsystems will greatly weaken this foundation.

bjarki-andreasen commented 4 days ago

I believe we should go way simpler: Don't do any thread safety within any device driver (ISR context excluded), period.

I wholeheartedly disagree. This will just shift the whole issue up to a higher layer, further complicating those layers. Moving all thread-safety out of the device driver subsystems will make it extremely difficult to write modular software on top of these, since those independent modules will then need a shared locking mechanism.

A software architecture relies on a stable, reliable foundation - one part of the foundation is the device driver subsystems. Removing all non-ISR thread safety from the device driver subsystems will greatly weaken this foundation.

Device drivers are too low level to be responsible for synchronization, they don't have the neccesary information required to manage concurrent access or ownership, forcing device drivers to try anyway is what I believe weakens the driver subsystem.

Take a UART for example, does it make sense for multiple threads to be calling any UART API at the same time? Thread A and Thread B and Thread C all calling uart_poll() for example? Sure we can add a sem lock so they don't access anything concurrently, but functionally, this has no utility, they will at best all get part of a broken stream. Something higher up would need to synchronize access to the entire UART device, to do an entire transaction, before another thread can get a turn.

There are few outliers, like rtc_get_time() that could be argued are safe to be called by multiple threads, but what about rtc_set_alarm()? there is only one alarm, something needs to own it, the driver has no way to manage this.

henrikbrixandersen commented 4 days ago

Device drivers are too low level to be responsible for synchronization, they don't have the neccesary information required to manage concurrent access or ownership, forcing device drivers to try anyway is what I believe weakens the driver subsystem.

For a single API call? I disagree. The device drivers or their corresponding subsystem has the _best _knowledge of how and which API calls need synchronisation. That is not saying that all driver subsystem APIs need to be thread-safe, but it does mean that we cannot just "Don't do any thread safety within any device driver".

bjarki-andreasen commented 4 days ago

Device drivers are too low level to be responsible for synchronization, they don't have the neccesary information required to manage concurrent access or ownership, forcing device drivers to try anyway is what I believe weakens the driver subsystem.

For a single API call? I disagree. The device drivers or their corresponding subsystem has the _best _knowledge of how and which API calls need synchronisation. That is not saying that all driver subsystem APIs need to be thread-safe, but it does mean that we cannot just "Don't do any thread safety within any device driver".

Maybe we are misaligned on what device driver subsystem means, RTIO would be where synchronization to a bus would occur, not inside any SPI or I2C device driver for example. something higher up need to sync access I think.

henrikbrixandersen commented 4 days ago

Maybe we are misaligned on what device driver subsystem means, RTIO would be where synchronization to a bus would occur, not inside any SPI or I2C device driver for example. something higher up need to sync access I think.

We have many, many driver sybsystems that do not use RTIO.

carlescufi commented 4 days ago

Architecture WG meeting: