kernel: Add _THREAD_SLEEPING thread state

peter-mitsis commented 1 day ago

As noted elsewhere, Zephyr's performance on the thread_metric preemptive benchmark has been observed to be significantly behind that of some other RTOSes such as ThreadX. One contributing factor is the call to z_abort_thread_timeout() in k_thread_suspend(). Suspending a thread really should be orthogonal to timeouts and sleeps. This set of commits aims to both correct that and improve Zephyr's preemptive benchmark performance. When applied atop the other performance-related PRs, this patch set gives us numbers that are about 9% better.

To decouple the two, a new thread state has been introduced: _THREAD_SLEEPING. As all the existing bits in the thread_state field are used, this field must grow beyond 8 bits. Increasing it on its own would introduce padding gaps in the layout of the _thread_base structure. To counteract the padding, two other fields have also had their sizes modified: user_options has been increased to 16 bits (it was getting close to full), and cpu_mask is now always 16 bits. These changes can be expected to have an impact on third-party tools.
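
As a rough illustration of the padding concern, the sketch below compares three simplified layouts. The struct names, the member order, and the `preempt` and `order_key` members are assumptions made for the example; this is not the real `_thread_base` definition, and the byte counts assume a typical ABI where `uint16_t` and `uint32_t` are naturally aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for _thread_base. The real structure
 * has more members and a different order. */

/* Before: two 8-bit fields pack tightly ahead of the wider members. */
struct base_before {
	uint8_t  user_options;
	uint8_t  thread_state;
	uint16_t preempt;
	uint32_t order_key;
};

/* Only thread_state widened: alignment gaps open up around it. */
struct base_state_only {
	uint8_t  user_options;  /* typically followed by 1 padding byte */
	uint16_t thread_state;
	uint16_t preempt;       /* typically followed by 2 padding bytes */
	uint32_t order_key;
};

/* thread_state, user_options and cpu_mask all 16 bits wide. */
struct base_all_widened {
	uint16_t user_options;
	uint16_t thread_state;
	uint16_t preempt;
	uint16_t cpu_mask;
	uint32_t order_key;
};

#define MEMBER_SIZE(type, member) (sizeof(((type *)0)->member))

/* Padding = total size minus the sum of the member sizes. */
size_t padding_state_only(void)
{
	return sizeof(struct base_state_only)
	       - MEMBER_SIZE(struct base_state_only, user_options)
	       - MEMBER_SIZE(struct base_state_only, thread_state)
	       - MEMBER_SIZE(struct base_state_only, preempt)
	       - MEMBER_SIZE(struct base_state_only, order_key);
}

size_t padding_all_widened(void)
{
	return sizeof(struct base_all_widened)
	       - MEMBER_SIZE(struct base_all_widened, user_options)
	       - MEMBER_SIZE(struct base_all_widened, thread_state)
	       - MEMBER_SIZE(struct base_all_widened, preempt)
	       - MEMBER_SIZE(struct base_all_widened, cpu_mask)
	       - MEMBER_SIZE(struct base_all_widened, order_key);
}
```

In this toy layout, widening only `thread_state` grows the structure while leaving some of the new bytes unused, whereas widening the neighbouring fields puts those same bytes to work at the same total size.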

This decoupling also results in some behavior changes.

- A thread that has gone to sleep forever will no longer be resumed by k_thread_resume(). Such a thread is awakened with k_wakeup().
- Suspending a thread does not cancel any timeouts, thereby allowing a thread to be both sleeping and suspended, or pending with a timeout and suspended.
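
The two behavior changes can be sketched with a small host-side model of the state bits. The bit values and the `model_*` names are hypothetical and only stand in for the kernel's real state handling; the point is that sleeping and suspended are independent bits, each cleared by its own operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments; not Zephyr's actual values. */
#define MODEL_SLEEPING  (1u << 0)
#define MODEL_SUSPENDED (1u << 1)

struct model_thread {
	uint16_t state;
};

/* A thread can run only when no blocking bit is set. */
static bool model_is_runnable(const struct model_thread *t)
{
	return (t->state & (MODEL_SLEEPING | MODEL_SUSPENDED)) == 0;
}

static void model_sleep_forever(struct model_thread *t)
{
	t->state |= MODEL_SLEEPING;
}

/* Suspending sets its own bit and leaves the sleeping bit alone. */
static void model_suspend(struct model_thread *t)
{
	t->state |= MODEL_SUSPENDED;
}

/* Resuming clears only the suspended bit. */
static void model_resume(struct model_thread *t)
{
	t->state &= (uint16_t)~MODEL_SUSPENDED;
}

/* Waking clears only the sleeping bit. */
static void model_wakeup(struct model_thread *t)
{
	t->state &= (uint16_t)~MODEL_SLEEPING;
}
```

In this model, calling `model_resume()` on a thread that is only sleeping leaves it blocked, and a thread that is both sleeping and suspended needs both `model_wakeup()` and `model_resume()` before it can run.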

Below are some performance numbers from the thread_metric preemptive benchmark with multiq on the disco_l475_iot1 (higher is better):

- Main branch: 5731301
- This commit atop main: 5854334
- PR #81311 and #81677 together: 6501273
- This commit atop both #81311 and #81677: 7034780

teburd commented 1 day ago

I believe the changes look great, and the performance gains are excellent. The commit messages deserve a more detailed explanation, though. The why and what of the changes in each commit for this sort of PR really deserve more explanation in my opinion, particularly the last commit. Why is this netting us a nearly 10% bump in performance? How did you find this?

peter-mitsis commented 1 day ago

I have expanded upon the commit messages. I hope that they are better.
This latest revision should fix the build errors from the previous CI run.

teburd commented 1 day ago

Second commit has a small typo, otherwise the messages are very clear and helpful now. Thanks!

"At the present time, Zephyr does has overlap between sleeping and suspending."

Likely meant

"At the present time, Zephyr does have overlap between sleeping and suspending."

andyross commented 1 day ago

Also, the expansion of the thread state flags just bugs me aesthetically. We have too many of these already. Maybe we can separate the "true flag" ones (e.g. "ABORTING/SUSPENDING", "QUEUED") from the "enumerative" ones that are-or-at-least-should-be mutually exclusive (DEAD, PENDED, now SLEEPING/SUSPENDED), etc... With some work we could probably move obscure stuff like DUMMY and PRESTART into some other state and get it out of the mask byte. Likewise ABORTING and SUSPENDING are 98% the same state and could be discriminated in other ways than the flags.

andyross commented 1 day ago

And finally: this is the third PR now that's come along chasing performance numbers in k_thread_suspend(), which really shouldn't need to be a performance path, IMHO. That's an obscure (and extremely race-prone!) API that real apps shouldn't be relying on. This is like chasing a bunch of Linux performance numbers by looking at the cycle times of kill().

Can someone point to the test in question?

teburd commented 1 day ago

> And finally: this is the third PR now that's come along chasing performance numbers in k_thread_suspend(), which really shouldn't need to be a performance path, IMHO. That's an obscure (and extremely race-prone!) API that real apps shouldn't be relying on. This is like chasing a bunch of Linux performance numbers by looking at the cycle times of kill().
>
> Can someone point to the test in question?

https://github.com/zephyrproject-rtos/zephyr/tree/main/tests/benchmarks/thread_metric

Originally from embedded.com, I believe in 2007: https://www.embedded.com/measure-your-rtoss-real-time-performance/

The concern stems from a report posted by Beningo (there's a PDF/slide set floating around out there...). Some of the results can be seen at https://www.embedded.com/how-do-you-test-rtos-performance/, which shows Zephyr performing poorly.

andyross commented 1 day ago

Sigh, ew. Apologies in advance for the Torvaldsist rant, but it really has to be said. That is just embarrassingly pessimal. Basically the test has a bunch of threads it wants to run at different times, and is doing it with tm_thread_suspend/resume() calls. Which I guess the porter has mapped to k_thread_suspend/resume().

But again, those aren't Zephyr performance APIs, aren't used by actual apps in the wild, and really aren't something we should be introducing complexity to try to make fast.

We have fast APIs, it just isn't these. We should fix this to make "self-suspend thread N" be "k_sem_take(semaphore_N)", and "resume thread_N" be "k_sem_give(semaphore_N)". Semaphores are, have been, and likely always will be our go-to lightweight/fast/best synchronization primitive.

I'm not opposed in principle to making suspend/resume faster, but not at the cost of complexity and absolutely not if it turns out it's just because we were measuring the wrong thing.
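
One way to see why unsynchronized suspend/resume is called race-prone, and why a semaphore is suggested instead, is a toy host-side model of a single ordering: the "resume" arrives before the thread has suspended itself. Everything here is hypothetical modelling code (the `flag_*` and `model_sem_*` names are invented for the example); it stands in for k_thread_suspend()/k_thread_resume() and k_sem_take()/k_sem_give() rather than calling them, and it shows only one possible race.

```c
#include <stdbool.h>

/* Toy model of flag-style suspend/resume (hypothetical names). */
struct flag_thread {
	bool suspended;
};

static void flag_suspend(struct flag_thread *t) { t->suspended = true; }
static void flag_resume(struct flag_thread *t)  { t->suspended = false; }

/* Toy model of a counting semaphore that starts at 0. */
struct model_sem {
	unsigned int count;
	unsigned int limit;
};

static void model_sem_give(struct model_sem *s)
{
	if (s->count < s->limit) {
		s->count++;
	}
}

/* Returns true if the caller would have to block. */
static bool model_sem_take_blocks(struct model_sem *s)
{
	if (s->count > 0) {
		s->count--;
		return false;
	}
	return true;
}

/* The resume is issued before the thread suspends itself. */
static bool flag_thread_stuck_after_early_resume(void)
{
	struct flag_thread t = { .suspended = false };

	flag_resume(&t);   /* no effect: the thread is not suspended yet */
	flag_suspend(&t);  /* waits for a resume that has already happened */
	return t.suspended;
}

/* The give is issued before the thread takes the semaphore. */
static bool sem_thread_stuck_after_early_give(void)
{
	struct model_sem s = { .count = 0, .limit = 1 };

	model_sem_give(&s);               /* remembered in the count */
	return model_sem_take_blocks(&s); /* consumes the count, no blocking */
}
```

In the model, the early resume is lost and the flag-based thread stays suspended, while the semaphore's count remembers the early give so the later take does not block.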

teburd commented 1 day ago

> Sigh, ew. Apologies in advance for the Torvaldsist rant, but it really has to be said. That is just embarrassingly pessimal. Basically the test has a bunch of threads it wants to run at different times, and is doing it with tm_thread_suspend/resume() calls. Which I guess the porter has mapped to k_thread_suspend/resume().
>
> But again, those aren't Zephyr performance APIs, aren't used by actual apps in the wild, and really aren't something we should be introducing complexity to try to make fast.
>
> We have fast APIs, it just isn't these. We should fix this to make "self-suspend thread N" be "k_sem_take(semaphore_N)", and "resume thread_N" be "k_sem_give(semaphore_N)". Semaphores are, have been, and likely always will be our go-to lightweight/fast/best synchronization primitive.
>
> I'm not opposed in principle to making suspend/resume faster, but not at the cost of complexity and absolutely not if it turns out it's just because we were measuring the wrong thing.

This is exactly the same API style found in ThreadX (tx_thread_suspend/tx_thread_resume) and FreeRTOS (vTaskSuspend/vTaskResume). The obvious choice is to do the same for Zephyr. The non-obvious choice is I guess whatever was undocumented as the performance API version of these. Not to be too snarky here, really, but we can't expect people to do the non-obvious thing. Particularly if all the other options have like-named things.

peter-mitsis commented 1 day ago

> But now it's going to take a timeout ISR at some point in the future and presumably un-suspend unexpectedly?

No. A suspended thread is blocked, but a blocked thread is not necessarily suspended, and sleeping != suspended. When we suspend a thread, we are telling it "You will not run for as long as you are suspended. You may get your resources if you are pended on an object, but you will not run if you are suspended. Your timeouts may expire, but you will not run if you are suspended. The only way to get not-suspended is to be resumed." The world passes you by when you are suspended.

Do I think that the thread_metric preemptive benchmark is great? No. But it is consistent and not unreasonable. Furthermore, it is out there, and Beningo's report and the perception of Zephyr need to be addressed.

andyross commented 23 hours ago

Meh. ThreadX and FreeRTOS absolutely do have semaphores and proper synchronization tools. Again what I'm saying is that apps should not use unsynchronized suspend/resume for correctness reasons, and that a benchmark based on them is testing something dumb. We can make dumb stuff fast, sure, but it remains dumb and someone needs to say that it's dumb. This benchmark is dumb. It's testing dumb things. We should test better things. That may be an orthogonal point to "we should make dumb things fast", but IMHO it's a more important point.

Stated alternatively: you cannot make an optimization change in good faith without taking serious stock of what you are measuring and why. Blindly chasing benchmarks never leads anywhere happy.

And to the correctness bit:

> When we suspend a thread, we are telling it "You will not run for as long as you are suspended. You may get your resources if you are pended on an object, but you will not run if you are suspended.

I don't think that's the case? The thread timeout handler is unconditional as I read it. You'd need to fix it to inspect the SUSPENDED flag and then elide the wakeup, and unless I'm missing it I don't see code for that here? Maybe you could add a test case that suspends a sleeping thread and verifies that its wakeup timeout expires without incident (something we should probably have in the tests already, though currently that would act to unsuspend the thread obviously).

A change which, it needs to be pointed out, is adding code and cycles to the wakeup path (something we know real apps do!) to make the suspend path (weird, rare, and race-prone) faster. Which is now no longer a straightforward optimization and has become a tradeoff made in IMHO the wrong direction.
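
The suggested test case (suspend a sleeping thread, let its timeout expire, and check that it stays blocked until resumed) can be sketched against a toy host-side model. The `m_*` names, the bit values, and the `in_ready_queue` flag are all hypothetical; this models a timeout handler that checks the remaining state bits before readying the thread, and it is not the kernel's actual handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments; not Zephyr's actual values. */
#define M_SLEEPING  (1u << 0)
#define M_SUSPENDED (1u << 1)

struct m_thread {
	uint16_t state;
	bool in_ready_queue;
};

/* Ready the thread only if nothing else is still blocking it. */
static void m_ready_if_unblocked(struct m_thread *t)
{
	if (t->state == 0) {
		t->in_ready_queue = true;
	}
}

/* Model timeout handler: clears the sleep bit, then performs the extra
 * check that keeps a suspended thread off the ready queue. */
static void m_timeout_expired(struct m_thread *t)
{
	t->state &= (uint16_t)~M_SLEEPING;
	m_ready_if_unblocked(t);
}

static void m_resume(struct m_thread *t)
{
	t->state &= (uint16_t)~M_SUSPENDED;
	m_ready_if_unblocked(t);
}

/* Scenario: a sleeping thread is suspended, its timeout fires, and only
 * the later resume makes it ready. Returns true if that is what happens. */
static bool m_suspended_sleeper_stays_blocked(void)
{
	struct m_thread t = {
		.state = M_SLEEPING | M_SUSPENDED,
		.in_ready_queue = false,
	};
	bool stayed_blocked;

	m_timeout_expired(&t);
	stayed_blocked = !t.in_ready_queue;

	m_resume(&t);
	return stayed_blocked && t.in_ready_queue;
}
```

The check inside `m_ready_if_unblocked()` is the extra work on the wakeup path that the tradeoff argument above refers to.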