Open peter-mitsis opened 1 day ago
I believe the changes look great, and the performance gains are excellent. However, I think the commit messages deserve a more detailed explanation. For this sort of PR, the why and what of the changes in each commit really deserve more explanation in my opinion, particularly the last commit. Why does it net us nearly a 10% bump in performance? How did you find this?
Second commit has a small typo, otherwise the messages are very clear and helpful now. Thanks!
"At the present time, Zephyr does has overlap between sleeping and suspending."
Likely meant
"At the present time, Zephyr does have overlap between sleeping and suspending."
Also the expansion of the thread state flags just bugs me aesthetically. We have too many of these already. Maybe we can separate the "true flag" ones (e.g. "ABORTING/SUSPENDING", "QUEUED") with the "enumerative" ones that are-or-at-least-should-be mutually exclusive (DEAD, PENDED, now SLEEPING/SUSPENDED), etc... With some work we could probably move obscure stuff like DUMMY and PRESTART into some other state and get it out of the mask byte. Likewise ABORTING and SUSPENDING are 98% the same state and could be discriminated in other ways than the flags.
And finally: this is the third PR now that's come along chasing performance numbers in k_thread_suspend(), which really shouldn't need to be a performance path, IMHO. That's an obscure (and extremely race-prone!) API that real apps shouldn't be relying on. This is like chasing a bunch of Linux performance numbers by looking at the cycle times of kill().
Can someone point to the test in question?
https://github.com/zephyrproject-rtos/zephyr/tree/main/tests/benchmarks/thread_metric
Originally from embedded.com I believe in 2007, https://www.embedded.com/measure-your-rtoss-real-time-performance/
Stems from a report posted by Beningo (there's a pdf/slide set floating out there...) which you can see some of the results easily at https://www.embedded.com/how-do-you-test-rtos-performance/ showing Zephyr performing poorly
Sigh, ew. Apologies in advance for the Torvaldsist rant, but it really has to be said. That is just embarrassingly pessimal. Basically the test has a bunch of threads it wants to run at different times, and is doing it with tm_thread_suspend/resume() calls. Which I guess the porter has mapped to k_thread_suspend/resume().
But again, those aren't Zephyr performance APIs, aren't used by actual apps in the wild, and really aren't something we should be introducing complexity to try to make fast.
We have fast APIs, it just isn't these. We should fix this to make "self-suspend thread N" be "k_sem_take(semaphore_N)", and "resume thread_N" be "k_sem_give(semaphore_N)". Semaphores are, have been, and likely always will be our go-to lightweight/fast/best synchronization primitive.
I'm not opposed in principle to making suspend/resume faster, but not at the cost of complexity and absolutely not if it turns out it's just because we were measuring the wrong thing.
This is exactly the same API style found in ThreadX (tx_thread_suspend/tx_thread_resume) and FreeRTOS (vTaskSuspend/vTaskResume). The obvious choice is to do the same for Zephyr. The non-obvious choice is, I guess, whatever undocumented thing serves as the performance-API version of these. Not to be too snarky here, really, but we can't expect people to do the non-obvious thing. Particularly if all the other options have like-named things.
But now it's going to take a timeout ISR at some point in the future and presumably un-suspend unexpectedly?
No. A suspended thread is blocked, but a blocked thread is not necessarily suspended, and sleeping != suspended. When we suspend a thread, we are telling it "You will not run for as long as you are suspended. You may get your resources if you are pended on an object, but you will not run if you are suspended. Your timeouts may expire, but you will not run if you are suspended. The only way to get not-suspended is to be resumed." The world passes you by when you are suspended.
Do I think that the thread_metric preemptive benchmark is great? No. But it is consistent and not unreasonable. Furthermore, it is out there, and Beningo's report and the perception of Zephyr needs to be addressed.
Meh. ThreadX and FreeRTOS absolutely do have semaphores and proper synchronization tools. Again what I'm saying is that apps should not use unsynchronized suspend/resume for correctness reasons, and that a benchmark based on them is testing something dumb. We can make dumb stuff fast, sure, but it remains dumb and someone needs to say that it's dumb. This benchmark is dumb. It's testing dumb things. We should test better things. That may be an orthogonal point to "we should make dumb things fast", but IMHO it's a more important point.
Stated alternatively: you cannot make an optimization change in good faith without taking serious stock in what you are measuring and why. Blindly chasing benchmarks never leads anywhere happy.
And to the correctness bit:
When we suspend a thread, we are telling it "You will not run for as long as you are suspended. You may get your resources if you are pended on an object, but you will not run if you are suspended.
I don't think that's the case? The thread timeout handler is unconditional as I read it. You'd need to fix it to inspect the SUSPENDED flag and then elide the wakeup, and unless I'm missing it I don't see code for that here? Maybe you could add a test case that suspends a sleeping thread and verifies that its wakeup timeout expires without incident (something we should probably have in the tests already, though currently that would act to unsuspend the thread obviously).
A change which, it needs to be pointed out, is adding code and cycles to the wakeup path (something we know real apps do!) to make the suspend path (weird, rare, and race-prone) faster. Which is now no longer a straightforward optimization and has become a tradeoff made in IMHO the wrong direction.
Finally: if the whole point is to remove the call to abort_timeout(), maybe we could just make that faster instead? That is not a complicated operation for the case exercised here. Here's the meat: https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/timeout.c#L144
It literally looks like you'd expect: "If there is no timeout, do nothing". So if this is actually slow, the culprit would be sys_dnode_is_linked(), which is a single pointer NULL compare and really can't be made much faster.
What I wrote is too complicated. Simple version: "abort timeout" should just be a quick check that a pointer field is null. If that's not what it's doing, let's fix that first before re-architecting sleep.
Another suggestion: very likely what our competitors are doing in that code is detecting the case where we're suspending the current thread and then implementing an optimized handler[1]. Whereas our suspend implementation is a general thing intended to asynchronously stop a thread in any state (and across CPUs!). Maybe we could add a test for that at the very top and do even better still?
[1] Among other things, _current never has a timeout, heh.
Yeah, I'm liking that better. So with as pretentious a flourish as I can muster, here's my mostly-but-not-entirely untested (it runs tests/kernel/common and tests/kernel/sched/schedule_api on qemu_cortex_m3) attempt at "Kick ThreadX's butt on a dumb benchmark" patch:
diff --git a/kernel/sched.c b/kernel/sched.c
index eda1a3e0908..cf5094639cc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -499,6 +499,21 @@ void z_impl_k_thread_suspend(k_tid_t thread)
{
SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_thread, suspend, thread);
+ /* There are benchmarks in the wild which use
+ * k_thread_suspend() to force a context switch, which is not
+ * really an appropriate use of the API (suspend is inherently
+ * unsynchronized and deadlocky!). But make it fast just for
+ * numbers.
+ */
+ if (thread == _current && !IS_ENABLED(CONFIG_SMP)) {
+ K_SPINLOCK(&_sched_spinlock) {
+ z_mark_thread_as_suspended(thread);
+ dequeue_thread(thread);
+ update_cache(1);
+ }
+ return;
+ }
+
(void)z_abort_thread_timeout(thread);
k_spinlock_key_t key = k_spin_lock(&_sched_spinlock);
Basically this just detects the specific case the benchmark is hitting and extracts the minimal work required to suspend the current thread, doing nothing else. The SMP test is there just to make the dequeue_thread() call unconditional, you'd otherwise have to test is_queued(). And as we aren't obviously losing SMP benchmarks to UP only toy systems, I didn't see the point.
The SMP test is there just to make the dequeue_thread() call unconditional, you'd otherwise have to test is_queued().
Which is wrong, sorry. _current is never queued on SMP, so you'd just do this to support both:
// oops
Never mind. There's more stuff to do in SMP like handling async racing aborts and stuff and it'll never be this simple there. So the original as written is best.
Your proposal still has suspend muck with the timeout, which it should not do.
A thread pending on a semaphore with a timeout should not lose the timeout if the thread is suspended. The following sequence should be valid ...
Presently, Zephyr behaves incorrectly--it aborts the timeout on suspend and turns it into a K_FOREVER. One of the aims of this patch set is to correct that. The manner in which it corrects it both introduces _THREAD_SLEEPING, and more clearly separates suspend from sleeping. In the process, it also improves the benchmark performance.
Huh. Looks like it actually works. Submitted in #81737
Presently, Zephyr behaves incorrectly
That's a reasonable argument. But it's behaving as documented, I believe, and it's done it for as long as the APIs have existed. Suspending a thread overrides any sleep it might be doing and guarantees the kernel will do no work on behalf of the thread in the future, until it's resumed (at which point it returns from k_sleep(), potentially earlier than intended -- again early exit from k_sleep() is a documented behavior).
I'm not saying no to redoing sleep (though I'd really like to see it done in a way that doesn't duplicate code), I'm just freaking out that we're trying it out of desire to chase a benchmark.
Originally, it was explicitly documented that the timer would keep ticking. The documentation for k_thread_suspend() in kernel.h was changed back in 2019 (see SHA ID 50d0942f5e70885a8cd4f97e21e9059263d713c1) as part of #20947 (for issue #20033)
There is documentation that has been unchanged since it was first written back in 2015 under doc/kernel/services/threads/index.rst wherein there is a note clearly stating that k_sleep() is different from suspending a thread.
Nice catch, I have no memory of that exchange. But again, doesn't this patch just re-open #20033? That user viewed the behavior in this PR as a fatal bug. The behavior I think the old documentation wanted to see, but that never actually existed, doesn't exist here either, no? If you suspend a thread after this merges, and it has a live timeout, that timeout will wake it up unexpectedly. Or am I misreading? Needs a test for sure.
I can write a test and add it to the patch. :)
Came back up to check. Yeah, all but certain this PR needs more. Right now the timeout handler looks like this and calls directly into z_sched_wake_thread(). That's going to unconditionally[1] mark the thread as not suspended and then call into ready_thread(). So it's guaranteed to ignore whatever suspend might have done.
https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/sched.c#L614
[1] It does have a test for dead/aborting threads, which looks suspiciously needless to me as I'm pretty sure abort is now race free. It's possible this is some vestigial code trying to address misbehavior in abort that could now be removed, though I'd be moderately afraid of touching this. Digression, but bears investigation.
... Right now the timeout handler looks like this and calls directly into z_sched_wake_thread(). That's going to unconditionally[1] mark the thread as not suspended and then call into ready_thread().
The 3rd commit in this set updates z_sched_wake_thread() to replace the unconditional not suspended with an unconditional not sleeping.
As has been previously noted elsewhere, Zephyr's performance on the thread_metric preemptive benchmark had been observed to be significantly behind that of some other RTOSes such as ThreadX. One of the contributing factors has been the call to z_abort_thread_timeout() in k_thread_suspend(). Suspending a thread really should be orthogonal to timeouts and sleeps. This set of commits aims to both correct that and improve Zephyr's preemptive benchmark performance. When applied atop the other performance-related PRs, this patch set gives us numbers that are about 9% better.
To decouple the two, a new thread state has been introduced--_THREAD_SLEEPING. As all the existing bits in the thread_state field are used, the size of this field must be increased from an 8-bit field. Should this field be increased on its own, this will introduce padding gaps in the layout of the _thread_base structure. To counteract the padding two other fields have also had their sizes modified. user_options has been increased to 16 bits (it was getting closer to being full). cpu_mask has been made to always be 16-bits. These changes can be expected to have an impact on 3rd party tools.
This decoupling also results in some behavior changes.
Below are some performance numbers using the thread_metric's preemptive benchmark with multiq on the disco_l475_iot1 (higher is better):
- Main branch: 5731301
- This commit atop main: 5854334
- PRs #81311 and #81677 together: 6501273
- This commit atop both #81311 and #81677: 7034780