Closed stephanosio closed 1 year ago
For now, the quickest workaround is to specify -mfaster-structs, which forces all structs to have 8-byte alignment.
Apparently, our codebase doesn't like this: https://github.com/zephyrproject-rtos/zephyr/pull/48994/checks?check_run_id=7802767311 Let's just disable this test for the affected platforms.
The root cause still needs to be investigated and fixed.
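For anyone chasing the alignment theory, a minimal sketch of how object alignment could be inspected from C is below. The struct and both helper names are hypothetical and not taken from the Zephyr tree; what the compiler reports with or without -mfaster-structs is not shown here and would need to be checked on the actual SPARC toolchain.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example struct; not taken from the Zephyr sources. */
struct sample {
	int a;
	int b;
};

/* Returns true if ptr is a multiple of align (align must be a power of two). */
static bool is_aligned(const void *ptr, size_t align)
{
	return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* Alignment the compiler reports for struct sample on the current target. */
static size_t sample_alignment(void)
{
	return _Alignof(struct sample);
}
```

Calling is_aligned(&obj, 8) on the objects involved in the failing path (thread structs, stack frames) is one way to test whether an 8-byte assumption is being violated.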
Took a quick look. Agree that the alignment guess seems close. After checking to be sure this wasn't a stack overflow (seriously: always check the stacks on weird stuff like this) I got it localized to a particular test case that isn't doing anything but creating a thread and joining it synchronously. See notes in this patch: https://gist.github.com/andyross/d2edbd77463000e57b4c7a39cac31ed7
I thought it was a race at first, so I added a delay between the steps and that "fixed" it, as did a yield. But so did a busy-wait, which is more suspicious (that shouldn't generally cause a context switch), and so did a busy-wait of 1us, which was weird. And so did a hand-written delay. And so did the same hand-written delay with a loop count of 1, which adds only four instructions and a stack word! (And indeed, AFAICT the spawned thread never runs before the join in that final "working" case, so it's not a race between them at all.)
But hand-adding four NOPs did not fix the problem, nor did my attempts to expand the stack frame by a word via other means. So no root cause yet.
But this definitely looks like a glitch in the platform context switch code. That pthread_join() is just going to call arch_switch() synchronously. My guess is that there's an edge case in there that gets something wrong based on some oddity about the stack frame + instruction pointer state.
But it could totally be a bug in the pthread code too that just happens to fail here, or in the kernel for that matter. But my intuition points at the arch code.
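The shape of the failing case described above (create a thread, then join it synchronously, with an optional one-iteration delay in between) can be sketched roughly as follows. This is a host-side POSIX illustration only, not the actual Zephyr test: the names worker and create_and_join are hypothetical, and the real test passes explicit thread attributes and a stack.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread entry: records that it ran, then exits. */
static void *worker(void *arg)
{
	*(int *)arg = 1;
	return NULL;
}

/* Creates one thread and joins it synchronously.
 * Returns 0 on success, a pthread error code or -1 otherwise.
 */
static int create_and_join(void)
{
	pthread_t thread;
	int ran = 0;
	int ret;

	ret = pthread_create(&thread, NULL, worker, &ran);
	if (ret != 0) {
		return ret;
	}

	/* Stand-in for the hand-written delay with a loop count of 1. */
	for (volatile int i = 0; i < 1; i++) {
	}

	ret = pthread_join(thread, NULL);
	if (ret != 0) {
		return ret;
	}

	/* After a successful join the worker must have run. */
	return ran ? 0 : -1;
}
```

On a host system this should simply succeed; the point is only to show where the delay sits relative to the create and join calls.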
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
Describe the bug
The tests/posix/common/portability.posix.common test fails on the qemu_leon3 target:
This assertion failure occurs after the test successfully completes and the test thread is aborted (fails to swap):
https://github.com/zephyrproject-rtos/zephyr/blob/6cfb18686e1c494d0011aec87bbf24ab530d3a34/kernel/sched.c#L1761-L1765
To Reproduce
Build tests/posix/common/portability.posix.common for qemu_leon3 and run inside QEMU.
Expected behavior
Test passes.
Impact
CI reports a failure.
Additional context
Bisected to 9a4b5e1d908575a552f15204a51c92c9732f3d84; however this commit, in itself, does not seem to be doing anything wrong and the failure is likely due to a more serious underlying issue.
This failure is likely due to an alignment-related issue:
https://github.com/zephyrproject-rtos/zephyr/blob/9a4b5e1d908575a552f15204a51c92c9732f3d84/subsys/testsuite/include/zephyr/tc_util.h#L56-L59
Changing the number of =s by -7 or +7 makes the assertion failure go away.
Related discussion on Discord: https://discord.com/channels/720317445772017664/733037890514321419/1007518292732289034
P.S. This test also fails when compiled using the Zephyr SDK 0.15.0, so the issue is not specific to the toolchain or QEMU version.