Closed RobMeades closed 3 months ago
@RobMeades please use the bug report template, thanks!
@ajarmouni-st: updated.
Can you please bisect and provide the sha where this started happening?
Ummm... not easily! There have been rather a lot of changes in the N months since 3.6.0. Are there any particular areas you might like me to bisect around? To be completely clear, I wasn't looking for a fix, I was looking for the Zephyr-defined debug method for what appears to be a heap corruption issue.
Ummm... not easily! There have been rather a lot of changes in the N months since 3.6.0
To clarify: git bisect
will binary search that space for you. My quick count says that there are 6528 patches merged between v3.6.0 and v3.7.0-rc1. That will require ~13 individual retests. See the man page or StackOverflow or whatever for more details, but basically:
git bisect start
git reset --hard v3.6.0
(probably re-test this to be sure it's a good commit for you)git bisect good
git reset --hard v3.7.0-rc1
(or HEAD, or an older failing commit if you have one)git bisect bad
git bisect
"good
" or "bad
" according to resultThere may be complexities between the versions if e.g. the problem is due to a west dependency, in which case a "west update" at each step would be needed to see it. Likewise sometimes (but rarely) an SDK change will introduce a regression, etc...
Thanks for the explanation: this is being run on a regression test system which takes a git
tag as its source so we don't have the freedom of using git bisect
. I will have to see if there is any way that we can run locally somehow and then figure out how to use git bisect
.
But the question asked original]y remains: what mechanisms does Zephyr provide to debug heap corruption? We could employ such mechanisms to find out who did what to whom without the need to accommodate git bisect
.
I have now managed to run the testing locally and use that rather useful git bisect
command: the offending commit seems to be 6cd7936f575e511f23bf75288f0dae25bad2fa51 "kernel: align kernel stack size declaration".
Also, while sorting the testing out, I happened to run it on main
and that appears not to show the problem so I will bisect
my way from v3.7.0-rc1
to main
in case there has been a fix.
EDIT: no, I was mistaken, the problem does still occur on main
.
If that's really the problem commit, my guess is that you simply have a stack overflow and rearranging the offsets is enough to push you over the edge. @dcpleung might have more input. Have you tried just increasing the stack sizes in your test (I'm gathering this is out-of-tree code?) to see if one of them is the culprit?
We call k_thread_stack_space_get()
on the tasks in question as part of the test and that reports 872 bytes free out of 1156 allocated; I know that doesn't necessarily mean there isn't a stack overflow but, on the other hand, the margin seems quite large. The stack sizes we use during testing are common across the platforms we support and I know that other platforms require more stack than Zephyr does, hence the stacks tend to be over-sized.
Does the Zephyr heap have the option of switching on guards that are checked on a more frequent basis?
I've just noticed this detailed comment in the abstraction we use to manage dynamic threads on Zephyr, above code which uses the Z_KERNEL_STACK_SIZE_ADJUST
macro:
// Zephyr doesn't officially support dynamically allocated stack memory.
// For this reason we need to do some alignment hacks here.
// When CONFIG_USERSPACE is enabled Zephyr will check if the stack is
// "user capable" and then decide whether to use kernel or user space hosted
// threads. When we pass it a stack pointer from the heap Zephyr will
// decide to use kernel hosted thread. This is very important at least for
// 32 bit ARM arch where MPU is enabled. For user space hosted threads
// the stack alignment requirement is the nearest 2^x of the stack size.
// Since the only way to align dynamically allocated memory is to adjust the
// pointer after allocation we would in this case need to allocate the double
// stack size which of course isn't a solution.
// Luckily when the thread is kernel hosted the stack alignment is much lower
// since then only a small MPU guard region is added at the top of the stack.
// This decreases the stack alignment requirement to 32 bytes.
//
// For the above reason the code below will use the Z_KERNEL_STACK_xx defines
// instead of Z_THREAD_STACK_xx.
size_t stackAllocSize;
// Other architectures may have other alignment requirements so just add
// a simple check that we don't waste a huge amount dynamic memory due to
// aligment.
U_ASSERT(Z_KERNEL_STACK_OBJ_ALIGN <= 512);
// Z_KERNEL_STACK_SIZE_ADJUST() will add extra space that Zephyr may require and
// to make sure correct allignment we allocate Z_KERNEL_STACK_OBJ_ALIGN extra.
stackAllocSize = Z_KERNEL_STACK_OBJ_ALIGN + Z_KERNEL_STACK_SIZE_ADJUST(stackSizeBytes);
threadPtr->pStackAllocation = k_malloc(stackAllocSize);
// Do the stack alignment
threadPtr->pStack = (k_thread_stack_t *)ROUND_UP(threadPtr->pStackAllocation,
Z_KERNEL_STACK_OBJ_ALIGN);
If I swap Z_KERNEL_STACK_SIZE_ADJUST
for K_KERNEL_STACK_LEN
in that code, the problem goes away. Question is, is that the right fix or does it just happen to move memory around so that the code happens to work? @andyross: what do you think?
You can now use k_thread_stack_alloc()
to allocate stack space.
You can now use
k_thread_stack_alloc()
to allocate stack space.
Interesting: I don't want to make too many changes just now and we have to remain compatible with previous versions of Zephyr (since we haven't moved forward on the nRFConnect SDK-based version) but I will make a note to look into that later.
In that case, I recommend changing stackAllocSize
to use K_KERNEL_STACK_LEN
(if kernel thread) instead. And use k_aligned_alloc(Z_KERNEL_STACK_OBJ_ALIGN, stackAllocSize)
for allocation, and this aligns the beginning of stack for you. Please remember all these can only be used for kernel threads.
So, just to be clear then, the code should become:
// K_KERNEL_STACK_LEN() will add extra space that Zephyr may require.
stackAllocSize = K_KERNEL_STACK_LEN(stackSizeBytes);
threadPtr->pStack = k_aligned_alloc(Z_KERNEL_STACK_OBJ_ALIGN, stackAllocSize);
...and we can lose pStackAllocation
, just use pStack
, since the allocation is already aligned?
I think that should work.
Great, many thanks for your help and suggestions, I will try that and report back.
Confirmed that works for us, going to close this issue now, thanks again guys!
Discussed in https://github.com/zephyrproject-rtos/zephyr/discussions/75077
From the template
Target platform:
nucleo_f767zi
What have you tried to diagnose or workaround this issue: nothing yet, that is essentially my question, what facilities does Zephyr provide to debug heap issues. Is this a regression: yes, tagv3.6.0
good, tagv3.7.0-rc1
bad.To Reproduce I don't expect you to be able to reproduce this, but...
nucleo_f767zi
board via SWD.portEventQueue()
throws the error.Expected behavior No exception should occur, freeing memory should succeed without an exception.
Impact Can't move to Zephyr 3.7.0.
Logs and console output See above.
Environment:
nucleo_f767zi
board from a Raspberry Pi using OpenOCD, SWD monitored using OpenOCD.Additional context Build directory, containing
CMakeLists.txt
,prj.conf
and the.overlay
files can be found here: https://github.com/u-blox/ubxlib/tree/master/port/platform/zephyr/runner_stm32, the only addition to what is there beingCONFIG_DCACHE=n
for3.7.0-rc1
which was recommended here: https://github.com/zephyrproject-rtos/zephyr/discussions/75032. Note that we build and run the same thing but for an STM32U5 (nucleo_u575zi_q
board) and that passes the test fine, no exception.