Zephyr 3.7.0-rc1 issue: heap failure

RobMeades commented 3 months ago

Discussed in https://github.com/zephyrproject-rtos/zephyr/discussions/75077

^{Originally posted by **RobMeades** June 27, 2024} I'm testing the move from 3.6.0 to 3.7.0 and one of our tests (running on an STM32F7), which has run happily for years on Zephyr (3.6.0 and earlier), Linux, Windows, etc, is now failing when it tries to free a memory block: ``` E: ***** BUS FAULT ***** E: Imprecise data bus error E: r0/a1: 0x20028400 r1/a2: 0x00002dbf r2/a3: 0x00003fff E: r3/a4: 0x00003fff r12/ip: 0x200142e0 r14/lr: 0x0806adc1 E: xpsr: 0x81000000 E: r4/v1: 0x20028400 r5/v2: 0x0000059b r6/v3: 0x00002cd8 E: r7/v4: 0x2002b0d8 r8/v5: 0x00000000 r9/v6: 0x00000000 E: r10/v7: 0x00000000 r11/v8: 0x00000000 psp: 0x20051f08 E: EXC_RETURN: 0xfffffffd E: Faulting instruction address (r15/pc): 0x0806aea6 E: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0 E: Current thread: 0x20021b90 (main) ### Dumping current thread (main) ### 0x0806aea6: free_chunk /home/zephyr_native-v3.7.0-rc1/zephyr/lib/heap/heap.c:144 (discriminator 1) 0x0806adbc: free_list_remove /home/zephyr_native-v3.7.0-rc1/zephyr/lib/heap/heap.c:64 0x08042e2c: sys_heap_free /home/zephyr_native-v3.7.0-rc1/zephyr/lib/heap/heap.c:197 0x08056bf6: k_heap_free /home/zephyr_native-v3.7.0-rc1/zephyr/kernel/kheap.c:153 0x08077a96: k_free /home/zephyr_native-v3.7.0-rc1/zephyr/kernel/mempool.c:54 ``` Has anyone else seen anything like this and are there any heap debug options (guards etc.) in Zephyr that would help us track down what's going on?

From the template

Target platform: nucleo_f767zi What have you tried to diagnose or workaround this issue: nothing yet, that is essentially my question, what facilities does Zephyr provide to debug heap issues. Is this a regression: yes, tag v3.6.0 good, tag v3.7.0-rc1 bad.

To Reproduce I don't expect you to be able to reproduce this, but...

Clone https://github.com/u-blox/ubxlib.
Follow installation instructions for native Zephyr.
Follow the instructions to build and flash our standard "runner" test build for an STM32-based Zephyr platform.
Connect a debugger to a nucleo_f767zi board via SWD.
Download the binary to this board.
Reset the board and watch the SWD output and you will see that the test portEventQueue() throws the error.

Expected behavior No exception should occur, freeing memory should succeed without an exception.

Impact Can't move to Zephyr 3.7.0.

Logs and console output See above.

Environment:

OS: Zephyr binary is built on Ubuntu, downloaded to the nucleo_f767zi board from a Raspberry Pi using OpenOCD, SWD monitored using OpenOCD.
Toolchain: GCC ARM 13.2.1.
Commit SHA or Version used: https://github.com/zephyrproject-rtos/zephyr/commit/66630961627f4d8625e2dbe1cb7afe4abd690b2e

Additional context Build directory, containing CMakeLists.txt, prj.conf and the .overlay files can be found here: https://github.com/u-blox/ubxlib/tree/master/port/platform/zephyr/runner_stm32, the only addition to what is there being CONFIG_DCACHE=n for 3.7.0-rc1 which was recommended here: https://github.com/zephyrproject-rtos/zephyr/discussions/75032. Note that we build and run the same thing but for an STM32U5 (nucleo_u575zi_q board) and that passes the test fine, no exception.

ajarmouni-st commented 3 months ago

@RobMeades please use the bug report template, thanks!

RobMeades commented 3 months ago

@ajarmouni-st: updated.

nashif commented 3 months ago

Can you please bisect and provide the sha where this started happening?

RobMeades commented 3 months ago

Ummm... not easily! There have been rather a lot of changes in the N months since 3.6.0. Are there any particular areas you might like me to bisect around? To be completely clear, I wasn't looking for a fix, I was looking for the Zephyr-defined debug method for what appears to be a heap corruption issue.

andyross commented 3 months ago

Ummm... not easily! There have been rather a lot of changes in the N months since 3.6.0

To clarify: git bisect will binary search that space for you. My quick count says that there are 6528 patches merged between v3.6.0 and v3.7.0-rc1. That will require ~13 individual retests. See the man page or StackOverflow or whatever for more details, but basically:

git bisect start
git reset --hard v3.6.0 (probably re-test this to be sure it's a good commit for you)
git bisect good
git reset --hard v3.7.0-rc1 (or HEAD, or an older failing commit if you have one)
git bisect bad
Repeat until git tells you to stop:
- Run Test
- git bisect "good" or "bad" according to result

There may be complexities between the versions if e.g. the problem is due to a west dependency, in which case a "west update" at each step would be needed to see it. Likewise sometimes (but rarely) an SDK change will introduce a regression, etc...

RobMeades commented 3 months ago

Thanks for the explanation: this is being run on a regression test system which takes a git tag as its source so we don't have the freedom of using git bisect. I will have to see if there is any way that we can run locally somehow and then figure out how to use git bisect.

RobMeades commented 3 months ago

But the question asked original]y remains: what mechanisms does Zephyr provide to debug heap corruption? We could employ such mechanisms to find out who did what to whom without the need to accommodate git bisect.

RobMeades commented 3 months ago

I have now managed to run the testing locally and use that rather useful git bisect command: the offending commit seems to be 6cd7936f575e511f23bf75288f0dae25bad2fa51 "kernel: align kernel stack size declaration".

Also, while sorting the testing out, I happened to run it on main and that appears not to show the problem so I will bisect my way from v3.7.0-rc1 to main in case there has been a fix.

EDIT: no, I was mistaken, the problem does still occur on main.

andyross commented 3 months ago

If that's really the problem commit, my guess is that you simply have a stack overflow and rearranging the offsets is enough to push you over the edge. @dcpleung might have more input. Have you tried just increasing the stack sizes in your test (I'm gathering this is out-of-tree code?) to see if one of them is the culprit?

RobMeades commented 3 months ago

We call k_thread_stack_space_get() on the tasks in question as part of the test and that reports 872 bytes free out of 1156 allocated; I know that doesn't necessarily mean there isn't a stack overflow but, on the other hand, the margin seems quite large. The stack sizes we use during testing are common across the platforms we support and I know that other platforms require more stack than Zephyr does, hence the stacks tend to be over-sized.

Does the Zephyr heap have the option of switching on guards that are checked on a more frequent basis?

RobMeades commented 3 months ago

I've just noticed this detailed comment in the abstraction we use to manage dynamic threads on Zephyr, above code which uses the Z_KERNEL_STACK_SIZE_ADJUST macro:

            // Zephyr doesn't officially support dynamically allocated stack memory.
            // For this reason we need to do some alignment hacks here.
            // When CONFIG_USERSPACE is enabled Zephyr will check if the stack is
            // "user capable" and then decide whether to use kernel or user space hosted
            // threads. When we pass it a stack pointer from the heap Zephyr will
            // decide to use kernel hosted thread. This is very important at least for
            // 32 bit ARM arch where MPU is enabled. For user space hosted threads
            // the stack alignment requirement is the nearest 2^x of the stack size.
            // Since the only way to align dynamically allocated memory is to adjust the
            // pointer after allocation we would in this case need to allocate the double
            // stack size which of course isn't a solution.
            // Luckily when the thread is kernel hosted the stack alignment is much lower
            // since then only a small MPU guard region is added at the top of the stack.
            // This decreases the stack alignment requirement to 32 bytes.
            //
            // For the above reason the code below will use the Z_KERNEL_STACK_xx defines
            // instead of Z_THREAD_STACK_xx.

            size_t stackAllocSize;
            // Other architectures may have other alignment requirements so just add
            // a simple check that we don't waste a huge amount dynamic memory due to
            // aligment.
            U_ASSERT(Z_KERNEL_STACK_OBJ_ALIGN <= 512);
            // Z_KERNEL_STACK_SIZE_ADJUST() will add extra space that Zephyr may require and
            // to make sure correct allignment we allocate Z_KERNEL_STACK_OBJ_ALIGN extra.
            stackAllocSize = Z_KERNEL_STACK_OBJ_ALIGN + Z_KERNEL_STACK_SIZE_ADJUST(stackSizeBytes);
            threadPtr->pStackAllocation = k_malloc(stackAllocSize);
            // Do the stack alignment
            threadPtr->pStack = (k_thread_stack_t *)ROUND_UP(threadPtr->pStackAllocation,
                                                             Z_KERNEL_STACK_OBJ_ALIGN);

If I swap Z_KERNEL_STACK_SIZE_ADJUST for K_KERNEL_STACK_LEN in that code, the problem goes away. Question is, is that the right fix or does it just happen to move memory around so that the code happens to work? @andyross: what do you think?

dcpleung commented 3 months ago

You can now use k_thread_stack_alloc() to allocate stack space.

RobMeades commented 3 months ago

You can now use k_thread_stack_alloc() to allocate stack space.

Interesting: I don't want to make too many changes just now and we have to remain compatible with previous versions of Zephyr (since we haven't moved forward on the nRFConnect SDK-based version) but I will make a note to look into that later.

dcpleung commented 3 months ago

In that case, I recommend changing stackAllocSize to use K_KERNEL_STACK_LEN (if kernel thread) instead. And use k_aligned_alloc(Z_KERNEL_STACK_OBJ_ALIGN, stackAllocSize) for allocation, and this aligns the beginning of stack for you. Please remember all these can only be used for kernel threads.

RobMeades commented 3 months ago

So, just to be clear then, the code should become:

            // K_KERNEL_STACK_LEN() will add extra space that Zephyr may require.
            stackAllocSize = K_KERNEL_STACK_LEN(stackSizeBytes);
            threadPtr->pStack = k_aligned_alloc(Z_KERNEL_STACK_OBJ_ALIGN, stackAllocSize);

...and we can lose pStackAllocation, just use pStack, since the allocation is already aligned?

dcpleung commented 3 months ago

I think that should work.

RobMeades commented 3 months ago

Great, many thanks for your help and suggestions, I will try that and report back.

RobMeades commented 3 months ago

Confirmed that works for us, going to close this issue now, thanks again guys!

zephyrproject-rtos / zephyr

Zephyr 3.7.0-rc1 issue: heap failure #75084

Discussed in https://github.com/zephyrproject-rtos/zephyr/discussions/75077