zephyriot / zephyr-issues

0 stars 0 forks source link

Debugging difficulties on Cortex-M with frame pointer missing #1775

Open nashif opened 7 years ago

nashif commented 7 years ago

Reported by David Brown:

With the commit:

commit f9733e492dd6df4a6deb71c43aaaf8af8ba7b801
Author: Daniel Thompson <daniel.thompson@linaro.org>
Date:   Tue Feb 7 09:15:02 2017 +0000

    misc: Let the compiler choose whether to omit frame pointer

I find debugging difficulties on the FRDM-K64F, even when building with CONFIG_DEBUG=y.

A typical scenario will be that I am stepping through the code, and when I step over a printk message, I'll get output resembling the following:

[MCUBOOT] [INF] boot_status_source: Image 0: magic=good, copy_done=0x1, ima***** USAGE FAULT *****
  Executing thread ID (thread): 0x20001700
  Faulting instruction address:  0xbe00
  Illegal use of the EPSR
Fatal fault in ISR! Spinning...

where the text up to the ***** is supposed to be printed. If the code is run without debugging, this doesn't happen, and if I set:

CONFIG_OVERRIDE_FRAME_POINTER_DEFAULT=y
CONFIG_OMIT_FRAME_OPINTER=n

the problem does not occur.

(Imported from Jira ZEP-1928)

nashif commented 7 years ago

by Daniel Thompson:

nashif commented 7 years ago

by Bogdan Davidoaia:

nashif commented 7 years ago

by Daniel Thompson:

nashif commented 7 years ago

by Bogdan Davidoaia:

nashif commented 7 years ago

by Daniel Thompson:

Hmnn....

To be honest this sounds much more like a tooling issue than something in v1.7.0: disabling the frame pointer is also supported in v1.6.0, its only the default that changes in v1.7.0 because the generated code quality is so much better without a frame pointer.

Can you share more info about your toolset, debugger and JTAG tooling versions? Additionally are you able to reproduce this using any of the samples?

For the record, before posting this comment I double checked that stepping out of (finish), over (next) and through (step, step, ...) printk() works OK. I used BOARD=qemu_cortex_m3 and, somewhat arbitrarily, chose samples/shell as the test application.

nashif commented 7 years ago

by David Brown:

The toolchain is built with the 0.9 release binary toolchain. I can look into doing this with other apps. I can definitely cause the problem (or fix the problem) by changing the setting of the frame pointer compilation flag.

nashif commented 7 years ago

by Daniel Thompson:

It would certainly be useful to reproduce it on other apps. Also aren't there also debug components that might affect reproduction that are outside the scope of the SDK (e.g. OpenOCD, JTAG probe, etc)?

nashif commented 7 years ago

by Sharron LIU:

I did not observe this when debugging "samples/hello" with qemu_x86. Should this issue be moving to "Zehpyr SDK" project?

nashif commented 7 years ago

by Mark Linkmeyer:

Setting the planned FixVersion to v1.8.0 since it's a high priority bug and to meet release criteria all high priority bugs must be fixed.

nashif commented 7 years ago

by Maureen Helm:

I haven't been able to reproduce this either. I used mcuboot and zephyr based on the latest master branches, openocd 0.10, and frdm-k64f. David Brown , are you still seeing this issue?

nashif commented 7 years ago

by David Brown:

Yes, this happens to me consistently unless I set CONFIG_OVERRIDE_FRAME_POINTER_DEFAULT=y. Perhaps it requires CONFIG_CONSOLE_HANDLER=y CONFIG_SYS_LOG=y and CONFIG_DEBUG=y.

An easy way to reproduce this with mcuboot is to set a breakpoint on boot_status_source and then do 'next'. When I step over the first debug BOOT_LOG_SWAP_STATE call, I get the usage fault on the console.

nashif commented 7 years ago

by David Brown:

I don't know if it matters, but I am using the Segger debug firmware on the K64F, not the openocd one.

nashif commented 7 years ago

by Daniel Thompson:

When I saw the priority of this bumped I tried a bit harder to reproduce by testing on real hardware. I don't have a K64F but I did a slash 'n burn port of Zephyr for STM32F4Discovery (which at least uses OpenOCD) and wasn't able to trigger anything (stepping through and stepping over printk() was fine for me).

Have you tried to reproduce this behaviour using the sample programs yet?

nashif commented 7 years ago

by Bogdan Davidoaia:

Managed to reproduce and pinpoint the issue David encountered. He gave me his setup scripts, without which I probably wouldn't have been able to reproduce it either. I also used the Segger firmware with the JLink debugger, but I don't know if the bug is related to this.

The bug actually happens when you put a breakpoint on main. Breakpoints in boot_status_source, printk, _main or even in main using filename + linenumber don't have any effect on the issue.

The bug also reproduces with the hello_world sample without modifying the prj.conf file. As David said, the problem is related to omit-frame-pointer, so with CONFIG_OVERRIDE_FRAME_POINTER_DEFAULT=y and CONFIG_OMIT_FRAME_POINTER=n it doesn't occur. The bug isn't related to printk, as it reproduces even if I delete all the printk calls.

Strangely enough, the bug doesn't happen if I just rename main to my_main in the sample and kernel/init.c. So, I guess either the compiler or the debugger does something different for "main", although haven't figured out what yet.

nashif commented 7 years ago

by Daniel Thompson:

Awesome that you've found a way to reproduce! Is it possible to distill it down a tiny more to give us certainty testing other platforms?

More specifically do the following instructions describe a reproduction using the hello_world example? Can you either confirm or correct the below:

Steps to reproduce:

  1. Burn flash, etc.
  2. Connect via debugger
  3. break main
  4. break printk
  5. run
  6. backtrace

Expected result:

backtrace shows printk() being called from main()

Actual result:

backtrace shows printk() being called from fatal exception handling logic (_UsageFault()?).

nashif commented 7 years ago

by Bogdan Davidoaia:

The steps I used to reproduce it are:

  1. Build hello_world sample from master without any modifications
  2. Flash .bin to board
  3. Connect to console to see printk output
  4. Connect debugger
  5. break main
  6. run
  7. hit main breakpoint
  8. continue
  9. check console output

Expected result: see boot banner and hello world message

Actual result: see boot banner, hello world message and fault message

nashif commented 7 years ago

by Daniel Thompson:

Interesting... I can't reproduce this using openocd/STLink and 96b_carbon. I can't follow step #5 precisely ('The "remote" target does not support "run". Try "help target" or "continue".') but other than that there are no fault messages and behaviour is identical with or without frame pointers.

nashif commented 7 years ago

by Daniel Thompson:

As a final check I tried to get gdb to log remote protocol packets ("set debug remote 1") to I could see if gdb was doing anything materially different when zephyr was built with ([^with_frame_pointer.log]) and without ([^without_frame_pointer.log]) the frame pointer. I'm afraid I couldn't see any real difference. There are obvious differences in the addresses, CRC values and register contents but this is pretty much what you expect from the improved code generation when the compiler don't have to maintain the frame pointer. For sure there are no extra packets nor any other debugger weirdness.

Would you mind repeating the same experiment on the probe+platform where this reproduces?

nashif commented 7 years ago

by David Brown:

I'm pretty sure that gdb does do special handling for the case of main, but it has been a long time since I looked at the code.

nashif commented 7 years ago

by David Brown:

Given that pyocd doesn't show these problems, I wonder if Segger is doing something in their gdb stub based on the 'main' symbol. That'd be a bit disturbing, but is certainly possible.

nashif commented 7 years ago

by Daniel Thompson:

Just seen the traces from Bogdan...

For some reason on frdm_k64f gdb ends up placing hardware breakpoints over what I think is the vector table (0x4/reset without omit fp and 0x3c/systick with omit fp). That definitely doesn't happen on the STM32's I tried this on.

Technically the CPU never executes the vector table so a hardware breakpoint hasn't really got any business firing... however it really is a very odd thing for the debugger to be doing. Especially given all those memory reads in the with-omit-fp case are reading up through the vector table until it reaches 0x3c...

I think I can (sort of) reproduce on 96b_carbon but deliberately slapping an insane breakpoint over my vector table:

(gdb) break main Breakpoint 1 at 0x800043c: file /home/drt/Development/Zephyr/zephyr/samples/hello_world/src/main.c, line 12. (gdb) break *0x800003c Breakpoint 2 at 0x800003c (gdb) break _UsageFault Breakpoint 3 at 0x80016b8: file /home/drt/Development/Zephyr/zephyr/arch/arm/core/fault.c, line 197. (gdb) c Continuing. Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 1, main () at /home/drt/Development/Zephyr/zephyr/samples/hello_world/src/main.c:12 12 printk("Hello World! %s\n", CONFIG_ARCH); (gdb) c Continuing. Info : halted: PC: 0x0800043e

Breakpoint 3, _UsageFault (esf=esf@entry=0x200010d0 <_interrupt_stack+2016>) at /home/drt/Development/Zephyr/zephyr/arch/arm/core/fault.c:197 197 {

[Note that the breakpoint on _UsageFault is not needed to reproduce but I need it to workaround a misfeature in STM32 HAL that prevents the usage fault message from being printed on the UART]

Anyhow the tentative conclusion on my side is that the toolset behaviour looks broken both with and without omit fp but unfortunately without omit fp the toolset selects a more annoying vector to crap on.

nashif commented 7 years ago

by Bogdan Davidoaia:

I managed to reproduce it also with the hello_world sample and openocd v0.10.0, so I don't think it's related to the Segger JLink.

I've set debug remote 1 as Daniel suggested and I indeed see there are extra packets when setting the breakpoint on main when omit-frame-pointer is used (see [^without_omit_fp.log] and [^with_omit_fp.log]). I am not too familiar with how gdb actually does things under its hood, so it would be helpful if you could point what I should look for next given these logs.

I was also wrong when I said the problem is that the function is called 'main'. When I did the renaming to 'my_main' I didn't rename the __weak main function declared at kernel/init.c:201, since I thought that it would be removed, since it wouldn't be used either way. Renaming it, in addition to the occurrences at kernel/init.c:193,195 and in the hello_world sample, and setting the breakpoint to my_main will cause the bug to occur.

Keeping everything named 'main', but deleting just the __weak main function from the source file will cause the bug not to occur. However, I don't know if this function is actually causing the problem, since deleting it could just change the compilation context of the code for main.

nashif commented 7 years ago

by Daniel Thompson:

Technically the CPU never executes the vector table so a hardware breakpoint hasn't really got any business firing...

Reflecting a bit deeper on that this is perhaps not so strange. The Cortex-M4 is a harvard-ish architecture and will load the vector table pointer using I-side memory interface. So its only a guess but I suspect the hardware breakpointing works by observing I-side memory fetches.

nashif commented 7 years ago

by Bogdan Davidoaia:

Guess you could see the attached files before I finished writing my comment :)

What you said about the placing of the breakpoints makes sense, so I probably isn't directly related to the __weak main function.

nashif commented 7 years ago

by Daniel Thompson:

Actually after a little whilst staring at objdump output I think this is a conjunction of:

  1. A breakpoint placed upon a __weak function (weak functions still end up with DWARF records; their PC is set to 0)
  2. A target with the FLASH (and therefore a .text section) at 0x0 (normally gdb skips the records of weak functions because their PC is out of range)
  3. Something (possibly prologue parsing) related to -f[no-]omit-frame-pointer that ends up with the spurious breakpoint in the vector table changing addres.

I still don't fully understand #3 but comparing gdb behaviour ("info address main") and "objdump -g" of binaries from STM32 (FLASH@0x800000) with nRF51 and K64F makes me pretty confident about #1 and #2.

I have a wild and untested theory that we might just be able to solve the problem by writing out the vector table as .rodata rather than .text and hacking the linker scripts accordingly. I think that might persuade gdb not to place a breakpoint on the weak main symbol.

nashif commented 7 years ago

by Daniel Thompson:

Just to be clear... I don't think there is much more I can afford to do to drive this investigation forward.

Ideally I'd like to see a toolset expert involved in order to do a bit of blame allocation. Personally I think #2 is bogus... the debugger is placing a breakpoint on code that doesn't actually exist in the image, however I'm not especially clear how best to arbitrate blame for this between the linker and the debugger.

nashif commented 7 years ago

by Bogdan Davidoaia:

Tried to look more into this today, but haven't managed to make out anything new out of the issue. I don't understand why the debugger takes into account the weak symbol, and doesn't just consider only the address of the strong symbol.

I agree with Daniel that we should get a toolset expert involved in this.

nashif commented 7 years ago

by Mark Linkmeyer:

Hi Maureen Helm , any update on this one? I ask only because I'm looking at all the high priority bugs that are still open. I'm just trying to help see what can be done to get them closed for the upcoming 1.8 release. I scanned the recent comments and see that help from a toolset expert is needed. Any idea who that would be? Thanks!

nashif commented 7 years ago

by Maureen Helm:

Hi Mark Linkmeyer , I brought this one up in our process meeting to ask if you or Anas Nashif might have someone that could help.

nashif commented 7 years ago

by Mark Linkmeyer:

Sorry Maureen Helm , your request slipped through the cracks :-( and I just now got back to this issue as I'm checking where we are with things in preparation for the upcoming release. (Kind of suggests a project-wide Triage Meeting would be useful to prevent this delay from happening again. Let's discuss this in the next Process WG meeting this week.) I've asked my team to see if anyone can help you with this.

nashif commented 7 years ago

by Anas Nashif:

Maureen Helm We do not have anyone that can help on this. I wonder why this is marked as P1 in the first place, its been there for over 2 months and I do not see how this is blocking anyone. I am lowering the priority.

nashif commented 7 years ago

by Anas Nashif:

Already in the original bug report there is a workaround, so I do not think this should block 1.8

nashif commented 7 years ago

by Daniel Thompson:

I've been having a quick go so see if I can reproduce this behaviour with another toolset (I used the Linaro EABI toolset based on gcc 6.3.1).

I don't have a FRDM-K64F so I can't be absolutely sure but based on experiments with qemu (which has flash at 0x0) I think this problem will also go away if you build with the Linaro tools!

As mentioned previously we need three things to reproduce:


1. A breakpoint placed upon a __weak function (weak functions still end up with DWARF records; their PC is set to 0)
2. A target with the FLASH (and therefore a .text section) at 0x0 (normally gdb skips the records of weak functions because their PC is out of range)
3. Something (possibly prologue parsing) related to -f[no-]omit-frame-pointer that ends up with the spurious breakpoint in the vector table changing address.
~~~ cut here ~~~

Using the Linaro toolset is still merely a workaround; the debugger does still attempt to place a breakpoint on the non-existent weak main however there is no pre-amble parsing. The breakpoint is dropped right at 0x0 (Initial SP) which is fairly harmless (or at least no more harmful that placing it on the reset vector).

The behaviour of the Linaro toolset is identical with or without omit-frame-pointer .
nashif commented 7 years ago

by David Brown:

I wonder if this could be solved in Zephyr by not placing the vector table in .text, but maybe giving it its own section.