Closed keith-packard closed 1 year ago
Thanks Keith for the initial investigations, I'll assign to @evgeniy-paltsev.
Hi @keith-packard
as I can see you are not using release SDK version. I'm wondering if I can download this 0.16.2-rc1-5-g8d9b4fe SDK from somewhere? Or I need to build it myself?
Thanks!
sorry about the random SDK build. Turns out the regular 0.16.2-rc1 also shows the problem, get that from here:
Note that the error I get is this:
ASSERTION FAIL [!sys_dnode_is_linked(&to->node)] @ ZEPHYR_BASE/kernel/timeout.c:110
@keith-packard I'm wondering it this issue reproduces every time or not?
I've run twister 50 times in a loop - and I'm not able to reproduce it. I was using v0.16.2-rc1 SDK (exact version as you've proposed last time), exact same twister command and exact same zephyr commit (https://github.com/zephyrproject-rtos/zephyr/commit/8560fb319249e3fad3122f8541df4c4beef6e5e2).
Note that the error I get is this: ASSERTION FAIL [!sys_dnode_is_linked(&to->node)] @ ZEPHYR_BASE/kernel/timeout.c:110
May I also ask you to send full error log?
And may I ask you to send your value of git describe --abbrev=12 --always HEAD
in Zephyr repo? (at the repo state when you can reproduce issue)
This string is embedded in Zephyr binary - so it may cause code/data layout change.
oof, I was hoping this would be easy to reproduce...
I think the only bits that should affect this are the Zephyr version and the SDK version, right?
$ arc64-zephyr-elf-gcc --version
arc64-zephyr-elf-gcc (Zephyr SDK 0.16.2-rc1) 12.2.0
$ git describe --abbrev=12 --always HEAD
zephyr-v3.4.0-1074-g18636af6d665
I made sure to run 'west update' in the zephyr repo before building. Here's the commands and their output; you can see that it's just hanging today, not generating an error message.
$ west update
=== updating acpica (modules/lib/acpica):
HEAD is now at f16a0b4d0 Update version to 20230628
=== updating canopennode (modules/lib/canopennode):
HEAD is now at dec12fa stack: map CANopenNode crc16_ccitt() to Zephyr crc16_itu_t()
=== updating chre (modules/lib/chre):
HEAD is now at b7955c27 Merge pull request #7 from stephanosio/require_full_libcpp
=== updating cmsis (modules/hal/cmsis):
HEAD is now at 74981bf Update DSP library for additional platforms
=== updating edtt (tools/edtt):
HEAD is now at 64e5105 edttt_bsim: Avoid infinite exception loop
=== updating fatfs (modules/fs/fatfs):
HEAD is now at 427159b zephyr: fix ffconf.h override use of STRINGIZE
=== updating hal_altera (modules/hal/altera):
HEAD is now at 0d225dd zephyr: update include paths to use <zephyr/...>
=== updating hal_atmel (modules/hal/atmel):
HEAD is now at 5ab4300 Remove LITTLE_ENDIAN and __LITTLE_ENDIAN defines
=== updating hal_espressif (modules/hal/espressif):
HEAD is now at 9531c1c178 hal: flash: avoid lock in ISR
=== updating hal_ethos_u (modules/hal/ethos_u):
HEAD is now at 90ada2e Clarify README regarding CPU choice
=== updating hal_gigadevice (modules/hal/gigadevice):
HEAD is now at 2994b7d README.md: add pllmf and pack exception for gd32a50x
=== updating hal_infineon (modules/hal/infineon):
HEAD is now at 0bebc14 hal: infineon: Add blob support for BLESS
=== updating hal_microchip (modules/hal/microchip):
HEAD is now at 5d079f1 mpfs:CMakeLists.txt: Adding mss_gpio directory to build
=== updating hal_nordic (modules/hal/nordic):
HEAD is now at 20ae0a7 nrfx: Update to version 3.1.0
=== updating hal_nuvoton (modules/hal/nuvoton):
HEAD is now at 8a2b5de hal: NuMaker M46x: fix warnings of M467 StdDriver
=== updating hal_nxp (modules/hal/nxp):
HEAD is now at e0116b8e s32k3: remove CONFIG_NXP_S32_PM
=== updating hal_openisa (modules/hal/openisa):
HEAD is now at d1e61c0 vega_sdk_riscv: Ignore GCC warning about potential uninitialized variable
=== updating hal_quicklogic (modules/hal/quicklogic):
HEAD is now at b3a66fe Remove references to FPGA IP core drivers
=== updating hal_renesas (modules/hal/renesas):
HEAD is now at f2d791d Add da1469x_hal
=== updating hal_rpi_pico (modules/hal/rpi_pico):
HEAD is now at b7801e4 Merge pull request #4 from soburi/update-1.5.0
=== updating hal_silabs (modules/hal/silabs):
HEAD is now at a143f03 gecko: add pa-conversions for efr32xg27
=== updating hal_st (modules/hal/st):
HEAD is now at 5948f7b3 sensor/stmemsc: Align stmemsc i/f to v2.02
=== updating hal_stm32 (modules/hal/stm32):
HEAD is now at 65b64134 dts: st: wba: generate stm32wba*-pinctrl.dtsi wba addition
=== updating hal_telink (modules/hal/telink):
HEAD is now at 38573af B91 ADC driver support
=== updating hal_ti (modules/hal/ti):
HEAD is now at ae1db23 zephyr: update SYS_INIT calls
=== updating hal_wurthelektronik (modules/hal/wurthelektronik):
HEAD is now at 24ca987 Initialized WE HAL module repository.
=== updating hal_xtensa (modules/hal/xtensa):
HEAD is now at 41a631d cmake: enable using SOC_TOOLCHAIN_NAME for SoC overlay
=== updating libmetal (modules/hal/libmetal):
HEAD is now at b91611a lib: update libmetal to release v2023.04.0
=== updating liblc3 (modules/lib/liblc3):
HEAD is now at 448f3de Merge pull request #8 from carlescufi/rename
=== updating littlefs (modules/fs/littlefs):
HEAD is now at ca583fd Merge tag 'v2.5.0' of https://github.com/littlefs-project/littlefs into zephyr
=== updating loramac-node (modules/lib/loramac-node):
HEAD is now at ce57712f Don't memset the TXRX buffer when radio is set to receive mode (#1345)
=== updating lvgl (modules/lib/gui/lvgl):
HEAD is now at f7cf01d4 Merge pull request #41 from faxe1008/update_8.3.7
=== updating lz4 (modules/lib/lz4):
HEAD is now at 8e303c2 zephyr: make this repo a zephyr module
=== updating mbedtls (modules/crypto/mbedtls):
HEAD is now at 6e7841e5a Move the implication of MBEDTLS_PSA_CRYPTO_CLIENT where it belongs
=== updating mcuboot (bootloader/mcuboot):
HEAD is now at 6a8746d7 boot_serial: fix image number handle in image upload request
=== updating mipi-sys-t (modules/debug/mipi-sys-t):
HEAD is now at a819419 library: allow building without wchar support
=== updating nanopb (modules/lib/nanopb):
HEAD is now at 42fa8b2 Merge pull request #2 from pdgendt/zephyr
=== updating net-tools (tools/net-tools):
HEAD is now at e0828aa net-setup: Add instructions for NAT and routing
=== updating nrf_hw_models (modules/bsim_hw_models/nrf_hw_models):
HEAD is now at 33b999d GPIO: Fix bug during initialization
=== updating open-amp (modules/lib/open-amp):
HEAD is now at c904a01 lib: update open-amp lib to release v2023.04.0
=== updating openthread (modules/lib/openthread):
HEAD is now at 37fb77098 [cli] fix `childsupervision` commands documentation (#9201)
=== updating picolibc (modules/lib/picolibc):
HEAD is now at d07c38ff0 Version 1.8.2
=== updating segger (modules/debug/segger):
HEAD is now at 5792675 segger: update to 3.40
=== updating sof (modules/audio/sof):
HEAD is now at ffbf9c2a6 Merge pull request #27 from nashif/cache_api
=== updating tflite-micro (modules/lib/tflite-micro):
HEAD is now at 9156d05 tflite-micro: fix module name
=== updating tinycrypt (modules/crypto/tinycrypt):
HEAD is now at 3e9a49d cmake: Fix conditional in root CMakeLists.txt
=== updating TraceRecorderSource (modules/debug/TraceRecorder):
HEAD is now at bc839bf zephyr: update SYS_INIT calls
=== updating trusted-firmware-m (modules/tee/tf-m/trusted-firmware-m):
HEAD is now at 79a6115d3 Build: Install crypto_platform.h for Crypto service
=== updating trusted-firmware-a (modules/tee/tf-a/trusted-firmware-a):
HEAD is now at 421dc0502 Merge pull request #3 from SgrrZhf/topics/huizha01/2.9.0-updates
=== updating tf-m-tests (modules/tee/tf-m/tf-m-tests):
HEAD is now at 0f80a65 Merge pull request #6 from microbuilder/tfm17_limit_attestation
=== updating psa-arch-tests (modules/tee/tf-m/psa-arch-tests):
HEAD is now at 6a17330 dev-apis: platform: Increase timeout for nRF devs
=== updating uoscore-uedhoc (modules/lib/uoscore-uedhoc):
HEAD is now at e892019 Merge pull request #11 from rlubos/zephyr-integration
=== updating zcbor (modules/lib/zcbor):
HEAD is now at 67fd8bb Update version from 0.6.99 to 0.7.0
=== updating zscilib (modules/lib/zscilib):
HEAD is now at 34c3432 Merge pull request #50 from microbuilder/ztest_fixes
=== updating thrift (modules/lib/thrift):
HEAD is now at 10023645a Merge pull request #1 from zephyrproject-rtos/add-zephyr-module-dir
$ ./scripts/twister -O arc64-bug -xCONFIG_PICOLIBC=y -p qemu_arc_hs5x --test samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization
Install tabulate python module with pip to use --device-testing option.
Renaming output directory to arc64-bug.6
INFO - Using Ninja..
INFO - Zephyr version: zephyr-v3.4.0-1074-g18636af6d665
INFO - Using 'zephyr' toolchain.
INFO - Building initial testsuite list...
INFO - Writing JSON report arc64-bug/testplan.json
INFO - JOBS: 64
INFO - Adding tasks to the queue...
INFO - Added initial list of jobs to queue
ERROR - qemu_arc_hs5x samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization FAILED : Timeout
ERROR - see: /build/keithp/zephyrproject/zephyr/arc64-bug/qemu_arc_hs5x/samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization/handler.log
INFO - Total complete: 1/ 1 100% skipped: 0, failed: 1, error: 0
INFO - 1 test scenarios (1 test instances) selected, 0 configurations skipped (0 by static filter, 0 at runtime).
INFO - 0 of 1 test configurations passed (0.00%), 1 failed, 0 errored, 0 skipped with 0 warnings in 66.81 seconds
INFO - In total 1 test cases were executed, 0 skipped on 1 out of total 584 platforms (0.17%)
INFO - 1 test configurations executed on platforms, 0 test configurations were only built.
INFO - Saving reports...
INFO - Writing JSON report /build/keithp/zephyrproject/zephyr/arc64-bug/twister.json
INFO - Writing xunit report /build/keithp/zephyrproject/zephyr/arc64-bug/twister.xml...
INFO - Writing xunit report /build/keithp/zephyrproject/zephyr/arc64-bug/twister_report.xml...
INFO - -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
INFO - The following issues were found (showing the top 10 items):
INFO - 1) samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization on qemu_arc_hs5x failed (Timeout)
INFO -
INFO - To rerun the tests, call twister using the following commandline:
INFO - west twister -p <PLATFORM> -s <TEST ID>, for example:
INFO -
INFO - west twister -p qemu_arc_hs5x -s samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization
INFO - or with west:
INFO - west build -p -b qemu_arc_hs5x -T samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization
INFO - -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
INFO - Run completed
$ cat /build/keithp/zephyrproject/zephyr/arc64-bug/qemu_arc_hs5x/samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization/handler.log
*** Booting Zephyr OS build zephyr-v3.4.0-1074-g18636af6d665 ***
Did a bit more debugging today. With the code as above, the fault happens when vprintk
is called from boot_banner
.
(gdb) x/20i vprintk
0x80400214 <vprintk>: st.aw r15,[sp,-4]
0x80400218 <vprintk+4>: st.aw r14,[sp,-4]
0x8040021c <vprintk+8>: st.aw blink,[sp,-4]
0x80400220 <vprintk+12>: sub3 sp,sp,0x2
0x80400224 <vprintk+16>: mov_s r14,r0
0x80400226 <vprintk+18>: mov_s r15,r1
0x80400228 <vprintk+20>: mov_s r2,0x10
0x8040022a <vprintk+22>: mov_s r1,0
0x8040022c <vprintk+24>: bl.d 27444 ;0x80406d60 <memset>
0x80400230 <vprintk+28>: mov_s r0,sp
0x80400232 <vprintk+30>: stb 2,[sp,2]
0x80400236 <vprintk+34>: mov_s r0,0x804001fc
0x8040023c <vprintk+40>: mov_s r2,r15
0x8040023e <vprintk+42>: st.as r0,[sp,1]
0x80400242 <vprintk+46>: mov_s r1,r14
0x80400244 <vprintk+48>: bl.d 27708 ;0x80406e80 <vfprintf>
0x80400248 <vprintk+52>: mov_s r0,sp
0x8040024a <vprintk+54>: add3 sp,sp,0x2
0x8040024e <vprintk+58>: ld.ab blink,[sp,4]
0x80400252 <vprintk+62>: ld.ab r14,[sp,4]
When I place a break point at 0x8040024a (right after vfprintf returns), I notice a difference in register contents depending upon whether I also place a break point at 0x80400242 (right before vprintf is called). When the system stops at 242 and is then continued, the application works correctly. When there is no break point at 242, the application hangs, having branched to address 0x0.
The register contents at 0x8040024a when the application works looks like this:
r0 0x41 65
r1 0x804087c9 -2143254583
r2 0x80408980 -2143254144
r3 0x80409c4c -2143249332
r4 0xffffffff -1
r5 0x0 0
r6 0x80408a80 -2143253888
r7 0xffffffff -1
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0xaaaaaaaa -1431655766
r14 0x80408788 -2143254648
r15 0x80409c5c -2143249316
r16 0x0 0
r17 0x0 0
r18 0x80409c9c -2143249252
r19 0xaaaaaaaa -1431655766
r20 0xaaaaaaaa -1431655766
r21 0xaaaaaaaa -1431655766
r22 0xaaaaaaaa -1431655766
r23 0xaaaaaaaa -1431655766
r24 0xaaaaaaaa -1431655766
r25 0xaaaaaaaa -1431655766
r26 0xaaaaaaaa -1431655766
fp 0xaaaaaaaa -1431655766
sp 0x80409c3c -2143249348
ilink 0x80400244 -2143288764
gp 0x80409c94 -2143249260
blink 0x8040024a -2143288758
acc0 0xaaaaaaaa -1431655766
pcl 0x80400248 -2143288760
pc 0x8040024a 0x8040024a <vprintk+54>
status32 0x7ffe8008081c [ E=14 Z RB=0 AD IE ]
bta 0x8040024a 0x8040024a <vprintk+54>
eret 0x0 0x0
timer_build 0x7ffe00000304 [ version=4 t0 t1 p0=0 p1=0 ]
irq_build 0xf141901 [ version=1 IRQs=25 exts=20 p=15 ]
vecbase_build 0x10 [ version=4 addr=0 ]
isa_config 0x12e47402 [ version=2 va_size=4 pa_size=4 a=3 n m=2 c=0 d=1 ]
timer_count0 0x7fffcaaf 2147469999
timer_ctrl0 0x3 [ ie nh ]
timer_limit0 0x7fffcaaf 2147469999
timer_count1 0x80000927 -2147481305
timer_ctrl1 0x0 [ ]
timer_limit1 0xffffff 16777215
erstatus 0x0 [ e=0 rb=0 ]
erbta 0x0 0x0
ecr 0x0 [ parameter=0 causecode=0 vectornumber=0 ]
efa 0x7ffe00000000 140728898420736
icause 0x0 0
aux_irq_ctrl 0x207 [ nr=7 b ]
aux_irq_act 0x0 [ active=0 ]
irq_priority_pending 0x0 0
aux_irq_hint 0x0 0
irq_select 0x10 16
irq_enable 0x1 true
irq_trigger 0x0 false
irq_status 0x7ffe00000010 [ p=0 e ]
irq_pulse_cancel 0x0 false
irq_priority 0x0 0
mmu_ctrl {en = 0x0, ku = 0x0, wx = 0x0, tce = 0x0} {en = false, ku = false, wx = false, tce = false}
mmu_rtp0 {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}
mmu_rtp1 {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}
the register contents at 0x8040024a when the application fails looks like this:
r0 0x41 65
r1 0x804087c9 -2143254583
r2 0x80408980 -2143254144
r3 0x80409c4c -2143249332
r4 0xffffffff -1
r5 0x0 0
r6 0x80408a80 -2143253888
r7 0xffffffff -1
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0xaaaaaaaa -1431655766
r14 0x80408788 -2143254648
r15 0x80409c5c -2143249316
r16 0x0 0
r17 0x0 0
r18 0x80409c9c -2143249252
r19 0xaaaaaaaa -1431655766
r20 0xaaaaaaaa -1431655766
r21 0xaaaaaaaa -1431655766
r22 0xaaaaaaaa -1431655766
r23 0xaaaaaaaa -1431655766
r24 0xaaaaaaaa -1431655766
r25 0xaaaaaaaa -1431655766
r26 0xaaaaaaaa -1431655766
fp 0xaaaaaaaa -1431655766
sp 0x80409c38 -2143249352
ilink 0x8040024a -2143288758
gp 0x80409c94 -2143249260
blink 0x8040024a -2143288758
acc0 0xaaaaaaaa -1431655766
pcl 0x8040024a -2143288758
pc 0x8040024a 0x8040024a <vprintk+54>
status32 0x7ffc8008081c [ E=14 Z RB=0 AD IE ]
bta 0x80401570 0x80401570 <_rirq_exit+72>
eret 0x0 0x0
timer_build 0x7ffc00000304 [ version=4 t0 t1 p0=0 p1=0 ]
irq_build 0xf141901 [ version=1 IRQs=25 exts=20 p=15 ]
vecbase_build 0x10 [ version=4 addr=0 ]
isa_config 0x12e47402 [ version=2 va_size=4 pa_size=4 a=3 n m=2 c=0 d=1 ]
timer_count0 0x7fffcaaf 2147469999
timer_ctrl0 0x3 [ ie nh ]
timer_limit0 0x7fffcaaf 2147469999
timer_count1 0x80000927 -2147481305
timer_ctrl1 0x0 [ ]
timer_limit1 0xffffff 16777215
erstatus 0x0 [ e=0 rb=0 ]
erbta 0x0 0x0
ecr 0x0 [ parameter=0 causecode=0 vectornumber=0 ]
efa 0x7ffc00000000 140720308486144
icause 0x0 0
aux_irq_ctrl 0x207 [ nr=7 b ]
aux_irq_act 0x0 [ active=0 ]
irq_priority_pending 0x0 0
aux_irq_hint 0x0 0
irq_select 0x10 16
irq_enable 0x1 true
irq_trigger 0x0 false
irq_status 0x7ffc00000010 [ p=0 e ]
irq_pulse_cancel 0x0 false
irq_priority 0x0 0
mmu_ctrl {en = 0x0, ku = 0x0, wx = 0x0, tce = 0x0} {en = false, ku = false, wx = false, tce = false}
mmu_rtp0 {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}
mmu_rtp1 {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}
Checking the differences, we see:
--- reg-working 2023-07-13 11:54:43.495966032 -0700
+++ reg-broken 2023-07-13 11:26:44.252615552 -0700
@@ -27,17 +27,17 @@
r25 0xaaaaaaaa -1431655766
r26 0xaaaaaaaa -1431655766
fp 0xaaaaaaaa -1431655766
-sp 0x80409c3c -2143249348
-ilink 0x80400244 -2143288764
+sp 0x80409c38 -2143249352
+ilink 0x8040024a -2143288758
gp 0x80409c94 -2143249260
blink 0x8040024a -2143288758
acc0 0xaaaaaaaa -1431655766
-pcl 0x80400248 -2143288760
+pcl 0x8040024a -2143288758
pc 0x8040024a 0x8040024a <vprintk+54>
-status32 0x7ffe8008081c [ E=14 Z RB=0 AD IE ]
-bta 0x8040024a 0x8040024a <vprintk+54>
+status32 0x7ffc8008081c [ E=14 Z RB=0 AD IE ]
+bta 0x80401570 0x80401570 <_rirq_exit+72>
eret 0x0 0x0
-timer_build 0x7ffe00000304 [ version=4 t0 t1 p0=0 p1=0 ]
+timer_build 0x7ffc00000304 [ version=4 t0 t1 p0=0 p1=0 ]
irq_build 0xf141901 [ version=1 IRQs=25 exts=20 p=15 ]
vecbase_build 0x10 [ version=4 addr=0 ]
isa_config 0x12e47402 [ version=2 va_size=4 pa_size=4 a=3 n m=2 c=0 d=1 ]
@@ -50,7 +50,7 @@
erstatus 0x0 [ e=0 rb=0 ]
erbta 0x0 0x0
ecr 0x0 [ parameter=0 causecode=0 vectornumber=0 ]
-efa 0x7ffe00000000 140728898420736
+efa 0x7ffc00000000 140720308486144
icause 0x0 0
aux_irq_ctrl 0x207 [ nr=7 b ]
aux_irq_act 0x0 [ active=0 ]
@@ -59,7 +59,7 @@
irq_select 0x10 16
irq_enable 0x1 true
irq_trigger 0x0 false
-irq_status 0x7ffe00000010 [ p=0 e ]
+irq_status 0x7ffc00000010 [ p=0 e ]
irq_pulse_cancel 0x0 false
irq_priority 0x0 0
mmu_ctrl {en = 0x0, ku = 0x0, wx = 0x0, tce = 0x0} {en = false, ku = false, wx = false, tce = false}
Critically important here is that sp
is different. In the failing case, when blink
is loaded from the stack, the value is 0, causing the system to branch to 0x0.
If I reset sp
and ilink
to the correct values at this point, then the application works correctly.
Ok, after even more debugging, I think I've discovered a bug in qemu?
Here's the last two instructions in vfprintf:
0x80406f10 <vfprintf+136>: j_s.d [blink]
0x80406f12 <vfprintf+138>: ld.ab r25,[sp,4]
When stopped at the first one, blink
points at the return address in vprintk:
0x8040023e <vprintk+42>: st.as r0,[sp,1]
0x80400242 <vprintk+46>: mov_s r1,r14
0x80400244 <vprintk+48>: bl.d 27716 ;0x80406e88 <vfprintf>
0x80400248 <vprintk+52>: mov_s r0,sp
=> 0x8040024a <vprintk+54>: add3 sp,sp,0x2
0x8040024e <vprintk+58>: ld.ab blink,[sp,4]
0x80400252 <vprintk+62>: ld.ab r14,[sp,4]
0x80400256 <vprintk+66>: j_s.d [blink]
0x80400258 <vprintk+68>: ld.ab r15,[sp,4]
I stick a break point on both addresses along with _isr_wrapper
, stop at the first to wait for a timer interrupt to pend. At this point, sp
is 0x80409c40. Continue, and the timer interrupt fires causing the processor to handle the interrupt. At this point, sp
is 0x80409c00. That means it moved the stack by 16 values, but there are 17 values to store here -- 14 registers, blink, pc and status32. Looking at memory at the old stack location we find that qemu has actually smashed that value, storing status32 there.
When the rtie instruction executes, it pops all 17 values from the stack, leaving sp
at 0x80409c44, which doesn't match the value from before the interrupt was handled.
ok, I'm pretty sure this is a bug in qemu caused by combining an interrupt and a jump with a delay slot. In this bug, the timer fires exactly as vfprintf
is in the middle of returning, in this sequence:
0x80406f08 <vfprintf+128>: ld.ab r23,[sp,4]
0x80406f0c <vfprintf+132>: ld.ab r24,[sp,4]
0x80406f10 <vfprintf+136>: j_s.d [blink]
0x80406f12 <vfprintf+138>: ld.ab r25,[sp,4]
The instruction in the delay slot increments the stack pointer. If the timer fires right on top of the j_s.d [blink]
instruction, then the stack pointer gets mangled for the next few instructions. If you place a break point before that happens, even a single instruction, then the timer will expire while the breakpoint is pending and fire immediately after you continue execution, so the interrupt occurs before the branch and things are just fine.
I'm thinking we should just disable branch delay slots in the qemu targets; they're certainly not an optimization for this platform and it seems that qemu doesn't have them quite right. When I do that, everything is just fine.
Ok, looks like this qemu bug is fixed upstream and we just need a newer version in the SDK to get this to work.
Reopen as we haven't switched to new SDK yet (but the qequired fixes are merged to SDK repo)
Re-opening as a reminder to revert the workaround when the new SDK with qemu fix is used in CI.
Describe the bug When running the sample.portability.cmsis_rtos_v2.timer_synchronization test on arc64 under qemu using picolibc,
z_init_static_threads
goes into an infinite loop in the second instance of_FOREACH_STATIC_THREAD
where it's callingschedule_new_thread
.Debugging this, I discover that the register holding the termination value has been trashed and holds a completely bogus value (0x4275de2 instead of 0x80407594). Perturbing the code above the loop in essentially any way at all (I added
__asm__("")
just afterk_sched_lock()
) makes this issue go away. The difference in the assembly code is trivial -- it moves the load of__static_thread_data_list_start
after the call tok_sched_lock
Please also mention any information which could help others to understand the problem you're facing:
What target platform are you using?
What have you tried to diagnose or workaround this issue?
Modifying the code in even trivial ways appears to work around this issue. I applied this diff:
This generated the following change in the generated code:
setting a break point on the
cmp_s
instruction here also makes things work. That makes me suspect that qemu might be at fault somehow?To Reproduce Steps to reproduce the behavior:
Expected behavior Test completes successfully
Impact This will delay the switch to using picolibc as the default C library.
Environment (please complete the following information):