zephyriot / zephyr-issues

0 stars 0 forks source link

sporadic bad RAM pointer error under qemu_nios2 #640

Open nashif opened 8 years ago

nashif commented 8 years ago

Reported by Andrew Boie:

08:55:56 ---------------------sanity-out/qemu_nios2/tests/kernel/test_mutex/test/run.log--------------------- 08:55:56 make[1]: Entering directory /jenkins/workspace/zephyr-verify/tests/kernel/test_mutex' 08:55:56 make[2]: Entering directory/jenkins/workspace/zephyr-verify' 08:55:56 make[3]: Entering directory /jenkins/workspace/zephyr-verify/sanity-out/qemu_nios2/tests/kernel/test_mutex/test' 08:55:56 Using /jenkins/workspace/zephyr-verify as source for kernel 08:55:56 GEN ./Makefile 08:55:56 CHK include/generated/version.h 08:55:56 CHK misc/generated/configs.c 08:55:56 CHK include/generated/offsets.h 08:55:56 CHK misc/generated/sysgen/prj.mdef 08:55:56 [QEMU] CPU: nios2 08:55:56 QEMU 2.1.3 monitor - type 'help' for more information 08:55:56 (qemu) Bad ram pointer (nil) 08:55:56 make[3]: *** [qemu] Aborted (core dumped) 08:55:56 make[3]: Leaving directory/jenkins/workspace/zephyr-verify/sanity-out/qemu_nios2/tests/kernel/test_mutex/test' 08:55:56 make[2]: [sub-make] Error 2 08:55:56 make[2]: Target qemu' not remade because of errors. 08:55:56 make[2]: Leaving directory/jenkins/workspace/zephyr-verify' 08:55:56 make[1]: [qemu] Error 2 08:55:56 make[1]: Leaving directory `/jenkins/workspace/zephyr-verify/tests/kernel/test_mutex' 09:02:53 ---------------------sanity-out/qemu_nios2/tests/kernel/test_mutex/test/run.log---------------------

This issue is only intermittently reproducible. It's not known whether this can affect any test or is specific to test_mutex. The crash appears to happen before any console output is displayed.

(Imported from Jira ZEP-678)

nashif commented 8 years ago

by Andrew Boie:

nashif commented 8 years ago

by Andrew Boie:

Juro Bystricky this is difficult to debug since it's not know where the null pointer dereference is taking place, it just prints the bad pointer value and coredumps. How hard would it be for it to dump registers (at least the program counter) and not explode?

nashif commented 8 years ago

by Javier B Perez:

qemu_nios2 tests/kernel/test_pipe_priv/test FAILED: build_error -------------------sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test/run.log------------------- make[1]: Entering directory /jenkins/workspace/zephyr-merge/tests/kernel/test_pipe_priv' make[2]: Entering directory/jenkins/workspace/zephyr-merge' make[3]: Entering directory /jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test' Using /jenkins/workspace/zephyr-merge as source for kernel GEN ./Makefile CHK include/generated/version.h CHK misc/generated/configs.c CHK include/generated/offsets.h CHK misc/generated/sysgen/prj.mdef [QEMU] CPU: nios2 QEMU 2.1.3 monitor - type 'help' for more information (qemu) Bad ram pointer (nil) make[3]: *** [qemu] Aborted (core dumped) make[3]: Leaving directory/jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test' make[2]: [sub-make] Error 2 make[2]: Target qemu' not remade because of errors. make[2]: Leaving directory/jenkins/workspace/zephyr-merge' make[1]: [qemu] Error 2 make[1]: Leaving directory `/jenkins/workspace/zephyr-merge/tests/kernel/test_pipe_priv' -------------------sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test/run.log-------------------

nashif commented 8 years ago

by Javier B Perez:

qemu_nios2 tests/kernel/test_sema/microkernel/test FAILED: build_error ---------------sanity-out/qemu_nios2/tests/kernel/test_sema/microkernel/test/run.log---------------- make[1]: Entering directory /jenkins/workspace/zephyr-merge/tests/kernel/test_sema/microkernel' make[2]: Entering directory/jenkins/workspace/zephyr-merge' make[3]: Entering directory /jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_sema/microkernel/test' Using /jenkins/workspace/zephyr-merge as source for kernel GEN ./Makefile CHK include/generated/version.h CHK misc/generated/configs.c CHK include/generated/offsets.h CHK misc/generated/sysgen/prj.mdef [QEMU] CPU: nios2 QEMU 2.1.3 monitor - type 'help' for more information (qemu) Bad ram pointer (nil) make[3]: *** [qemu] Aborted (core dumped) make[3]: Leaving directory/jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_sema/microkernel/test' make[2]: [sub-make] Error 2 make[2]: Target qemu' not remade because of errors. make[2]: Leaving directory/jenkins/workspace/zephyr-merge' make[1]: [qemu] Error 2 make[1]: Leaving directory `/jenkins/workspace/zephyr-merge/tests/kernel/test_sema/microkernel' ---------------sanity-out/qemu_nios2/tests/kernel/test_sema/microkernel/test/run.log----------------

nashif commented 8 years ago

by Javier B Perez:

qemu_nios2 tests/kernel/test_events/test FAILED: build_error --------------------sanity-out/qemu_nios2/tests/kernel/test_events/test/run.log--------------------- make[1]: Entering directory /jenkins/workspace/zephyr-merge/tests/kernel/test_events' make[2]: Entering directory/jenkins/workspace/zephyr-merge' make[3]: Entering directory /jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_events/test' Using /jenkins/workspace/zephyr-merge as source for kernel GEN ./Makefile CHK include/generated/version.h CHK misc/generated/configs.c CHK include/generated/offsets.h CHK misc/generated/sysgen/prj.mdef [QEMU] CPU: nios2 QEMU 2.1.3 monitor - type 'help' for more information (qemu) Bad ram pointer (nil) make[3]: *** [qemu] Aborted (core dumped) make[3]: Leaving directory/jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_events/test' make[2]: [sub-make] Error 2 make[2]: Target qemu' not remade because of errors. make[2]: Leaving directory/jenkins/workspace/zephyr-merge' make[1]: [qemu] Error 2 make[1]: Leaving directory `/jenkins/workspace/zephyr-merge/tests/kernel/test_events' --------------------sanity-out/qemu_nios2/tests/kernel/test_events/test/run.log---------------------

nashif commented 8 years ago

by Javier B Perez:

qemu_nios2 tests/kernel/test_task_priv/test FAILED: build_error -------------------sanity-out/qemu_nios2/tests/kernel/test_task_priv/test/run.log------------------- make[1]: Entering directory /jenkins/workspace/zephyr-merge/tests/kernel/test_task_priv' make[2]: Entering directory/jenkins/workspace/zephyr-merge' make[3]: Entering directory /jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_task_priv/test' Using /jenkins/workspace/zephyr-merge as source for kernel GEN ./Makefile CHK include/generated/version.h CHK misc/generated/configs.c CHK include/generated/offsets.h CHK misc/generated/sysgen/prj.mdef [QEMU] CPU: nios2 QEMU 2.1.3 monitor - type 'help' for more information (qemu) Bad ram pointer (nil) make[3]: *** [qemu] Aborted (core dumped) make[3]: Leaving directory/jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_task_priv/test' make[2]: [sub-make] Error 2 make[2]: Target qemu' not remade because of errors. make[2]: Leaving directory/jenkins/workspace/zephyr-merge' make[1]: [qemu] Error 2 make[1]: Leaving directory `/jenkins/workspace/zephyr-merge/tests/kernel/test_task_priv' -------------------sanity-out/qemu_nios2/tests/kernel/test_task_priv/test/run.log-------------------

nashif commented 8 years ago

by Andrew Boie:

Also seen in test_pipe_priv

-------------------sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test/run.log-------------------
make[1]: Entering directory `/jenkins/workspace/zephyr-merge/tests/kernel/test_pipe_priv'
make[2]: Entering directory `/jenkins/workspace/zephyr-merge'
make[3]: Entering directory `/jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test'
  Using /jenkins/workspace/zephyr-merge as source for kernel
  GEN     ./Makefile
  CHK     include/generated/version.h
  CHK     misc/generated/configs.c
  CHK     include/generated/offsets.h
  CHK     misc/generated/sysgen/prj.mdef
[QEMU] CPU: nios2
QEMU 2.1.3 monitor - type 'help' for more information
(qemu) Bad ram pointer (nil)
make[3]: *** [qemu] Aborted (core dumped)
make[3]: Leaving directory `/jenkins/workspace/zephyr-merge/sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test'
make[2]: *** [sub-make] Error 2
make[2]: Target `qemu' not remade because of errors.
make[2]: Leaving directory `/jenkins/workspace/zephyr-merge'
make[1]: *** [qemu] Error 2
make[1]: Leaving directory `/jenkins/workspace/zephyr-merge/tests/kernel/test_pipe_priv'
-------------------sanity-out/qemu_nios2/tests/kernel/test_pipe_priv/test/run.log-------------------
nashif commented 8 years ago

by Andrew Boie:

The "(qemu) Bad ram pointer (nil)" situation can be reproduced by trying to use a NULL function pointer, such as follows:

$ git diff
diff --git a/samples/hello_world/nanokernel/src/main.c b/samples/hello_world/nanokernel/src/main.c
index 3cdd94f..8e83394 100644
--- a/samples/hello_world/nanokernel/src/main.c
+++ b/samples/hello_world/nanokernel/src/main.c
@@ -18,6 +18,9 @@
 #define SYS_LOG_LEVEL SYS_LOG_LEVEL_INFO
 #include <misc/sys_log.h>

+
+void (*func)(int) = NULL;
+
 /*
  * @file
  * @brief Hello World demo
@@ -28,5 +31,7 @@
 void main(void)
 {
        SYS_LOG_INF("Hello World! %s", CONFIG_ARCH);
+
+       func(32);
 }

Since QEMU aborts, you can't catch this scenario in a debugger, the remote connection gets closed. I do believe I have seen this issue on real hardware. It is also difficult to debug, what happens is that control transfers to the NULL pointer, which also happens to be the reset vector, and the device effectively reboots.

nashif commented 8 years ago

by Andrew Boie:

I have a patch which attempts to print the faulting instructions when this happens.

From 24d4321a5efd45e438b62b9dfa3b98d934fe6f59 Mon Sep 17 00:00:00 2001
From: Andrew Boie <andrew.p.boie@intel.com>
Date: Thu, 11 Aug 2016 14:11:51 -0700
Subject: [PATCH] nios2: dump SP, RA, PC upon bad function pointer deref

We've found that QEMU aborts in qemu_ram_addr_from_host_nofail()
if the CPU does a 'callr' instruction to a bad memory address, such
as NULL. Since the emulator crashes this makes finding the cause
difficult.

Now we dump out a few interesting registers, in particular RA should
contain the address of the bad 'callr' + 4.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
---
 cputlb.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index afd3705..3c90a9e 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -145,12 +145,17 @@ void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry, uintptr_t start,
     }
 }

-static inline ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
+static inline ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr,
+                           CPUArchState *env1)
 {
     ram_addr_t ram_addr;

     if (qemu_ram_addr_from_host(ptr, &ram_addr) == NULL) {
         fprintf(stderr, "Bad ram pointer %p\n", ptr);
+#if defined(TARGET_NIOS2)
+   fprintf(stderr, "SP: 0x%08x RA: 0x%08x PC: 0x%08x\n", env1->regs[R_SP],
+       env1->regs[R_RA], env1->regs[R_PC]);
+#endif
         abort();
     }
     return ram_addr;
@@ -329,7 +334,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
         }
     }
     p = (void *)((uintptr_t)addr + env1->tlb_table[mmu_idx][page_index].addend);
-    return qemu_ram_addr_from_host_nofail(p);
+    return qemu_ram_addr_from_host_nofail(p, env1);
 }

 #define MMUSUFFIX _mmu
-- 
2.7.4

This works for my test application where I intentionally use a NULL function pointer. Now I'm running sanitycheck in a loop to try to get the failure to reproduce, I should be able to tell where in the codebase it originates once it fails.

nashif commented 8 years ago

by Andrew Boie:

After running sanitycheck in a loop for 24 hours on my workstation with the following script I still cannot reproduce this bug. Juro took my patch which dumps regs when QEMU aborts, so I think we'll have to wait until the next SDK goes live on the CI servers and look at the errors then.

set -e

while true
do
    sanitycheck --arch=nios2 -i
done
nashif commented 8 years ago

by Mark Linkmeyer:

Clearing the Fix Version/s field and lowering the priority to Medium per review in today's Execution Checkpoint meeting.

nashif commented 8 years ago

by Javier B Perez:

Andrew Boie I identified more random issues:

https://logs.zephyrproject.org/production/vex-yul-zep-jenkins-1/zephyr-verify/ Builds: 10879, 10884, 10892, 10932, 10940, 10943, 10944, 10958, 10963, 10975, 10979, 10984, 11006, 11015, 11017, 11018, 11029, 11030, 11043, 11056, 11057, 11058, 11062, 11080, 11089, 11091, 11096, 11098, 11100, 11101, 11113, 11122

make[1]: Entering directory `/jenkins/workspace/zephyr-verify/tests/legacy/kernel/test_sleep'
make[2]: Entering directory `/jenkins/workspace/zephyr-verify'
make[3]: Entering directory `/jenkins/workspace/zephyr-verify/sanity-out/qemu_nios2/tests/legacy/kernel/test_sleep/test'
  Using /jenkins/workspace/zephyr-verify as source for kernel
  GEN     ./Makefile
  CHK     include/generated/version.h
  CHK     misc/generated/configs.c
  CHK     include/generated/offsets.h
  CHK     misc/generated/sysgen/prj.mdef
  AR      libzephyr.a
  LINK    zephyr.lnk
  HEX     zephyr.hex
  BIN     zephyr.bin
[QEMU] CPU: nios2
QEMU 2.1.3 monitor - type 'help' for more information
(qemu) Bad ram pointer 0x20
make[3]: *** [qemu] Aborted (core dumped)
make[3]: Leaving directory `/jenkins/workspace/zephyr-verify/sanity-out/qemu_nios2/tests/legacy/kernel/test_sleep/test'
make[2]: *** [sub-make] Error 2
make[2]: Target `qemu' not remade because of errors.
make[2]: Leaving directory `/jenkins/workspace/zephyr-verify'
make[1]: *** [qemu] Error 2
make[1]: Leaving directory `/jenkins/workspace/zephyr-verify/tests/legacy/kernel/test_sleep'

Builds: 10886, 10887, 11028, 11098, 11124, 11004, 11028, 11001

make[3]: Entering directory `/jenkins/workspace/zephyr-verify/sanity-out/qemu_nios2/tests/legacy/kernel/test_fifo_priv/test'
  Using /jenkins/workspace/zephyr-verify as source for kernel
  GEN     ./Makefile
  CHK     include/generated/version.h
  CHK     misc/generated/configs.c
  CHK     include/generated/offsets.h
  CHK     misc/generated/sysgen/prj.mdef
  AR      libzephyr.a
  LINK    zephyr.lnk
  HEX     zephyr.hex
  BIN     zephyr.bin
[QEMU] CPU: nios2
QEMU 2.1.3 monitor - type 'help' for more information
(qemu) Bad ram pointer (nil)
make[3]: *** [qemu] Aborted (core dumped)

Tests: tests/legacy/kernel/test_sleep/ tests/legacy/kernel/test_fifo_priv/ tests/legacy/kernel/test_sema/microkernel tests/legacy/kernel/test_pool/test_nios2 tests/legacy/kernel/test_fifo/microkernel

nashif commented 7 years ago

by Mark Linkmeyer:

Fixing incorrect priority