zephyriot / zep-jira14

tests/ztest/test/base/testcase.ini#test_verbose_2 FAILS on EMSK #1195

Closed nashif closed 7 years ago

nashif commented 7 years ago

Reported by Inaky Perez-Gonzalez:

(Platform EMSK not available on Ref Platform list)

When running testcases on EMSK, the output of this one is always broken up; I was able to capture:


***********************************

**       SynopsysRunning test suite framework_tests

other times it was:


***********************************Running test suite framework_tests
Exception vector: 0x00000008, cause code: 0x00000000, parameter 0x00000033
Address

***********************************Running test suite framework_tests
Exception vector: 0x00000008, cause code: 0x00000000, parameter 0x00000033
Address 0x80002196
Fatal fault in essential thread! Spinning...

v1.6.0-branch $ git checkout 8f0b4d7f4d045686e897d1c762d2b1b382b4e36c

(Imported from Jira ZEP-1317)

nashif commented 7 years ago

by Sharron LIU:

ztest goes to Inaky per ZEP-1227.

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

This is an EMSK specific failure, needs to be triaged.

nashif commented 7 years ago

by Chuck Jordan:

Vector 8 is "instruction error". I wasn't able to reproduce this, but one possibility is that the test case was built for one SOC while the DIP switches were set to a different SOC. You need to make sure that, if building for EM7D, EM9D, or EM11D, the DIP switches are set correctly. Could this be the cause?

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

They are set correctly.

Note we run many TCs on these boards; if that were the case, shouldn't they all fail the same way?

nashif commented 7 years ago

by Chuck Jordan:

Can you provide WHICH SOC you were using, and what the dip switches were set to? Also, it would be useful to see the actual "build" output if possible. Could you attach that?

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

This is the same as GH-1220 in those terms:

{quote} Confirming that this is happening with DIP switch 1 down and CONFIG_SOC_EM9D=y (as it is the default config).

I reaffirm: these tests (and others I report off EMSK) happen randomly; we run the whole test suite on them and randomly one or another fails, sometimes not being able to dump the registers. The HW is not touched or altered in between runs (other than power cycling).

All is set to DIP switch 1 and the defconfig has EM9D selected. If this was not the case, all the TCs would be failing. {quote}

The build is impossible to recover, as Jenkins ran it and then purged it after collecting the results -- I guess we should modify it so it saves the build output of failed TCs. The system built it with SDK 0.8.2, revision 8f0b4d7f4d045686e897d1c762d2b1b382b4e36c, default config options. That should get you the same thing.

But again, I observe these failures randomly: sometimes the TCs fail, sometimes they do not, across different revisions.

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

I am seeing some more like this on other TCs, so I am just going to paste them here instead of opening a new one; we should probably consider this a single bug (GH-1220, GH-1218, and this one).

$ git checkout f38cbb57446e675a0df5b1419df30b6fcdfad8f5

tests/legacy/kernel/test_task/testcase.ini#test @ jfsotc03/emsk-25:arc evaluation failed type:emsk9d

output: ***********************************
output:
output: **    tc_start() - Test Microkernel TaskException vector: 0x0000000d, cause code: 0x00000000, parameter 0x00000000
output: Address 0x00000009
output: Fatal fault in thread! Aborting.
output: Exception vector: 0x00000008, cause code: 0x00000000, parameter 0x0000002e
output: Address 0x800033c0
output: Fatal fault in essential thread! Spinning...

TCF: tests/legacy/kernel/test_mutex/testcase.ini#test @ jfsotc03/emsk-25:arc evaluation failed type:emsk9d

output:
output:
output: ***********************************
output:
output: **       Synopsys, Inc.          **
output:
output: **     ARC EM Starter kit        **
output:
output: **                               **
output:
output: ** Comprehensive software stacks **
output:
output: **   available from embARC.org   **
output:
output: **                               **
output:
output: ***********************************
output:
output: Firmware   Jan 12 2016, v2.2
output:
output: Bootloader Dec 29 2015, v1.1
output:
output: tc_start() - Test Microkernel MuException vector: 0x00000002, cause code: 0x00000000, parameter 0x00000000
output: Address 0x80001340
output: Fatal fault in thread! Aborting.
output: Exception vector: 0x00000002, cause code: 0x00000000, parameter 0x00000000
output: Address 0x80001024
output: Fatal fault in essential thread! Spinning...

Another three failed in this build (test_sema_priv, test_map_priv, test_context), but no regdump was captured.
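Because the fault report is often interleaved with test output on the same line (as in the dumps above), a test harness cannot rely on it starting a line. A minimal sketch of pulling the fault fields out of a captured serial log follows; the names `Fault` and `find_faults` are hypothetical and are not part of the test system discussed in this thread. No meaning is assigned to the vector numbers here.

```python
import re
from typing import List, NamedTuple


class Fault(NamedTuple):
    """One fault report as printed on the serial console (hypothetical type)."""
    vector: int
    cause: int
    parameter: int


# Unanchored on purpose: the report may begin mid-line, right after test output.
FAULT_RE = re.compile(
    r"Exception vector: (0x[0-9a-fA-F]+), "
    r"cause code: (0x[0-9a-fA-F]+), "
    r"parameter (0x[0-9a-fA-F]+)"
)


def find_faults(serial_output: str) -> List[Fault]:
    """Return every fault report found in the log, in order of appearance."""
    return [
        Fault(*(int(group, 16) for group in match.groups()))
        for match in FAULT_RE.finditer(serial_output)
    ]


if __name__ == "__main__":
    sample = (
        "output: **    tc_start() - Test Microkernel TaskException vector: "
        "0x0000000d, cause code: 0x00000000, parameter 0x00000000\n"
        "output: Address 0x00000009\n"
        "output: Exception vector: 0x00000008, cause code: 0x00000000, "
        "parameter 0x0000002e\n"
    )
    for fault in find_faults(sample):
        print(hex(fault.vector), hex(fault.cause), hex(fault.parameter))
```

Grouping failures by vector number across runs could help confirm whether GH-1220, GH-1218, and this issue share one cause.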

nashif commented 7 years ago

by Chuck Jordan:

I tried to reproduce this with EM9D several more times today. It always passes. However, if I boot up EM7D and run the 9D code on it, yes, I get the Exception vector: 0x00000002 error. In the above output, it's too bad we can't see the text line directly under the "Bootloader Dec 29 2015..." line, because that line tells you which SOC was booted up.

I find that on my board, even though DIP switch 1 is down, sometimes it boots up EM7D by mistake. You can see this occur in the banner that is printed here. Can we modify your test scripts to print that one extra line of banner output so we can see what SOC was booted? It will help us in the future while investigating these issues.

Also, btw, if I run the EM9D image on an EM11D, it seems to work. That is because the ICCM and DCCM memories start at the same address, even though they are smaller on EM11D. Also, EM9D and EM11D have the same sort of interrupt handling. I think the memerr on EM7D occurs because it has DIFFERENT FIRQ handling. So running EM9D code on EM7D will fail.
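The banner check requested above could be sketched as follows. This assumes only what the comment states: that the line directly under the "Bootloader ..." banner line identifies the booted SOC. The exact text of that line is not shown in this thread, so `expected_marker` is a placeholder the caller must supply; both function names are hypothetical.

```python
from typing import Optional


def line_after_bootloader(serial_output: str,
                          prefix: str = "output:") -> Optional[str]:
    """Return the first non-empty line after the 'Bootloader ...' banner line.

    `prefix` is the per-line tag used in the dumps in this thread; it is
    stripped before matching. Returns None if no such line is found.
    """
    seen_bootloader = False
    for raw in serial_output.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            line = line[len(prefix):].strip()
        if not line:
            continue
        if seen_bootloader:
            return line
        if line.startswith("Bootloader"):
            seen_bootloader = True
    return None


def booted_expected_soc(serial_output: str, expected_marker: str) -> bool:
    """True if the line after the bootloader banner contains expected_marker.

    expected_marker is an assumption: the real banner wording is not given
    in this thread and must be taken from an actual board.
    """
    line = line_after_bootloader(serial_output)
    return line is not None and expected_marker in line
```

A harness could run this check before evaluating test results and report a mismatch as a board setup problem rather than a test case failure.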

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

? You can scroll the box down, it says:

...
output: Bootloader Dec 29 2015, v1.1
output:
output: tc_start() - Test Microkernel MuException vector: 0x00000002, cause code: 0x00000000, parameter 0x00000000
output: Address 0x80001340
...

The system captures everything that comes off the serial port; this is a verbatim dump of it.

Talk to me more about that 'booting by mistake' -- what other ways are there to determine which mode it actually booted in? Can we do some kind of runtime assert?

nashif commented 7 years ago

by Chuck Jordan:

As discussed during the Zephyr Summit, it might be best to switch to the CONFIG_SOC_EM7D=y choice and not use EM9D. The problem is likely a mechanical DIP-switch issue where, even though bit 1 is in the down position, the contact is not being established, so it boots up EM7D instead after a power cycle. With EM7D and all DIP switches UP, no contact is needed, so it should succeed. In the master branch, I could switch to EM7D as the default. We were also trying to figure out how to pass arguments on the command line to make it select EM7D at build time. If that isn't easy, I think the conf file might need to do it.

nashif commented 7 years ago

by Chuck Jordan:

I've switched to em7d as the default, all dip switches up, in the master branch. See Change 8962. Do you need this for 1.6 too?

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

I am thinking we might not be doing the right thing here.

Shouldn't each be a different board? boards/arc/em_starter_kit{7,9,10,11}? I do not understand why this is a CONFIG option.

nashif commented 7 years ago

by Chuck Jordan:

There is a single board that can boot up one of 3 different SOCs under DIP-switch control. So one board, but 3 different SOC choices. Unlike other boards that have the SOC in ASIC form, the em_starterkit board has a solution where the SOC itself can be loaded at boot time. The SOC, therefore, can also be upgraded with a new hardware design in this way. We are currently at version 2.2 of these FPGA images. We are discussing version 3.0, which will come out some time in 2017. No board change; all the changes are within the SOC.

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

I understand -- conceptually then, from the Zephyr standpoint, it could be seen as three different boards, as they sport each a different SOC (even if physically it is the same piece of HW).

If we meld in the device tree discussion from the other day, it is also reasonable to say that each SOC might have slightly different DT information.

I think conceptually we should treat them as different boards from the Zephyr standpoint -- this would greatly simplify the configuration process.

nashif commented 7 years ago

by Chuck Jordan:

Zephyr is organized to have a "board" directory with material related to the board. Under "arch/arc", you have CPU-related STUFF, and there is an "soc" subdir here for each UNIQUE SOC. Really this isn't exactly correct, in that an SOC contains one or more CPUs, so probably the CPU shouldn't have SOC subdirs. But given this structure, these SOC subdirs are the places to express the configuration that is special about the CPU, including any closely coupled memories it might have, or other CPU-related features.

In terms of OBJECT-ORIENTATION with classes, a board is a class that can contain SOCs. SOCs are a class that can contain all sorts of things: controllers, special custom IP, I/O interfaces, CPUs, memories, etc. The ARC CPUs have families, but within a given family there is a huge amount of configurability. So within an SOC, if you instantiate ONE CPU, you want to know what features that one CPU has in order to know what switches to pass to the compiler.

Each user of Zephyr and ARC might have a custom ARC CPU. As it stands now, they would have to HACK things in these arch/arc dirs if there are differences, but they wouldn't have to touch the board at all. Same board. The SOC is a loadable object, just as code is loadable, since an FPGA is used. So users could load a NEW SOC that they have customized. There are other boards like this that have FPGAs instead of ASICs.
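The containment described above (a board contains SOCs, an SOC contains CPUs, memories, and so on) can be illustrated with a small sketch. This is only a model for discussion, not Zephyr code; the class names, the `loadable_socs` field, and the `cpu0` placeholder are all hypothetical. CPU feature lists are left empty because the thread does not state them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cpu:
    name: str
    # Features would drive compiler switches; not stated in the thread.
    features: List[str] = field(default_factory=list)


@dataclass
class Soc:
    name: str
    cpus: List[Cpu] = field(default_factory=list)
    memories: List[str] = field(default_factory=list)


@dataclass
class Board:
    name: str
    # One physical board, several SOC images selectable at boot.
    loadable_socs: List[Soc] = field(default_factory=list)

    def soc(self, name: str) -> Soc:
        """Look up one of the board's loadable SOCs by name."""
        for candidate in self.loadable_socs:
            if candidate.name == name:
                return candidate
        raise KeyError(name)


# The thread mentions ICCM/DCCM only for EM9D and EM11D, so EM7D's
# memories are left unspecified here.
emsk = Board(
    name="em_starterkit",
    loadable_socs=[
        Soc("EM7D", cpus=[Cpu("cpu0")]),
        Soc("EM9D", cpus=[Cpu("cpu0")], memories=["ICCM", "DCCM"]),
        Soc("EM11D", cpus=[Cpu("cpu0")], memories=["ICCM", "DCCM"]),
    ],
)
```

In this model the board object never changes when a new SOC image is loaded; only the list of SOCs does, which matches the "same board, loadable SOC" point above.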

nashif commented 7 years ago

by Chuck Jordan:

Re: device tree discussion. We talked about how #include is supported in the device tree. So, for example, you could have a device tree for the base board, which has knowledge of what is ON that board. The board can include one or more SOC device trees, depending upon how many are on the board. Each SOC could have its own device tree. Further, if there are attachable things on the board, like connectors, daughter cards, or shields, these might be places to do the #include too. You could imagine a shield coming with its own device tree that the board simply includes. Same with connectors. Suppose it's an SPI device connector. You don't know what a user will do on that SPI. They might put an OLED display on it. If they do, they could include the OLED display device tree. So in this way, instead of just ONE FLAT device tree that the user has to edit, you can build up a tree structure of different files using #include, and these can be in different parts of the Zephyr tree.

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

Hello Chuck

Since we moved the boards to EMSK7d, I am having issues loading firmware on it.

If I load with GDB, things work as expected.

However, using OpenOCD, as I had been doing until we moved the default (from emsk9d), it fails consistently.

With openocd, I would use the sequence:

reset halt
load_image zephyr.elf
20440 bytes written at address 0x10000000
downloaded 20440 bytes in 0.047536s (419.912 KiB/s)
resume

and it would work, but not anymore. Is there anything else that is needed in emsk7d?

nashif commented 7 years ago

by Chuck Jordan:

I have to do:

arc-elf32-gdb \
    -ex "target remote :3333" \
    -ex "load" \
    -ex "set remotetimeout 10000" \
    outdir/em_starterkit/zephyr.elf

A few months back the elf was directly under outdir. Someone changed the makefiles to have the board name be a subdir of outdir. That was the only difference that happened a month or so ago. The openocd side looks like this:

openocd -c 'gdb_port 3333' -s $ARCGNU_IDE/share/openocd/scripts -f board/snps_em_sk_v2.2.cfg

where ARCGNU_IDE is an environment variable set to /usr/local/arc/arc_gnu_2016.03_ide_linux_install
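For an automated loader, the two command lines above could be assembled as argument lists and handed to `subprocess`. This is a sketch under the assumption that the harness is written in Python; the function names are hypothetical, and the flags are exactly those quoted in this comment.

```python
import posixpath
from typing import List


def openocd_command(arcgnu_ide: str, port: int = 3333) -> List[str]:
    """Argument list for the openocd server, as quoted in the comment above."""
    return [
        "openocd",
        "-c", f"gdb_port {port}",
        "-s", posixpath.join(arcgnu_ide, "share", "openocd", "scripts"),
        "-f", "board/snps_em_sk_v2.2.cfg",
    ]


def gdb_load_command(elf_path: str, port: int = 3333,
                     remote_timeout: int = 10000) -> List[str]:
    """Argument list for loading an ELF through GDB over the openocd port."""
    return [
        "arc-elf32-gdb",
        "-ex", f"target remote :{port}",
        "-ex", "load",
        "-ex", f"set remotetimeout {remote_timeout}",
        elf_path,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; a harness would pass
    # these lists to subprocess.Popen / subprocess.run.
    print(openocd_command("/usr/local/arc/arc_gnu_2016.03_ide_linux_install"))
    print(gdb_load_command("outdir/em_starterkit/zephyr.elf"))
```

Passing lists rather than a shell string avoids quoting problems with the `-ex` arguments, which contain spaces.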

nashif commented 7 years ago

by Chuck Jordan:

btw, tried this test case and it seems to work with em7d.

nashif commented 7 years ago

by Anas Nashif:

Inaky Perez-Gonzalez, is this still an issue?

nashif commented 7 years ago

by Inaky Perez-Gonzalez:

I can't verify it -- part of the solution involved moving the HW to always operate in EM7D mode and since then, the automated method we used to load images onto it doesn't work anymore.

Until we have time to dedicate to that, they have been removed from the automated testing pool.

I am ok with closing this, as Chuck confirmed it was fixed -- once we are able to re-add the HW to the pool, if the failures come up again, we can revisit it.