Overhaul and improve test states in twister and document them

nashif commented 1 year ago

Introduction

Statuses of tests in twister have been maintained in free form and we never had a list of defined statuses with clear description of what they mean. This RFC does provide some details about those are being used and some changes to statuses and reporting in general.

Problem description

There are many cases where a test does not pass or is not run for various reasons, we should have a clear end status for each test to help identify real issues and misses in the test execution. First, this calls for a pre-defined set of status with clear description of what they mean, then we need to tweak the reporting to use those the pre-defined statuses and verify consistency across all tests.

We also have some inconsistency rooted in the fact we wanted to keep things backward compatible with original reporting structure, so we end up reporting status for both a testsuite and the included testscases, which can lead to confusion and different ways of counting.

One more issue to be resolved is how we deal with filtered tests and how do we report them. Filtered tests are now being included in reports and during test execution (reported as skipped). For any target we have, the amount of filtered tests is large for good reasons and including filtered tests in console and final reports give the wrong impression about why this large number of tests being skipped. We want to only report tests that are planned and provide a different way for viewing filter status and filter reports and avoid confusion between a test that was filtered and a test that was skipped as part of the execution.

Proposed change

Introduce and use Status class
Define all applicable statuses
In some cases, change what status we use depending on the context

Detailed RFC

The following states will be defined:

class Status:
    PASSED = 'passed'
    FAILED = 'failed'
    ERROR = 'error'
    SKIPPED = 'skipped'
    NOTRUN = 'not run'
    BLOCKED = 'blocked'
    INPROGRESS = 'in progress'

passed: A test is deemed to pass if its actual result matches its expected result.

failed: A test is deemed to fail if its actual result does not match its expected result.

error: A test produces the Error result when there is a problem in running the test itself.

skipped: A step that was skipped during a test script execution.

blocked: A test case that is scheduled to run but cannot run because of an error or a crash in the suite.

not run: A test case or test suite that is not yet run. This is the initial state of any testcase.

in progress: A test case was started and it is currently executing.

Using the states above, the following changes are proposed (The list is no way complete, use this RFC to add more):

A testsuite that is only being built and can't run will be reported as NOTRUN, right now we report those as SKIPPED
A testsuite that fails to build, whether due to a build error, or do to a linking error (overflow) will be reported as ERROR, right now those are being SKIPPED
- when a test overflows, it needs to be fixed or the target will need to be filtered out. we should not have 100s of test build and fail to link without any resolution of the issue. This could be hiding potential problems and is an overhead when running tests that we can't afford anymore.
SKIPPED will not be used for filtered tests anymore (see above). filtered tests will not be part of default reporting, to get a filter report, we will use existing or add new options and general special reports to analyze filtering. SKIPPED will only be used for intentionally skipped tests
We use a unified metric s for counting PASS rates and this will be based on testcase execution rather than testsuite/scenario or test applications

nashif commented 1 year ago

@hakehuang @PerMac @SzymonRichert @golowanow

nashif commented 1 year ago

prototyping this right now, test summary before:

./scripts/twister -p native_posix -T tests/kernel/threads/ -v
ZEPHYR_BASE unset, using "/home/nashif/zephyrproject/zephyr"
Renaming output directory to /home/nashif/zephyrproject/zephyr/twister-out.9
INFO    - Using Ninja..
INFO    - Zephyr version: zephyr-v3.3.0-4093-g5781478fd5c4
INFO    - Using 'zephyr' toolchain.
INFO    - Building initial testsuite list...
INFO    - Writing JSON report /home/nashif/zephyrproject/zephyr/twister-out/testplan.json
INFO    - JOBS: 112
INFO    - Adding tasks to the queue...
INFO    - Added initial list of jobs to queue
INFO    -  3/10 native_posix              tests/kernel/threads/tls/kernel.threads.tls        SKIPPED (runtime filter)
INFO    -  4/10 native_posix              tests/kernel/threads/thread_error_case/kernel.threads.error.case SKIPPED (runtime filter)
INFO    -  5/10 native_posix              tests/kernel/threads/dynamic_thread/kernel.threads.dynamic SKIPPED (runtime filter)
INFO    -  6/10 native_posix              tests/kernel/threads/thread_apis/kernel.threads.apis.pinonly SKIPPED (runtime filter)
INFO    -  7/10 native_posix              tests/kernel/threads/tls/kernel.threads.tls.userspace SKIPPED (runtime filter)
INFO    -  8/10 native_posix              tests/kernel/threads/thread_init/kernel.threads.init PASSED (native 0.015s)
INFO    -  9/10 native_posix              tests/kernel/threads/thread_stack/kernel.threads.thread_stack PASSED (native 0.047s)
INFO    - 10/10 native_posix              tests/kernel/threads/thread_apis/kernel.threads.apis PASSED (native 0.081s)

INFO    - 10 test scenarios (10 test instances) selected, 7 configurations skipped (2 by static filter, 5 at runtime).
INFO    - 3 of 10 test configurations passed (100.00%), 0 failed, 0 errored, 7 skipped with 0 warnings in 7.13 seconds
INFO    - In total 45 test cases were executed, 38 skipped on 1 out of total 571 platforms (0.18%)
INFO    - 3 test configurations executed on platforms, 0 test configurations were only built.
INFO    - Saving reports...
INFO    - Writing JSON report /home/nashif/zephyrproject/zephyr/twister-out/twister.json
INFO    - Writing xunit report /home/nashif/zephyrproject/zephyr/twister-out/twister.xml...
INFO    - Writing xunit report /home/nashif/zephyrproject/zephyr/twister-out/twister_report.xml...
INFO    - Run completed

after:

./scripts/twister -p native_posix -T tests/kernel/threads/ -v
ZEPHYR_BASE unset, using "/home/nashif/zephyrproject/zephyr"
Renaming output directory to /home/nashif/zephyrproject/zephyr/twister-out.10
INFO    - Using Ninja..
INFO    - Zephyr version: zephyr-v3.3.0-4083-gaa1b892e7999
INFO    - Using 'zephyr' toolchain.
INFO    - Building initial testsuite list...
INFO    - Writing JSON report /home/nashif/zephyrproject/zephyr/twister-out/testplan.json
INFO    - JOBS: 112
INFO    - Adding tasks to the queue...
INFO    - Added initial list of jobs to queue
INFO    - 1/8 native_posix              tests/kernel/threads/dynamic_thread/kernel.threads.dynamic FILTERED (runtime filter)
INFO    - 2/8 native_posix              tests/kernel/threads/tls/kernel.threads.tls.userspace FILTERED (runtime filter)
INFO    - 3/8 native_posix              tests/kernel/threads/thread_apis/kernel.threads.apis.pinonly FILTERED (runtime filter)
INFO    - 4/8 native_posix              tests/kernel/threads/thread_error_case/kernel.threads.error.case FILTERED (runtime filter)
INFO    - 5/8 native_posix              tests/kernel/threads/tls/kernel.threads.tls        FILTERED (runtime filter)
INFO    - 6/8 native_posix              tests/kernel/threads/thread_init/kernel.threads.init PASSED (native 0.016s)
INFO    - 7/8 native_posix              tests/kernel/threads/thread_stack/kernel.threads.thread_stack PASSED (native 0.047s)
INFO    - 8/8 native_posix              tests/kernel/threads/thread_apis/kernel.threads.apis PASSED (native 0.085s)

INFO    - 10 test scenarios (10 test instances) selected, 7 configurations filtered (2 by static filter, 5 at runtime).
INFO    - 3 of 3 test configurations passed (100.00%), 0 failed, 0 errored, 0 warnings in 7.18 seconds
INFO    - In total 45 test cases were executed, 38 skipped on 1 out of total 570 platforms (0.18%)
INFO    - 3 test configurations executed on platforms, 0 test configurations were only built.
INFO    - Saving reports...
INFO    - Writing JSON report /home/nashif/zephyrproject/zephyr/twister-out/twister.json
INFO    - Writing xunit report /home/nashif/zephyrproject/zephyr/twister-out/twister.xml...
INFO    - Writing xunit report /home/nashif/zephyrproject/zephyr/twister-out/twister_report.xml...
INFO    - Run completed

Summary:

There is no confusion between skipped and filtered
3 of 3 test configurations passed (100.00%) instead of 3 of 10 test configurations passed (100.00%) is more clear and intuitive.
filtered tests are reported at such on the screen...

Still need to have a final summary based on testcases rather than suites (which will have also skipped reporting and other things based on above new states).

hakehuang commented 1 year ago

in above prototype, the SKIPED status is replaced by FILTERED, correct? @nashif .

Besides for the INPROGRESS = 'in progress' may be chagned to INPROGRESSING, so that the status is defined as test final mode end with ED test abnormal mode end with a noun, which is only ERROR by now test proceeing status is end with ING

nashif commented 1 year ago

Besides for the INPROGRESS = 'in progress' may be chagned to INPROGRESSING, so that the status is defined as test final mode end with ED

in progress is an intermediate state, any tests ending up with this state should have a final status, in this case we usedto do 'FAILED', but one possible outcome can also be an INCOMPLETE state.

golowanow commented 1 year ago

The following states will be defined:

class Status:
    PASSED = 'passed'
    FAILED = 'failed'
    ERROR = 'error'
    SKIPPED = 'skipped'
    NOTRUN = 'not run'
    BLOCKED = 'blocked'
    INPROGRESS = 'in progress'

I think, to avoid confusion, there should be different classes for what is a Status at execution time, and what is a final State to report. In this way a Test Suite will have its (State, Status) tuple, e.g. the 'only built' suite might have (Passed, Built) or (Error, Building).

Also it would be nice to clarify how Status of a Test Suite corresponds to multiple statuses of its Test Cases.

nashif commented 1 year ago

another update of summary:

INFO    - 2 test scenarios (84 test instances) selected, 24 configurations filtered (12 by static filter, 12 at runtime).
INFO    - 56 of 60 test configurations passed (93.33%), 4 failed, 0 errored, 0 not run, 0 warnings in 198.91 seconds
INFO    - 737 of 744 test cases passed (99.06%), 4 failed, 0 not run, 3 blocked on 42 out of total 570 platforms (7.37%)
INFO    - 60 test configurations executed on platforms, 0 test configurations were only built.

nashif commented 1 year ago

Also it would be nice to clarify how Status of a Test Suite corresponds to multiple statuses of its Test Cases.

yes, this will part of the docs

I think, to avoid confusion, there should be different classes for what is a Status at execution time, and what is a final State to report. In this way a Test Suite will have its (State, Status) tuple, e.g. the 'only built' suite might have (Passed, Built) or (Error, Building).

This might be more confusing. The only state that is internal is the 'in progress' and will not appear in any reports, for everything else we need 1 single final status.

golowanow commented 1 year ago

The only state that is internal is the 'in progress' and will not appear in any reports, for everything else we need 1 single final status.

and the proposed "notrun" is, isn't it ?

Currently , for Test Suite and Test Case's statuses we have something like below possible:

|         Suite:| Filtered | Passed | Error | Failed |
| Case:   ======|==========|========|=======|========|
| Filtered      |    x     |   -    |   -   |   -    |
| Started       |    -     |   x    |   -   |   -    |
| Skipped       |    -     |   x    |   -   |   x    |
| Passed        |    -     |   x    |   -   |   x    |
| Blocked       |    -     |   x    |   x   |   x    |
| Failed        |    -     |   -    |   x   |   x    |

nashif commented 1 year ago

and the proposed "notrun" is, isn't it ?

no, this one is new, see above

A testsuite that is only being built and can't run will be reported as NOTRUN, right now we report those as SKIPPED

Currently , for Test Suite and Test Case's statuses we have something like below possible:
|         Suite:| Filtered | Passed | Error | Failed |
| Case:   ======|==========|========|=======|========|
| Filtered      |    x     |   -    |   -   |   -    |
| Started       |    -     |   x    |   -   |   -    |
| Skipped       |    -     |   x    |   -   |   x    |
| Passed        |    -     |   x    |   -   |   x    |
| Blocked       |    -     |   x    |   x   |   x    |
| Failed        |    -     |   -    |   x   |   x    |

not entirely correct:

|         Suite:| Filtered | Passed | Error | Failed |
| Case:   ======|==========|========|=======|========|
| Filtered      |    x     |   -    |   -   |   -    |
| Started       |    -     |   -    |   -   |   -    |
| Skipped       |    -     |   x    |   x   |   x    |
| Passed        |    -     |   x    |   -   |   x    |
| Blocked       |    -     |   -    |   x   |   x    |
| Failed        |    -     |   -    |   x   |   x    |

nashif commented 1 year ago

Besides for the INPROGRESS = 'in progress' may be chagned to INPROGRESSING, so that the status is defined as test final mode end with ED test abnormal mode end with a noun, which is only ERROR by now test proceeing status is end with ING

Let's avoid the linguistic part of this and not use verbs at all, so

class Status: PASS = 'pass' FAIL = 'fail' ERROR = 'error' SKIP = 'skipp' NORUN = 'no run' BLOCK = 'block' INPROGRESS = 'in progress'

golowanow commented 1 year ago

Currently , for Test Suite and Test Case's statuses we have something like below possible:

not entirely correct:

@nashif, these two combinations (Passed,Started) and (Passed,Blocked) are currently observed when a TC is either passed despite its end record lost, or failed as expected (drivers.coredump.api).

nashif commented 1 year ago

@nashif, these two combinations (Passed,Started) and (Passed,Blocked) are currently observed when a TC is either passed despite its end record lost, or failed as expected (drivers.coredump.api).

yes, could be, there is a bug somewhere, if you can point me to such cases with more details it would be great

PerMac commented 1 year ago

In general we think this is a great initiative. Some comments from our side:

It can be very useful to prepare a flow chart showing all possible statuses and possible paths of changes. We believe such visual aid would allow for a better verification if the transitions between states are happening according to the design, if they make sense, etc.
I don't understand the BLOCK status. From the description it sounds as it is replacing FILTERED. In the current code it is used instead of ERROR. If it is to replace "filtered", what about calling it "DESCOPE", i.e. removed from the scope?
I think we could aim at descriptions, which directly indicate where along the execution path we are. I think IN PROGRESS is too vague, since it would correspond to transition between any two states. Maybe it could be resolved if we keep tuples instead of single strings? Example (last_state, end_state). Such tuple indicates were was the last "checkpoint" and were it was heading .
I think the challenge is that different end product is expected for different "tests". Majority of configurations are expected to be only build in the CI. It is even more challenging, since we have "true" build-only configurations (i.e. those with build-only:true in their yamls.) and ones which are not executed due to the CI limitation (execution is not possible, e.g. no hw, some emu/simulator not installed, etc.). If we keep the "end_state" it might help into translating the internal states into states allowed in a given format (e.g. junit). If a test ended after being built it is fine if it was expected that it will only build but not fine when it was executable but the executed. -not run: A test case or test suite that is not yet run. This is the initial state of any testcase. and A testsuite that is only being built and can't run will be reported as NOTRUN, these statements describe a different state (initial vs built_but_not_executed).
I think not keeping the current "filtered" in the default report is good idea. I think a user can add a CLI flag and get a separate report where such tests will be kept. Such file can be very useful for debugging purpose. I was opting for storing filtered results so we could analyze wat happened. However, keeping them as optional separate file should clear the reports which are flooded with skips. I can imagine (since it already happened) PMs asking: "where is the result of this test?" if those were not stored anywhere.

nashif commented 1 year ago

It can be very useful to prepare a flow chart showing all possible statuses and possible paths of changes. We believe such visual aid would allow for a better verification if the transitions between states are happening according to the design, if they make sense, etc.

Does this really need a flow diagram?

notrun -> in progress -> passed/failed/error                                
notrun -> skipped
notrun -> blocked

I don't understand the BLOCK status. From the description it sounds as it is replacing FILTERED. In the current code it is used instead of ERROR. If it is to replace "filtered", what about calling it "DESCOPE", i.e. removed from the scope?

description changed.

I think we could aim at descriptions, which directly indicate where along the execution path we are. I think IN PROGRESS is too vague, since it would correspond to transition between any two states. Maybe it could be resolved if we keep tuples instead of single strings? Example (last_state, end_state). Such tuple indicates were was the last "checkpoint" and were it was heading .

description changed.

I think the challenge is that different end product is expected for different "tests". Majority of configurations are expected to be only build in the CI.

A test that only builds is a test that was not run, it tests if the code builds, it does not test the functionality. So, saying a test has passed just because we were able to build it wrong and can be confusing.

It is even more challenging, since we have "true" build-only configurations (i.e. those with build-only:true in their yamls.)

IMO those are bugs and shall be removed and fixed.

and ones which are not executed due to the CI limitation (execution is not possible, e.g. no hw, some emu/simulator not installed, etc.).

and those will need us to figure out ways how to test them in CI. For example, all tests that had keyboard harness shall be converted to use pytest and more tests with custom and placeholder harnesses need to move in the same direction. In short, a test/sample that was built, should not be reported as PASSED or FAILED. if is now marked as not run and if the build fails, it is marked as error. For CI purposes, this is perfect.

-not run: A test case or test suite that is not yet run. This is the initial state of any testcase. and A testsuite that is only being built and can't run will be reported as NOTRUN, these statements describe a different state (initial vs built_but_not_executed).

Hmm, not really. The initial state is set after the test was built, so either it runs and then passes or fails, or it remains as not run.

zephyrproject-rtos / zephyr