zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

twister: handle exceptions when running binaries #72893

Closed nashif closed 1 week ago

nashif commented 4 weeks ago

Deal with binaries not found. Also set default return code to something other than zero to catch issues and report correct status.

Fixes #72807

Signed-off-by: Anas Nashif anas.nashif@intel.com

hakehuang commented 3 weeks ago

Similar to https://github.com/zephyrproject-rtos/zephyr/pull/72892, but the fix in this PR looks more thorough

nashif commented 3 weeks ago

If I read this correctly, this solution is even less general than the one in #72892: at least that one caught exceptions in any Handler, this one is specific to the BinaryHandler.

ok, makes sense. Updated to be more general.

arbrauns commented 3 weeks ago

I still feel like there's something missing. Previously run() threw an exception, which then killed the entire runner, right? Presumably that caused a python process to exit with a nonzero exit code somewhere, right? Where did that happen and why isn't it caught?

arbrauns commented 3 weeks ago

Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed, but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in process(), it was being passed on to the calling function as an exception as it should be.

PerMac commented 3 weeks ago

Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed, but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in process(), it was being passed on to the calling function as an exception as it should be.

I am still not sure what you are referring to. With the current version, twister will exit with exitcode==1 because there were errors in the run: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/twister_main.py#L236 Is this what you mean by "fail the CI run"? We don't want to crash the run, i.e. abruptly exit on an error, since we want the run to finish so we can see all test results.

arbrauns commented 3 weeks ago

The original problem was that the runner crashed (the reason doesn't matter), half the tests weren't being run, but CI was still completely green. With the changes in this PR, an exception being thrown by self.run() is now caught and the status is reported as error, but if the runner crashes anywhere else for any other reason, the same problem of CI ending up completely green still persists, right?

We don't want to crash the run, i.e. abruptly exit on an error, since we want the run to finish so we can see all test results.

If the runner crashes in a way that it can't handle and report by itself, then that is exactly what needs to happen. Catastrophic failure needs to result in more errors, not no errors.

Let's approach it like this: which piece of code starts the runner? Why doesn't it notice and report when the runner crashes?
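The failure mode described here can be sketched with plain `multiprocessing` (the worker below is an illustrative stand-in, not twister's actual runner code): an uncaught exception kills the child with a nonzero exit code, and unless the parent inspects `Process.exitcode`, the crash vanishes.

```python
import multiprocessing


def worker(should_crash):
    # An uncaught exception here makes the child process exit with a
    # nonzero code, but the parent only notices if it checks exitcode.
    if should_crash:
        raise RuntimeError("worker crashed")


p = multiprocessing.Process(target=worker, args=(True,))
p.start()
p.join()

# Without this check, the crash is silently swallowed by the parent.
if p.exitcode != 0:
    print(f"worker died with exit code {p.exitcode}")
```

`join()` alone does not propagate the child's failure; it only waits for it to finish, which is why a crashed runner can leave the top-level process (and CI) green.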

PerMac commented 3 weeks ago

I believe I finally understood what you mean. I saw the original error in our CI, but since there are several workers (I think this is what you call the runner?) running, it didn't look like anything was missing besides those unit tests. In the logs I saw errors (not reported), but it didn't look like anything else was skipped. However, the scope was huge and varied between runs, so it was impossible to tell whether something was not executed. I was able to reproduce what you're talking about locally, calling

scripts/twister --verbose -T tests/unit/ -p unit_testing -p qemu_x86 -T samples/hello_world/ -j 1 --shuffle-tests --subset 1/1

After the worker failed, nothing more was executed, the summary was printed without reporting errors (although passed was also 0, but who looks at that), and twister exited with 0. I think I was able to resolve the issue by wrapping this line in a try/except and reporting the instance status: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/runner.py#L1325 IMO such a fix can be yet another PR. I think I will also have to add a new log file, e.g. "pipeline.log", and dump the error there, and also pass errors from this file to the report (which I also asked for from this PR). @arbrauns does this sound aligned with your issue?
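The "wrap in try/except and report the instance status" idea might look roughly like this sketch. `Instance`, `process`, and `run_one` are hypothetical stand-ins for twister's test instance and `pb.process()` call, not its real API:

```python
import traceback


class Instance:
    """Hypothetical stand-in for a twister test instance."""

    def __init__(self):
        self.status = None
        self.reason = None


def process(instance, crash=False):
    # Stand-in for the pipeline step that may raise unexpectedly.
    if crash:
        raise ValueError("unexpected binary path")
    instance.status = "passed"


def run_one(instance, crash):
    try:
        process(instance, crash)
    except Exception:
        # Record the failure on the instance instead of killing the
        # worker, keeping the traceback so it can be dumped to a log.
        instance.status = "error"
        instance.reason = traceback.format_exc()


inst = Instance()
run_one(inst, crash=True)
print(inst.status)  # -> error
```

The run continues past the broken instance, and the traceback survives in `inst.reason` for later reporting.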

arbrauns commented 3 weeks ago

I think what should happen is that twister simply exits with a nonzero exit code. If a worker crashes so hard that it can't even report back its failure, it's fine to kill the CI run, investigate and fix the problem, then retry the CI run. It shouldn't happen often.

PerMac commented 3 weeks ago

@arbrauns What about this? https://github.com/zephyrproject-rtos/zephyr/pull/72945 I tested it on main with the unit_testing fix reverted, and instead of silently crashing I got twister going over every test, reporting a pipeline error, and dumping an error traceback into the results report. If you also want to see the error in the online logs, you can run twister with -v --inline-logs. This is the standard for all other errors/failures. I think it is better if we can handle such errors without crashing.

When I also added #72893 into the mix, I was no longer getting pipeline errors, since those unit_testing errors were already handled.

nashif commented 3 weeks ago

@arbrauns What about this? #72945

Well, those are not pipeline errors; I am not sure what that means exactly. As the bug fix for the binary shows, it was an issue in twister itself trying to run the wrong binary, something we should abort on, stopping execution ASAP. So I agree with @arbrauns completely, and the latest revision in this PR does just that: stop twister, and do not try to push this into the logs and pretend all is good.

arbrauns commented 3 weeks ago

@PerMac

This is the standard for all other errors/failures. I think it is better if we can handle such errors without crashing.

Of course it's better if it's handled gracefully, but the core problem here is that errors which aren't handled gracefully (of which there are infinitely many; you can't possibly cover all error scenarios) simply vanish without a trace. #72945 is yet another bandaid that gracefully handles one error case, but it doesn't fix the root problem.

@nashif after looking into it deeper, I think what's actually missing is a check for Process.exitcode in TwisterRunner.execute after the .join(). That's where the exit code of the runner is being swallowed.
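The missing check described above might be sketched like this, assuming a TwisterRunner-like parent that starts several workers and joins them (the worker body and names are illustrative, not twister's actual code):

```python
import multiprocessing


def worker(n):
    # Simulate one worker crashing before it can report anything back.
    if n == 1:
        raise RuntimeError("hard crash in worker")


procs = [multiprocessing.Process(target=worker, args=(n,)) for n in range(3)]
for p in procs:
    p.start()
for p in procs:
    p.join()

# The missing piece: after join(), inspect each worker's exitcode and
# fail the whole run if any of them died.
bad = [p.exitcode for p in procs if p.exitcode != 0]
if bad:
    print(f"{len(bad)} worker(s) crashed; failing the run")
    # in twister this is where a nonzero exit / raised error would go
```

With this check, a worker crash surfaces as a failed run instead of being absorbed by `join()`.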

nashif commented 3 weeks ago

@nashif after looking into it deeper, I think what's actually missing is a check for Process.exitcode in TwisterRunner.execute after the .join(). That's where the exit code of the runner is being swallowed.

yes, but this will be very difficult to handle when dealing with multiprocessing. I could check for this and exit, but it would also be nice to capture the backtrace and error, not only react to an exit code.
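One common pattern for getting the backtrace as well as the exit code is to have the worker push its formatted traceback onto a queue before re-raising. This is a generic sketch under that assumption, not twister's implementation:

```python
import multiprocessing
import traceback


def worker(err_queue):
    try:
        raise RuntimeError("boom in worker")  # stand-in for a runner bug
    except Exception:
        # Ship the formatted traceback to the parent before dying, so
        # the failure carries more context than a bare nonzero exitcode.
        err_queue.put(traceback.format_exc())
        raise


q = multiprocessing.Queue()
p = multiprocessing.Process(target=worker, args=(q,))
p.start()
p.join()

# Nonzero exitcode signals the crash; the queue carries the details.
tb = q.get(timeout=5) if p.exitcode != 0 else None
if tb:
    print("worker traceback:\n" + tb)
```

The parent still reacts to the exit code, but can also log the full traceback instead of only reporting "a worker died".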

PerMac commented 3 weeks ago

@arbrauns @nashif

https://github.com/zephyrproject-rtos/zephyr/pull/72945 is yet another bandaid that gracefully handles one error case, but it doesn't fix the root problem.

It is not a bandaid for one error case, but for the whole set of exceptions you want to catch if you don't want to be specific: https://docs.python.org/3/library/exceptions.html#exception-hierarchy And from PEP 8: "A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use except Exception: (bare except is equivalent to except BaseException:)." https://peps.python.org/pep-0008/#programming-recommendations:~:text=If%20you%20want%20to%20catch%20all%20exceptions%20that%20signal%20program%20errors%2C%20use%20except%20Exception%3A
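The PEP 8 distinction quoted above can be demonstrated directly: `except Exception` catches the whole family of program errors, while `SystemExit` (a `BaseException`, not an `Exception`) propagates past it:

```python
def risky(exc):
    # Helper that raises whatever exception instance it is given.
    raise exc


caught = []

# except Exception catches ordinary program errors of any type.
for exc in (ValueError("bad value"), KeyError("missing")):
    try:
        risky(exc)
    except Exception as e:
        caught.append(type(e).__name__)

# SystemExit derives from BaseException, so "except Exception" does not
# match it; only a bare except (or an explicit handler) would.
try:
    risky(SystemExit(2))
except Exception:
    caught.append("swallowed")  # never reached
except SystemExit:
    caught.append("propagated")

print(caught)  # ['ValueError', 'KeyError', 'propagated']
```

So a single `except Exception` handler does cover the full hierarchy of error exceptions without disguising deliberate interpreter exits.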

The solution from this PR uses the exact same mechanism at the same line, i.e. wrapping pb.process from L1325 with try/except Exception. My PR detects the same set of errors and marks the tests causing these errors accordingly, allowing the run to go forward. This PR instead adds a result to the results_queue and adds a new mechanism of scanning the results_queue for errors and exiting when one is found. @nashif what is the benefit of doing this extra queuing and scanning, instead of just making twister exit with an error as part of the except Exception? Also, something is amiss with the current PR if the idea was to exit on the first error: now twister will check for the existence of errors only after the whole set is executed. E.g. try

git revert 7d3dc4e7e377746ce752390082e634120b7edb8e
scripts/twister -v -T tests/unit -p unit_testing

It will go over all 62 configurations, erroring on each, and only then error-exit (without any summary). My PR sets the results to errors for all those cases, puts the corresponding error traces in the reports (one can see the full error in the twister logs with --inline-logs, which is the default workflow), and exits gracefully with a summary and exit code==1 (due to errors in the results).

nashif commented 3 weeks ago

My PR detects the same set of errors and marks tests causing these errors accordingly, allowing to go forward.

The main problem here is that the "errors" or failures are not test related but issues in the runner, and a test should not generate an exception in the Python code of the runner. Just counting those as test errors because of wrong logic or something gone wrong in twister is not a good idea; it should just quit, and a fix needs to be provided.

PerMac commented 3 weeks ago

@nashif I am fine with such an approach; I just proposed a way to see what the problem was and still be able to execute the whole scope. My +/-1 refers to the fact that your current implementation checks the queue only after every test in the scope is evaluated. So instead of quitting on the first pipeline error, one potentially has to wait until the end just to see twister crash.
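The queue-timing objection above can be sketched generically (the statuses and queue here are illustrative, not twister's `results_queue` handling): checking each result as it is consumed aborts on the first pipeline error, whereas scanning only after the whole scope runs delays the abort until the end.

```python
import queue

results = queue.Queue()
# Simulated run results: the second configuration hits a pipeline error.
for status in ("passed", "error", "passed"):
    results.put(status)

# Fail-fast variant: inspect each result as it arrives instead of
# scanning the queue only after the whole scope has executed.
processed = []
while not results.empty():
    status = results.get()
    processed.append(status)
    if status == "error":
        break  # quit on the first pipeline error

print(processed)  # ['passed', 'error']
```

With the check inside the loop, the third (still pending) configuration is never reached once an error appears.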

nashif commented 2 weeks ago

@LukaszMrugala @KamilxPaszkiet can you help with the failing test please?

nashif commented 1 week ago

@gchwier @golowanow @gopiotr @KamilxPaszkiet @LukaszMrugala @PerMac @hakehuang @arbrauns

please review....