rolfbjarne opened this issue 6 years ago
Let's start with the goals and a bit of context.
Goals
Today we have a system that relies on colours. That's fine because it's instinctive.
However, for many (good, bad and ugly) reasons, the majority of wrench builds end up being orange (it's better for PRs on Jenkins). The biggest reason for the orangeness is random, known issues. Even low-frequency random issues happen frequently when we have more than 100k gates that can turn a build orange.
Core team members can identify safe builds (the majority) because, as a policy, we investigate build failures and file issues on them (for tracking purposes). This is time-consuming (for repeat offenders) and does not help everyone quickly identify a good build.
The goals are ambitious (as much as we want accuracy, anyway), but I think we can start small and expand as needed, i.e. if things get too complex then we need to question (and invest in) the tests.
Every additional green build frees up some time to fix something else (instead of reviewing logs). So if we can cheaply solve 2 out of 3 cases then our tree will be largely green, and that would solve 90% of the problem (and 99% of the complaints).
A good example is tonight's https://github.com/xamarin/xamarin-macios/pull/3918
apitest/Mac Unified XM45 32-bit: TimedOut (Execution timed out after 1200 seconds.)
That, if it's a known issue [1], could easily be ignored [2], or at least link to the suspected known issue (but that goes back to #3909).
[1] Right now it's a hard problem because we don't have a common/unique way to identify them. Luckily it's more a human problem than a technical one.
[2] Maybe it should not be ignored; it's a bit of a general message.
There are some proposals that are really great, e.g.
xharness should support rerunning a test as the result of finding a known issue.
but the amount of work to get there seems a lot higher than ignoring known issues.
Also, should it be xharness? Or something else that filters the results? The latter would mean the logic could exist outside the repo (and branches), which has both pros and cons. We already have the data out of the repo...
Finally, how can the TARDIS be lost in time if you know the year it's lost? Yet another example of why regexes don't make sense ;-)
We have frequent test failures in our CI, both random and other types. This takes a significant amount of manpower to diagnose, so automating it somehow would be quite beneficial.
On the other hand, it's often not easy to determine if a failure is already a known issue or not, since failures come in all shapes and sizes. On the more extreme end it could end up becoming some sort of AI project...
Ground rules
AI rules
This is a list of rules about how to match test failures with known issues.
Build failures
NUnit test failures
Examples:
Test execution problems
Examples:
Test crashes
Examples:
Other failures
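The rule categories above could be driven by a very simple matcher to start with. A minimal sketch, assuming a rule is just a regex over the failure line plus a link to the tracking issue (the rule list, tracking URL and helper names below are illustrative, not an agreed design):

```python
import re

# Hypothetical known-issue table: (pattern over the failure line, tracking issue).
# The entry below mirrors the timeout example from this thread; it is
# illustrative only, not a real mapping.
KNOWN_ISSUES = [
    (re.compile(r"apitest/Mac Unified XM45 32-bit: TimedOut"),
     "https://github.com/xamarin/xamarin-macios/issues/3909"),
]

def match_known_issue(failure_line):
    """Return the tracking-issue URL if the failure matches a known issue, else None."""
    for pattern, url in KNOWN_ISSUES:
        if pattern.search(failure_line):
            return url
    return None
```

A filter like this could run either inside xharness or as a post-processing step over the results; the matching logic itself is the same either way.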
Data format
An XML file, with a fairly simple syntax to process.
To ease writing data files, it should be possible to execute/validate them against an existing HTML report.
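As a strawman, the data file could look something like the snippet below, with a small tool that loads it and checks each failure from an existing report against the rules. All element and attribute names here are invented for illustration; the schema is exactly what this issue is meant to decide:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data-file format -- element and attribute names are
# placeholders, not a settled schema.
DATA = """
<knownIssues>
  <issue url="https://github.com/xamarin/xamarin-macios/issues/3909">
    <match test="apitest/Mac Unified XM45 32-bit" message="TimedOut" />
  </issue>
</knownIssues>
"""

def load_issues(xml_text):
    """Parse the data file into (test name, message regex, tracking URL) tuples."""
    root = ET.fromstring(xml_text)
    issues = []
    for issue in root.findall("issue"):
        for m in issue.findall("match"):
            issues.append((m.get("test"),
                           re.compile(m.get("message")),
                           issue.get("url")))
    return issues

def validate(issues, report_failures):
    """Split an existing report's (test, message) failures into matched/unmatched."""
    matched, unmatched = [], []
    for test, message in report_failures:
        hit = next((url for t, rx, url in issues
                    if t == test and rx.search(message)), None)
        (matched if hit else unmatched).append((test, message, hit))
    return matched, unmatched
```

Running `validate` over the failures scraped from an existing HTML report would show which entries in the data file actually fire and which failures remain unexplained, which is the "execute/validate against a report" workflow described above.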