rolfbjarne opened this issue 6 years ago
Let's start with the goals and a bit of context.
Goals
Today we have a system that relies on colours. That's fine because it's instinctive.
However, for many (good, bad and ugly) reasons, the majority of wrench builds end up being orange (it's better for PRs on Jenkins). The biggest reason for the orangeness is random, known issues. Even low-frequency random issues happen frequently when we have more than 100k gates that can turn a build orange.
Core team members can identify safe builds (the majority) because, as a policy, we investigate build failures and file issues on them (for tracking purposes). This is time-consuming (for repeat offenders) and does not help everyone quickly identify a good build.
The goals are ambitious (as much as we want accuracy, anyway), but I think we can start small and expand as needed, i.e. if things get too complex then we need to question (and invest in) the tests.
Every additional green build frees up some time to fix something else (instead of reviewing logs). So if we can cheaply solve 2 out of 3 cases then our tree will be largely green, and that would solve 90% of the problem (and 99% of the complaints).
A good example is tonight's https://github.com/xamarin/xamarin-macios/pull/3918
apitest/Mac Unified XM45 32-bit: TimedOut (Execution timed out after 1200 seconds.)
That, if it's a known issue [1], could easily be ignored [2], or at least link to the suspected known issue (but that goes back to #3909).
[1] Right now it's a hard problem because we don't have a common/unique way to identify them. Luckily it's more a human problem than a technical one.
[2] Maybe it should not be ignored; it's a bit of a general message.
There are some proposals that are really great, e.g.
xharness should support rerunning a test as the result of finding a known issue.
but the amount of work to get there seems a lot higher than ignoring known issues.
Also, should it be xharness? Or something else that filters the results? The latter would mean the logic could exist outside the repo (and branches), which has both pros and cons. We already have the data out of the repo...
Finally, how can the TARDIS be lost in time if you know the year it's lost? Yet another example of why regexes don't make sense ;-)
We have frequent test failures in our CI, both random and other types. This takes a significant amount of manpower to diagnose, so automating it somehow would be quite beneficial.
On the other hand, it's often not easy to determine if a failure is already a known issue or not, since failures come in all shapes and sizes. On the more extreme end it could end up becoming some sort of AI project...
Ground rules
AI rules
This is a list of rules about how to match test failures with known issues.
Build failures
NUnit test failures
Examples:
Test execution problems
Examples:
Test crashes
Examples:
Other failures
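The rule categories above could be driven by a very simple matcher to start with. A minimal sketch, assuming a rule is just a regex over the failure line plus a link to the tracking issue (the rule list, tracking URL and helper names below are illustrative, not an agreed design):

```python
import re

# Hypothetical known-issue table: (pattern over the failure line, tracking issue).
# The entry below mirrors the timeout example from this thread; it is
# illustrative only, not a real mapping.
KNOWN_ISSUES = [
    (re.compile(r"apitest/Mac Unified XM45 32-bit: TimedOut"),
     "https://github.com/xamarin/xamarin-macios/issues/3909"),
]

def match_known_issue(failure_line):
    """Return the tracking-issue URL if the failure matches a known issue, else None."""
    for pattern, url in KNOWN_ISSUES:
        if pattern.search(failure_line):
            return url
    return None
```

A filter like this could run either inside xharness or as a post-processing step over the results; the matching logic itself is the same either way.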
Data format
An XML file, with a fairly simple syntax to process.
To ease writing data files, it should be possible to execute/validate them against an existing HTML report.
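As a strawman, the data file could look something like the snippet below, with a small tool that loads it and checks each failure from an existing report against the rules. All element and attribute names here are invented for illustration; the schema is exactly what this issue is meant to decide:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data-file format -- element and attribute names are
# placeholders, not a settled schema.
DATA = """
<knownIssues>
  <issue url="https://github.com/xamarin/xamarin-macios/issues/3909">
    <match test="apitest/Mac Unified XM45 32-bit" message="TimedOut" />
  </issue>
</knownIssues>
"""

def load_issues(xml_text):
    """Parse the data file into (test name, message regex, tracking URL) tuples."""
    root = ET.fromstring(xml_text)
    issues = []
    for issue in root.findall("issue"):
        for m in issue.findall("match"):
            issues.append((m.get("test"),
                           re.compile(m.get("message")),
                           issue.get("url")))
    return issues

def validate(issues, report_failures):
    """Split an existing report's (test, message) failures into matched/unmatched."""
    matched, unmatched = [], []
    for test, message in report_failures:
        hit = next((url for t, rx, url in issues
                    if t == test and rx.search(message)), None)
        (matched if hit else unmatched).append((test, message, hit))
    return matched, unmatched
```

Running `validate` over the failures scraped from an existing HTML report would show which entries in the data file actually fire and which failures remain unexplained, which is the "execute/validate against a report" workflow described above.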