weka / wekachecker

Validates hosts are ready to run Weka
GNU General Public License v3.0

Cst jack statistical outlier #127

Closed jackchallen closed 1 month ago

jackchallen commented 1 month ago
Weka stats expose some useful values, and it's sensible to scan
them for enormous outliers. There are a few methods for detecting
outliers, but no single measure works consistently well across
all of our statistics.
I've looked at (and tested) the interquartile range and the
Jarque-Bera test, but both perform horribly with, say,
PUMPS_TXQ_FULL, producing values anywhere from 100 to 1e7 across
a few different clusters, and there's no sane way to assess that.
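
For reference, an interquartile-range check of the kind mentioned above
could look roughly like the sketch below. This is not the code that was
tested for this PR; `iqr_outliers` and its `k` parameter are hypothetical
names, and it uses the conventional Tukey fences with `k = 1.5`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; needs at least two values.
    """
    # statistics.quantiles sorts its input and returns the three quartile cuts.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

One known weakness of this approach on counters that are zero on most
backends: the IQR collapses to 0, so every nonzero value is flagged
regardless of its size.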

I don't think there's going to be a single correct answer that
covers all of our statistics, so for now I've just gone with a
multiplier of the standard deviation. In the case of
PUMPS_TXQ_FULL, a value more than 10 standard deviations
(computed over all other backends) away from the rest is very
likely an outlier.

This will probably need tuning.
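
A minimal sketch of the standard-deviation multiplier described above,
assuming one numeric value per backend for a given statistic. The function
name, the dict input, the leave-one-out comparison, and the handling of a
zero standard deviation are all assumptions made for illustration, not
necessarily what this PR implements.

```python
import statistics


def stddev_outliers(per_backend, multiplier=10.0):
    """Flag backends whose value is far from all *other* backends.

    per_backend: dict mapping backend name -> value of one statistic
                 (hypothetical input shape).
    multiplier:  how many standard deviations count as an outlier.
    Returns a dict of backend name -> deviation score for flagged backends.
    """
    flagged = {}
    for name, value in per_backend.items():
        # Compare each backend against everyone else, so that a huge value
        # does not inflate the very spread it is being judged against.
        others = [v for n, v in per_backend.items() if n != name]
        if len(others) < 2:
            continue  # not enough data to estimate a spread
        mean = statistics.fmean(others)
        spread = statistics.pstdev(others)
        if spread == 0:
            # Assumption: if all other backends agree exactly, any
            # different value is treated as an outlier.
            if value != mean:
                flagged[name] = float("inf")
            continue
        score = abs(value - mean) / spread
        if score > multiplier:
            flagged[name] = score
    return flagged
```

With this shape, tuning amounts to choosing `multiplier` per statistic,
for example a smaller value for statistics that are normally tightly
clustered.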