zalando / rds-health

discover anomalies, performance issues and optimization within AWS RDS
MIT License

Use SUCCESS RATE as the criteria for health assessment #7

Open fogfish opened 8 months ago

fogfish commented 8 months ago

As a user, I want to reduce the number of false-positive reports so that my workflow is not interrupted by noise.

For example, the rule engine uses only absolute values to decide between success and failure.

Should(rules.OsCpuUtil.Below(40.0, 60.0))

As a consequence, even if a single sampled value is above the threshold, the utility reports an error. This causes a few false positives. Using the percentage of successful samples as the criterion would be helpful. In the example below, it would be nice to claim failure if the success rate is over 60%.

STATUS       %            MIN            AVG            MAX  ID CHECK
FAILED  32.14%           0.03          13.33         250.61  D3: storage i/o latency
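
The success-rate idea described above can be sketched in Go as follows. This is only an illustration under stated assumptions: `successRate`, `healthy` and `minRate` are hypothetical names that are not part of rds-health, the sample values are made up, and the sketch assumes the intent is to pass a check when at least a given share of samples stays below the threshold.

```go
package main

import "fmt"

// successRate returns the fraction of samples strictly below the threshold.
// Hypothetical helper: it returns 0 for an empty series so that missing
// data is never counted as healthy.
func successRate(samples []float64, threshold float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	ok := 0
	for _, v := range samples {
		if v < threshold {
			ok++
		}
	}
	return float64(ok) / float64(len(samples))
}

// healthy reports whether at least minRate (0.0 to 1.0) of the samples
// stayed below the threshold. Assumed semantics, not the rds-health API.
func healthy(samples []float64, threshold, minRate float64) bool {
	return successRate(samples, threshold) >= minRate
}

func main() {
	// Made-up CPU utilisation samples (%) containing one short spike.
	cpu := []float64{12.5, 18.0, 22.4, 95.0, 15.1}

	rate := successRate(cpu, 60.0)
	fmt.Printf("success rate: %.2f%%, healthy: %v\n", rate*100, healthy(cpu, 60.0, 0.6))
}
```

With these samples, the single spike to 95.0 would fail a purely absolute check, while a 60% success-rate criterion still passes because four of the five samples are below the threshold.
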

fogfish commented 7 months ago

The success rate is calculated as a percentile of the tAvg value, which is what actually controls the status. Instead of adding an extra config parameter, we should find better ways of educating users about configuration. Visualising raw metrics would be better.