xtuml / munin

Unhappy jobs cannot fail, only alarm. #222

Closed by cortlandstarrett 3 months ago

cortlandstarrett commented 3 months ago

When an unhappy job fails, it always reports 'svdc_job_alarm'. This is not correct: an unhappy job can fail in all of the same ways that a happy job can. An alarm is a special case, raised only when a critical event has been seen.

We need to alarm only when a critical event was seen and allow for 'svdc_job_failed' in the "normal" failure cases.
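As a rough illustration of the intended rule (not the actual munin code), the decision could be sketched as below. Only the report names 'svdc_job_alarm' and 'svdc_job_failed' come from this issue; the `JobOutcome` type, its fields, and `failure_report` are hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobOutcome:
    """Hypothetical summary of a finished job (names assumed for illustration)."""
    failed: bool                # the job did not complete successfully
    critical_event_seen: bool   # a critical event was observed during the job


def failure_report(outcome: JobOutcome) -> Optional[str]:
    """Pick the report for a job, alarming only when a critical event was seen."""
    if not outcome.failed:
        return None  # nothing to report for a successful job
    if outcome.critical_event_seen:
        return "svdc_job_alarm"   # special case: critical event seen
    return "svdc_job_failed"      # "normal" failure, same as for a happy job
```

With this rule, an unhappy job that fails without a critical event reports 'svdc_job_failed' rather than always reporting 'svdc_job_alarm'.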

cortlandstarrett commented 3 months ago

This has remained undetected partly because the only error-injection testing done on Critical jobs so far has been the injection of unhappy events in the context of critical events.

cortlandstarrett commented 3 months ago

Resolved with PR #230