Open cortlandstarrett opened 1 week ago
This implies that a transition from JobFailed to HorriblyWrong is needed, so that a failed job can be upgraded to an alarm when a critical event arrives after the failure.
In 1.4.0, the timer that ends an unhappy job changed from the JobHanging timer to the InvariantLoad timer.
In the future, a separate, purpose-specific timer should be added.
At present, an unhappy job finishes only when the hanging job timer expires. The rationale is to allow sufficient time for a critical event to arrive and trigger an alarm condition. This makes sense, but it comes at a cost.
There are a few problems with this:
It might be good to add a configuration value for this timer. Another option is to use the intra-event timer, which can be very short.
One idea is to let the unhappy job finish quickly but still detect critical events in the Job Gone Horribly Wrong state, which is entered when a "stray event" from a previous job arrives.
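
To make the proposal concrete, here is a rough Python sketch of the idea, not the actual state model. The class, method, and event-type names are all made up for illustration; only the JobFailed and HorriblyWrong state names come from this issue. It assumes that any event arriving after failure counts as a stray event, and that only a critical one raises the alarm.

```python
from enum import Enum, auto


class JobState(Enum):
    IN_PROGRESS = auto()
    JOB_FAILED = auto()
    HORRIBLY_WRONG = auto()


class Job:
    """Hypothetical job model; names are illustrative only."""

    def __init__(self, critical_event_types):
        self.state = JobState.IN_PROGRESS
        self.critical_event_types = frozenset(critical_event_types)
        self.alarm_raised = False

    def fail(self):
        # Finish the unhappy job immediately instead of waiting on a timer.
        if self.state is JobState.IN_PROGRESS:
            self.state = JobState.JOB_FAILED

    def receive_stray_event(self, event_type):
        # Events for a job that is still in progress are not modelled here.
        if self.state is JobState.IN_PROGRESS:
            return
        # Assumed: a stray event moves JobFailed to HorriblyWrong
        # (and leaves a job that is already HorriblyWrong where it is).
        self.state = JobState.HORRIBLY_WRONG
        # Assumed: only a critical event upgrades the failure to an alarm.
        if event_type in self.critical_event_types:
            self.alarm_raised = True


job = Job(critical_event_types={"CriticalEvent"})
job.fail()
job.receive_stray_event("CriticalEvent")
```

With this shape, the job fails without any wait, and the alarm decision moves to the point where the late event actually shows up.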